Comparing Variances: Levene's Test, Bartlett's Test, and the F-Test
How to test whether two or more groups have equal variances. Covers Levene's test, Bartlett's test, Brown-Forsythe, and when each is appropriate.
Quick Hits
- Levene's test is your default: it's robust to non-normality
- Brown-Forsythe (Levene's with median) is even more robust
- Bartlett's test requires normality and is rarely the best choice
- The classic F-test for variances is extremely sensitive to non-normality, so avoid it
TL;DR
To compare variances across groups, use Levene's test (with median centering, called Brown-Forsythe) as your default—it's robust to non-normality. Bartlett's test is more powerful but requires normality. The classic F-test for comparing two variances is extremely sensitive to non-normality and should be avoided. For most analysts, variance testing matters for quality control or understanding treatment effects on spread, not as a preliminary step before t-tests.
When to Compare Variances
Legitimate Uses
- Quality control: Has process variability increased after a change?
- Measurement precision: Is one instrument more consistent than another?
- Treatment effects on spread: Does treatment affect variability, not just average?
- Understanding distributions: As exploratory analysis
Avoid Using For
Preliminary test before t-tests: Just use Welch's t-test directly. The two-stage procedure (test variance → choose t-test) has worse statistical properties than always using Welch's.
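A minimal sketch of the direct route, assuming two independent samples held in NumPy arrays. The data are simulated purely for illustration and the variable names are hypothetical; `scipy.stats.ttest_ind` with `equal_var=False` runs Welch's t-test, so no variance pre-test is involved.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Illustrative data only: same mean, different spread and sample size
control = rng.normal(loc=10.0, scale=1.0, size=40)
treatment = rng.normal(loc=10.0, scale=3.0, size=25)

# equal_var=False requests Welch's t-test (no equal-variance assumption)
welch_stat, welch_p = stats.ttest_ind(control, treatment, equal_var=False)
print(f"Welch's t = {welch_stat:.3f}, p = {welch_p:.4f}")
```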
Method 1: Levene's Test (with Mean)
Tests whether group variances are equal by comparing mean absolute deviations from group means.
How It Works
- Calculate $z_{ij} = |x_{ij} - \bar{x}_i|$ (absolute deviation from group mean)
- Perform ANOVA on the $z_{ij}$ values
from scipy import stats
import numpy as np

def levene_test(group1, group2, center='mean'):
    """
    Levene's test for equality of variances.
    center: 'mean' (original Levene's) or 'median' (Brown-Forsythe)
    """
    stat, p_value = stats.levene(group1, group2, center=center)
    return {
        'statistic': stat,
        'p_value': p_value,
        'var_group1': np.var(group1, ddof=1),
        'var_group2': np.var(group2, ddof=1),
        'ratio': np.var(group1, ddof=1) / np.var(group2, ddof=1)
    }

# Example
np.random.seed(42)
group1 = np.random.normal(0, 1, 100)  # SD = 1
group2 = np.random.normal(0, 2, 100)  # SD = 2
result = levene_test(group1, group2, center='mean')
print(f"Variance ratio: {result['ratio']:.2f}")
print(f"Levene's test p-value: {result['p_value']:.4f}")
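To make the "ANOVA on absolute deviations" description concrete, here is a small sketch that carries out the two steps by hand and compares the result with `scipy.stats.levene`. The helper name `levene_by_hand` and the simulated data are assumptions for illustration, not part of any library.

```python
import numpy as np
from scipy import stats

def levene_by_hand(*groups, center=np.mean):
    """One-way ANOVA on absolute deviations from each group's center."""
    # Step 1: z_ij = |x_ij - center_i| within each group
    z = [np.abs(np.asarray(g, dtype=float) - center(g)) for g in groups]
    # Step 2: ordinary one-way ANOVA F-test on the z values
    return stats.f_oneway(*z)

rng = np.random.default_rng(1)
a = rng.normal(0, 1, 60)
b = rng.normal(0, 2, 60)

manual_stat, manual_p = levene_by_hand(a, b)
scipy_stat, scipy_p = stats.levene(a, b, center='mean')
print(f"by hand: W = {manual_stat:.4f}, p = {manual_p:.4g}")
print(f"scipy:   W = {scipy_stat:.4f}, p = {scipy_p:.4g}")
```

Passing `center=np.median` to the helper gives the median-centered variant instead.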
R Implementation
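# Levene's test in R
# car::leveneTest centers on the median by default (i.e. Brown-Forsythe);
# pass center = mean for the original Levene's test
library(car)
leveneTest(value ~ group, data = df, center = mean)
# Base R has no Levene's test; var.test() is the classical F-test (not recommended)
var.test(group1, group2)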
Method 2: Brown-Forsythe Test
Levene's test using median instead of mean. More robust to skewed distributions.
Implementation
def brown_forsythe_test(group1, group2):
    """
    Brown-Forsythe test (Levene's with median).
    More robust than original Levene's.
    """
    return levene_test(group1, group2, center='median')

# Compare Levene's and Brown-Forsythe with skewed data
np.random.seed(42)
group1_skew = np.random.exponential(1, 100)
group2_skew = np.random.exponential(2, 100)
levene_result = levene_test(group1_skew, group2_skew, center='mean')
bf_result = brown_forsythe_test(group1_skew, group2_skew)
print("With skewed data:")
print(f"Levene's (mean) p-value: {levene_result['p_value']:.4f}")
print(f"Brown-Forsythe (median) p-value: {bf_result['p_value']:.4f}")
When to Prefer Brown-Forsythe
- Non-normal data
- Skewed distributions
- Heavy tails
- As the conservative default choice
Method 3: Bartlett's Test
More powerful than Levene's when data is normally distributed, but sensitive to non-normality.
Implementation
def bartlett_test(group1, group2):
    """
    Bartlett's test for equality of variances.
    Assumes normality.
    """
    stat, p_value = stats.bartlett(group1, group2)
    return {
        'statistic': stat,
        'p_value': p_value,
        'var_group1': np.var(group1, ddof=1),
        'var_group2': np.var(group2, ddof=1)
    }

# With normal data
np.random.seed(42)
group1_normal = np.random.normal(0, 1, 100)
group2_normal = np.random.normal(0, 2, 100)
bartlett_result = bartlett_test(group1_normal, group2_normal)
levene_result = levene_test(group1_normal, group2_normal)
print("With normal data:")
print(f"Bartlett's p-value: {bartlett_result['p_value']:.4f}")
print(f"Levene's p-value: {levene_result['p_value']:.4f}")
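A single pair of samples cannot show a difference in power, so the sketch below estimates it by simulation under normal data. Everything about the setup (the SD ratio of 1.5, n = 30 per group, 2000 repetitions, and the helper name `rejection_rates`) is an illustrative assumption rather than a recommendation.

```python
import numpy as np
from scipy import stats

def rejection_rates(sd2=1.5, n=30, n_sims=2000, alpha=0.05, seed=0):
    """Share of simulations in which each test rejects equal variances."""
    rng = np.random.default_rng(seed)
    hits = {'bartlett': 0, 'brown_forsythe': 0}
    for _ in range(n_sims):
        g1 = rng.normal(0, 1.0, n)
        g2 = rng.normal(0, sd2, n)  # true SD ratio is sd2, so H0 is false
        if stats.bartlett(g1, g2).pvalue < alpha:
            hits['bartlett'] += 1
        if stats.levene(g1, g2, center='median').pvalue < alpha:
            hits['brown_forsythe'] += 1
    return {name: count / n_sims for name, count in hits.items()}

power = rejection_rates()
print(f"Bartlett power:       {power['bartlett']:.3f}")
print(f"Brown-Forsythe power: {power['brown_forsythe']:.3f}")
```

Both tests see the same simulated samples in each repetition, so the comparison is paired and the difference in rejection rates is less noisy than either rate alone.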
The Non-Normality Problem
# Demonstrate Bartlett's sensitivity to non-normality
np.random.seed(42)
# Two groups with EQUAL population variances but a non-normal distribution
group1 = np.random.exponential(1, 100)
group2 = np.random.exponential(1, 100)  # Same distribution!
bartlett_result = bartlett_test(group1, group2)
levene_result = levene_test(group1, group2, center='median')
sample_ratio = np.var(group1, ddof=1) / np.var(group2, ddof=1)
print("Equal variances, non-normal data:")
print(f"Sample variance ratio (true ratio is 1): {sample_ratio:.2f}")
print(f"Bartlett's p-value: {bartlett_result['p_value']:.4f}")  # May falsely reject!
print(f"Brown-Forsythe p-value: {levene_result['p_value']:.4f}")  # Less prone to false rejection
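When the data aren't normal, Bartlett's test can reject far more often than the nominal rate even though the variances ARE equal, so a small p-value may reflect distribution shape rather than unequal spread. This makes it unreliable in practice.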
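One seeded example can land either way, so a simulation gives a fairer picture of how often each test rejects when the null hypothesis is true. The sketch below assumes n = 50 per group and 2000 repetitions; the helper `type1_rates` and its `sampler` argument are hypothetical names introduced for illustration.

```python
import numpy as np
from scipy import stats

def type1_rates(sampler, n=50, n_sims=2000, alpha=0.05):
    """False-rejection rates when both groups share one distribution."""
    rejections = np.zeros(2)
    for _ in range(n_sims):
        g1, g2 = sampler(n), sampler(n)  # identical distributions: H0 is true
        rejections[0] += stats.bartlett(g1, g2).pvalue < alpha
        rejections[1] += stats.levene(g1, g2, center='median').pvalue < alpha
    return {'bartlett': float(rejections[0] / n_sims),
            'brown_forsythe': float(rejections[1] / n_sims)}

rng = np.random.default_rng(123)
normal_rates = type1_rates(lambda n: rng.normal(0, 1, n))
expo_rates = type1_rates(lambda n: rng.exponential(1, n))
print("normal:     ", normal_rates)
print("exponential:", expo_rates)
```

A well-calibrated test should reject in roughly 5% of repetitions in both rows; rates well above that indicate the test is reacting to distribution shape rather than to a difference in variance.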
Method 4: F-Test (Variance Ratio Test)
The classical test comparing two variances. Avoid this test.
Why to Avoid
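The F-test is extremely sensitive to non-normality, much like Bartlett's test, which is essentially its generalization to more than two groups. Even modest departures from normality, heavy tails in particular, can push the Type I error rate well above the nominal level.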
def f_test_variances(group1, group2):
    """
    Classical F-test for comparing variances.
    WARNING: Very sensitive to non-normality.
    """
    var1 = np.var(group1, ddof=1)
    var2 = np.var(group2, ddof=1)
    # F-statistic (larger variance in numerator)
    if var1 >= var2:
        f_stat = var1 / var2
        df1, df2 = len(group1) - 1, len(group2) - 1
    else:
        f_stat = var2 / var1
        df1, df2 = len(group2) - 1, len(group1) - 1
    # Two-sided p-value: double the smaller tail area
    # (sf is the upper tail, more accurate than 1 - cdf for tiny p-values)
    p_value = 2 * min(stats.f.cdf(f_stat, df1, df2),
                      stats.f.sf(f_stat, df1, df2))
    return {
        'f_statistic': f_stat,
        'p_value': p_value,
        'df1': df1,
        'df2': df2
    }
Simulation of Type I Error
def simulate_ftest_type1(distribution='normal', n=50, n_sims=10000):
    """Simulate Type I error under equal variances."""
    rejections = 0
    for _ in range(n_sims):
        if distribution == 'normal':
            g1 = np.random.normal(0, 1, n)
            g2 = np.random.normal(0, 1, n)
        else:  # exponential
            g1 = np.random.exponential(1, n)
            g2 = np.random.exponential(1, n)
        result = f_test_variances(g1, g2)
        if result['p_value'] < 0.05:
            rejections += 1
    return rejections / n_sims

np.random.seed(42)  # make the simulation reproducible
normal_type1 = simulate_ftest_type1('normal')
exp_type1 = simulate_ftest_type1('exponential')
print(f"F-test Type I error (normal data): {normal_type1:.3f}")  # Should be near 0.05
print(f"F-test Type I error (exponential): {exp_type1:.3f}")  # Will be much higher!
Multiple Groups
Levene's, Brown-Forsythe, and Bartlett's tests all extend to more than two groups (the two-sample F-test does not):
# Levene's with multiple groups
group1 = np.random.normal(0, 1, 50)
group2 = np.random.normal(0, 1.5, 50)
group3 = np.random.normal(0, 2, 50)
stat, p_value = stats.levene(group1, group2, group3, center='median')
print(f"Levene's test (3 groups) p-value: {p_value:.4f}")
# Bartlett's with multiple groups
stat, p_value = stats.bartlett(group1, group2, group3)
print(f"Bartlett's test (3 groups) p-value: {p_value:.4f}")
Decision Guide
| Situation | Recommended Test |
|---|---|
| General use | Brown-Forsythe (Levene's with median) |
| Known normal data | Bartlett's |
| Skewed data | Brown-Forsythe |
| Heavy tails | Brown-Forsythe |
| Multiple groups | Levene's/Brown-Forsythe or Bartlett's |
| Classic textbook | F-test (but don't actually use it) |
Simple Rule
Just use Brown-Forsythe (Levene's with center='median'). It stays robust when the data aren't normal, at the cost of some power relative to Bartlett's when they are.
Effect Size: Variance Ratio
When reporting, include the variance ratio as an effect size:
def variance_comparison_report(group1, group2, alpha=0.05):
    """
    Complete variance comparison with effect size.
    """
    var1 = np.var(group1, ddof=1)
    var2 = np.var(group2, ddof=1)
    # Brown-Forsythe test
    stat, p_value = stats.levene(group1, group2, center='median')
    # Variance ratio, group1 / group2 (same orientation as the CI below)
    ratio = var1 / var2
    # Approximate 95% CI for the ratio, built on the log scale.
    # This standard error assumes normal data; expect it to be too
    # narrow for skewed or heavy-tailed data.
    n1, n2 = len(group1), len(group2)
    log_ratio = np.log(ratio)
    se_log_ratio = np.sqrt(2/(n1-1) + 2/(n2-1))
    ci_ratio = (np.exp(log_ratio - 1.96*se_log_ratio),
                np.exp(log_ratio + 1.96*se_log_ratio))
    return {
        'var_group1': var1,
        'var_group2': var2,
        'variance_ratio': ratio,
        'ci_95_ratio': ci_ratio,
        'bf_p_value': p_value,
        'significant': p_value < alpha
    }

result = variance_comparison_report(group1, group2)
print(f"Variance ratio: {result['variance_ratio']:.2f}")
print(f"95% CI: [{result['ci_95_ratio'][0]:.2f}, {result['ci_95_ratio'][1]:.2f}]")
print(f"Brown-Forsythe p-value: {result['bf_p_value']:.4f}")
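The log-scale interval above leans on a normal-theory standard error. As one possible alternative when the data are clearly non-normal, the sketch below uses a percentile bootstrap instead. This is a swapped-in technique, not the method used in the report function; the helper name, the 5000 resamples, and the simulated data are all assumptions, and percentile intervals for variances can still be unreliable in small samples.

```python
import numpy as np

def bootstrap_variance_ratio_ci(x, y, n_boot=5000, level=0.95, seed=0):
    """Percentile bootstrap CI for var(x) / var(y), resampling each group."""
    rng = np.random.default_rng(seed)
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    ratios = np.empty(n_boot)
    for b in range(n_boot):
        # Resample each group independently, with replacement
        xb = rng.choice(x, size=x.size, replace=True)
        yb = rng.choice(y, size=y.size, replace=True)
        ratios[b] = np.var(xb, ddof=1) / np.var(yb, ddof=1)
    tail = (1 - level) / 2
    lower, upper = np.quantile(ratios, [tail, 1 - tail])
    return float(lower), float(upper)

rng = np.random.default_rng(7)
skewed_a = rng.exponential(1.0, 80)  # illustrative skewed samples
skewed_b = rng.exponential(2.0, 80)
point = np.var(skewed_a, ddof=1) / np.var(skewed_b, ddof=1)
ci_low, ci_high = bootstrap_variance_ratio_ci(skewed_a, skewed_b)
print(f"ratio = {point:.2f}, bootstrap 95% CI = [{ci_low:.2f}, {ci_high:.2f}]")
```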
Related Methods
- Picking the Right Test to Compare Two Groups — Complete decision framework
- Welch's T-Test vs. Student's T-Test — Why variance testing isn't needed for t-tests
- Equal Variance and Welch's T-Test: When It Matters — Deeper dive on the assumption
Key Takeaway
Use Levene's test or Brown-Forsythe to compare variances; both are robust to non-normality. Avoid the classic F-test (variance ratio), which is extremely sensitive to distribution shape. And remember: testing variances as a preliminary step before t-tests is unnecessary if you just use Welch's t-test, which handles both equal and unequal variances correctly.
References
- https://www.jstor.org/stable/2530779
- https://www.jstor.org/stable/2528930
- Levene, H. (1960). Robust tests for equality of variances. In *Contributions to Probability and Statistics: Essays in Honor of Harold Hotelling*, 278-292.
- Brown, M. B., & Forsythe, A. B. (1974). Robust tests for the equality of variances. *Journal of the American Statistical Association*, 69(346), 364-367.
- Box, G. E. P. (1953). Non-normality and tests on variances. *Biometrika*, 40(3/4), 318-335.