Two-Group Comparisons

Comparing Variances: Levene's Test, Bartlett's Test, and the F-Test

A guide to testing whether two or more groups have equal variances. Covers Levene's test, Bartlett's test, the Brown-Forsythe modification, and when each is appropriate.

Quick Hits

  • Levene's test is your default—it's robust to non-normality
  • Brown-Forsythe (Levene's with median) is even more robust
  • Bartlett's test requires normality and is rarely the best choice
  • The classic F-test for variances is extremely sensitive to non-normality—avoid it

TL;DR

To compare variances across groups, use Levene's test (with median centering, called Brown-Forsythe) as your default—it's robust to non-normality. Bartlett's test is more powerful but requires normality. The classic F-test for comparing two variances is extremely sensitive to non-normality and should be avoided. For most analysts, variance testing matters for quality control or understanding treatment effects on spread, not as a preliminary step before t-tests.


When to Compare Variances

Legitimate Uses

  1. Quality control: Has process variability increased after a change?
  2. Measurement precision: Is one instrument more consistent than another?
  3. Treatment effects on spread: Does treatment affect variability, not just average?
  4. Understanding distributions: As exploratory analysis

Avoid Using For

Preliminary test before t-tests: Just use Welch's t-test directly. The two-stage procedure (test variance → choose t-test) has worse statistical properties than always using Welch's.
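In practice that means skipping the variance test entirely. A minimal sketch with illustrative data, using SciPy's `equal_var=False` option:

```python
import numpy as np
from scipy import stats

# Illustrative data: two groups with different spreads
rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, 50)
b = rng.normal(0.5, 2.0, 50)

# Welch's t-test: equal_var=False drops the equal-variance assumption,
# so no preliminary variance test is needed
t_stat, p_value = stats.ttest_ind(a, b, equal_var=False)
print(f"Welch's t = {t_stat:.3f}, p = {p_value:.4f}")
```

With equal variances Welch's behaves almost identically to the pooled t-test, so there is little cost to using it unconditionally.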


Method 1: Levene's Test (with Mean)

Tests whether group variances are equal by performing a one-way ANOVA on each observation's absolute deviation from its group mean.

How It Works

  1. Calculate $z_{ij} = |x_{ij} - \bar{x}_i|$ (absolute deviation from group mean)
  2. Perform ANOVA on the $z_{ij}$ values
Implementation

from scipy import stats
import numpy as np

def levene_test(group1, group2, center='mean'):
    """
    Levene's test for equality of variances.

    center: 'mean' (original Levene's) or 'median' (Brown-Forsythe)
    """
    stat, p_value = stats.levene(group1, group2, center=center)

    return {
        'statistic': stat,
        'p_value': p_value,
        'var_group1': np.var(group1, ddof=1),
        'var_group2': np.var(group2, ddof=1),
        'ratio': np.var(group1, ddof=1) / np.var(group2, ddof=1)
    }


# Example
np.random.seed(42)
group1 = np.random.normal(0, 1, 100)   # SD = 1
group2 = np.random.normal(0, 2, 100)   # SD = 2

result = levene_test(group1, group2, center='mean')
print(f"Variance ratio: {result['ratio']:.2f}")
print(f"Levene's test p-value: {result['p_value']:.4f}")
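The two-step recipe above can be checked by hand: running a plain one-way ANOVA on the absolute deviations reproduces the statistic that `stats.levene` reports.

```python
import numpy as np
from scipy import stats

np.random.seed(42)
g1 = np.random.normal(0, 1, 100)
g2 = np.random.normal(0, 2, 100)

# Step 1: absolute deviations from each group's own mean
z1 = np.abs(g1 - g1.mean())
z2 = np.abs(g2 - g2.mean())

# Step 2: one-way ANOVA on the deviations
f_stat, p_manual = stats.f_oneway(z1, z2)

# Matches scipy's built-in Levene's test with center='mean'
w_stat, p_scipy = stats.levene(g1, g2, center='mean')
print(f"Manual ANOVA F: {f_stat:.4f}, scipy Levene W: {w_stat:.4f}")
```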

R Implementation

# Levene's test in R
library(car)
leveneTest(value ~ group, data = df)

# Or using base R
var.test(group1, group2)  # F-test (not recommended)

Method 2: Brown-Forsythe Test

Levene's test using median instead of mean. More robust to skewed distributions.

Implementation

def brown_forsythe_test(group1, group2):
    """
    Brown-Forsythe test (Levene's with median).
    More robust than original Levene's.
    """
    return levene_test(group1, group2, center='median')


# Compare Levene's and Brown-Forsythe with skewed data
np.random.seed(42)
group1_skew = np.random.exponential(1, 100)
group2_skew = np.random.exponential(2, 100)

levene_result = levene_test(group1_skew, group2_skew, center='mean')
bf_result = brown_forsythe_test(group1_skew, group2_skew)

print("With skewed data:")
print(f"Levene's (mean) p-value: {levene_result['p_value']:.4f}")
print(f"Brown-Forsythe (median) p-value: {bf_result['p_value']:.4f}")

When to Prefer Brown-Forsythe

  • Non-normal data
  • Skewed distributions
  • Heavy tails
  • As the conservative default choice

Method 3: Bartlett's Test

More powerful than Levene's when data is normally distributed, but sensitive to non-normality.

Implementation

def bartlett_test(group1, group2):
    """
    Bartlett's test for equality of variances.
    Assumes normality.
    """
    stat, p_value = stats.bartlett(group1, group2)

    return {
        'statistic': stat,
        'p_value': p_value,
        'var_group1': np.var(group1, ddof=1),
        'var_group2': np.var(group2, ddof=1)
    }


# With normal data
np.random.seed(42)
group1_normal = np.random.normal(0, 1, 100)
group2_normal = np.random.normal(0, 2, 100)

bartlett_result = bartlett_test(group1_normal, group2_normal)
levene_result = levene_test(group1_normal, group2_normal)

print("With normal data:")
print(f"Bartlett's p-value: {bartlett_result['p_value']:.4f}")
print(f"Levene's p-value: {levene_result['p_value']:.4f}")
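The power advantage under normality can be estimated by simulation. The `power_comparison` helper below is an illustrative sketch, not a library function; exact rejection counts vary with the seed and settings.

```python
import numpy as np
from scipy import stats

def power_comparison(sd_ratio=1.5, n=30, n_sims=2000, seed=0):
    """Estimate rejection rates with normal data and truly unequal SDs."""
    rng = np.random.default_rng(seed)
    bartlett_rej = bf_rej = 0
    for _ in range(n_sims):
        g1 = rng.normal(0, 1, n)
        g2 = rng.normal(0, sd_ratio, n)
        if stats.bartlett(g1, g2)[1] < 0.05:
            bartlett_rej += 1
        if stats.levene(g1, g2, center='median')[1] < 0.05:
            bf_rej += 1
    return bartlett_rej / n_sims, bf_rej / n_sims

bartlett_power, bf_power = power_comparison()
print(f"Bartlett power: {bartlett_power:.3f}")
print(f"Brown-Forsythe power: {bf_power:.3f}")
```

With normal data Bartlett's typically rejects more often, which is exactly the power it trades for its normality assumption.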

The Non-Normality Problem

# Demonstrate Bartlett's sensitivity to non-normality
np.random.seed(42)

# Two groups with EQUAL variances but non-normal distribution
group1 = np.random.exponential(1, 100)
group2 = np.random.exponential(1, 100)  # Same distribution!

bartlett_result = bartlett_test(group1, group2)
levene_result = levene_test(group1, group2, center='median')

print("Equal population variances, non-normal data:")
print(f"Observed variance ratio: {np.var(group1, ddof=1)/np.var(group2, ddof=1):.2f}")
print(f"Bartlett's p-value: {bartlett_result['p_value']:.4f}")  # may falsely reject
print(f"Brown-Forsythe p-value: {levene_result['p_value']:.4f}")  # holds its nominal level

Bartlett's test can reject equal variances when variances ARE equal but data isn't normal. This makes it unreliable in practice.
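A single draw could be luck. A repeated simulation (the `type1_rates` helper below is illustrative, mirroring the F-test simulation later in this post) estimates how often each test falsely rejects when variances are truly equal:

```python
import numpy as np
from scipy import stats

def type1_rates(n=100, n_sims=2000, seed=1):
    """Rejection rates when variances are equal but data is exponential."""
    rng = np.random.default_rng(seed)
    bartlett_rej = bf_rej = 0
    for _ in range(n_sims):
        g1 = rng.exponential(1, n)
        g2 = rng.exponential(1, n)  # same distribution, same variance
        if stats.bartlett(g1, g2)[1] < 0.05:
            bartlett_rej += 1
        if stats.levene(g1, g2, center='median')[1] < 0.05:
            bf_rej += 1
    return bartlett_rej / n_sims, bf_rej / n_sims

bartlett_t1, bf_t1 = type1_rates()
print(f"Bartlett Type I error: {bartlett_t1:.3f}")       # inflated well above 0.05
print(f"Brown-Forsythe Type I error: {bf_t1:.3f}")       # close to the nominal 0.05
```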


Method 4: F-Test (Variance Ratio Test)

The classical test comparing two variances. Avoid this test.

Why to Avoid

The F-test is extremely sensitive to non-normality—even more so than Bartlett's. Small departures from normality cause large inflation of Type I error.

def f_test_variances(group1, group2):
    """
    Classical F-test for comparing variances.
    WARNING: Very sensitive to non-normality.
    """
    var1 = np.var(group1, ddof=1)
    var2 = np.var(group2, ddof=1)

    # F-statistic (larger variance in numerator)
    if var1 >= var2:
        f_stat = var1 / var2
        df1, df2 = len(group1) - 1, len(group2) - 1
    else:
        f_stat = var2 / var1
        df1, df2 = len(group2) - 1, len(group1) - 1

    p_value = 2 * min(stats.f.cdf(f_stat, df1, df2),
                      1 - stats.f.cdf(f_stat, df1, df2))

    return {
        'f_statistic': f_stat,
        'p_value': p_value,
        'df1': df1,
        'df2': df2
    }

Simulation of Type I Error

def simulate_ftest_type1(distribution='normal', n=50, n_sims=10000):
    """Simulate Type I error under equal variances."""
    rejections = 0

    for _ in range(n_sims):
        if distribution == 'normal':
            g1 = np.random.normal(0, 1, n)
            g2 = np.random.normal(0, 1, n)
        else:  # exponential
            g1 = np.random.exponential(1, n)
            g2 = np.random.exponential(1, n)

        result = f_test_variances(g1, g2)
        if result['p_value'] < 0.05:
            rejections += 1

    return rejections / n_sims


normal_type1 = simulate_ftest_type1('normal')
exp_type1 = simulate_ftest_type1('exponential')

print(f"F-test Type I error (normal data): {normal_type1:.3f}")
print(f"F-test Type I error (exponential): {exp_type1:.3f}")  # Will be much higher!

Multiple Groups

All these tests extend to more than two groups:

# Levene's with multiple groups
group1 = np.random.normal(0, 1, 50)
group2 = np.random.normal(0, 1.5, 50)
group3 = np.random.normal(0, 2, 50)

stat, p_value = stats.levene(group1, group2, group3, center='median')
print(f"Levene's test (3 groups) p-value: {p_value:.4f}")

# Bartlett's with multiple groups
stat, p_value = stats.bartlett(group1, group2, group3)
print(f"Bartlett's test (3 groups) p-value: {p_value:.4f}")

Decision Guide

Situation                   Recommended Test
---------                   ----------------
General use                 Brown-Forsythe (Levene's with median)
Known normal data           Bartlett's
Skewed data                 Brown-Forsythe
Heavy tails                 Brown-Forsythe
Multiple groups             Levene's/Brown-Forsythe or Bartlett's
Classic textbook example    F-test (but don't actually use it)

Simple Rule

Just use Brown-Forsythe (Levene's with center='median'). It's robust and rarely wrong.


Effect Size: Variance Ratio

When reporting, include the variance ratio as an effect size:

def variance_comparison_report(group1, group2, alpha=0.05):
    """
    Complete variance comparison with effect size.
    """
    var1 = np.var(group1, ddof=1)
    var2 = np.var(group2, ddof=1)

    # Brown-Forsythe test
    stat, p_value = stats.levene(group1, group2, center='median')

    # Variance ratio (group1 / group2), consistent with the CI below
    ratio = var1 / var2

    # Confidence interval for variance ratio (approximate)
    n1, n2 = len(group1), len(group2)
    log_ratio = np.log(var1 / var2)
    se_log_ratio = np.sqrt(2/(n1-1) + 2/(n2-1))
    ci_log = (log_ratio - 1.96*se_log_ratio, log_ratio + 1.96*se_log_ratio)
    ci_ratio = (np.exp(ci_log[0]), np.exp(ci_log[1]))

    return {
        'var_group1': var1,
        'var_group2': var2,
        'variance_ratio': ratio,
        'ci_95_ratio': ci_ratio,
        'bf_p_value': p_value,
        'significant': p_value < alpha
    }


result = variance_comparison_report(group1, group2)
print(f"Variance ratio: {result['variance_ratio']:.2f}")
print(f"95% CI: [{result['ci_95_ratio'][0]:.2f}, {result['ci_95_ratio'][1]:.2f}]")
print(f"Brown-Forsythe p-value: {result['bf_p_value']:.4f}")
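For reference, the approximate interval computed above is the large-sample normal approximation on the log scale, which is reasonable when the data are roughly normal:

```latex
$$\log\left(\frac{s_1^2}{s_2^2}\right) \pm 1.96\sqrt{\frac{2}{n_1 - 1} + \frac{2}{n_2 - 1}}$$
```

Exponentiating the endpoints gives the interval for the variance ratio itself.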


Key Takeaway

Use Levene's test or Brown-Forsythe to compare variances—they're robust to non-normality. Avoid the classic F-test (variance ratio) which is extremely sensitive to distribution shape. And remember: testing variances as a preliminary step before t-tests is unnecessary if you just use Welch's t-test, which handles both equal and unequal variances correctly.


References

  1. Levene, H. (1960). Robust tests for equality of variances. In *Contributions to Probability and Statistics: Essays in Honor of Harold Hotelling*, 278-292.
  2. Brown, M. B., & Forsythe, A. B. (1974). Robust tests for the equality of variances. *Journal of the American Statistical Association*, 69(346), 364-367.
  3. Box, G. E. P. (1953). Non-normality and tests on variances. *Biometrika*, 40(3/4), 318-335.

Frequently Asked Questions

Should I test for equal variances before choosing between t-tests?
No. Just use Welch's t-test always—it handles both equal and unequal variances correctly. Testing variances first adds complexity without improving decisions.

When would I actually want to compare variances?
Quality control (is variability increasing?), comparing measurement precision, testing whether a treatment affects spread not just center, or as an ANOVA assumption check.

Which test should I use?
Levene's test (with median, i.e., Brown-Forsythe) for most cases. Bartlett's only if you're confident data is normally distributed.
