Equal Variance and Welch's T-Test: When It Actually Matters

A deep dive into the equal variance assumption for t-tests and ANOVA. Learn when violations are problematic, how to detect them, and why Welch's correction should be your default.

Quick Hits

  • Unequal variance matters more than non-normality for t-tests and ANOVA
  • The problem is worst when smaller groups have larger variance
  • Welch's t-test performs well whether variances are equal or not
  • Variance ratios above 3 warrant concern; above 4 definitely use Welch

TL;DR

Equal variance (homoscedasticity) matters more than normality for t-tests and ANOVA. When variances differ and sample sizes are unequal, the standard t-test's Type I error rate can be inflated or deflated depending on which group has larger variance. Welch's t-test corrects this and performs almost identically to Student's t-test when variances are equal. Make Welch your default.


Why Equal Variance Matters

The Pooled Variance Problem

The standard t-test pools variance from both groups:

$$s_p^2 = \frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1 + n_2 - 2}$$

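Pooling assumes both groups share a single population variance ($\sigma_1^2 = \sigma_2^2$), so that $s_1^2 \approx s_2^2$ up to sampling noise. When the true variances differ, the pooled estimate describes neither group, and with unequal sample sizes it is pulled toward the larger group's variance, so the standard error of the mean difference is misestimated.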

import numpy as np
from scipy import stats
import pandas as pd

def demonstrate_pooling_problem():
    """
    Show how pooled variance goes wrong with unequal variances.
    """
    np.random.seed(42)

    # Two groups with very different variances
    group1 = np.random.normal(50, 5, 30)   # SD = 5
    group2 = np.random.normal(50, 20, 30)  # SD = 20

    var1, var2 = np.var(group1, ddof=1), np.var(group2, ddof=1)

    # Pooled variance
    n1, n2 = len(group1), len(group2)
    pooled_var = ((n1-1)*var1 + (n2-1)*var2) / (n1 + n2 - 2)

    print("Variance comparison:")
    print(f"  Group 1 variance: {var1:.2f}")
    print(f"  Group 2 variance: {var2:.2f}")
    print(f"  Variance ratio: {max(var1, var2)/min(var1, var2):.2f}")
    print(f"  Pooled variance: {pooled_var:.2f}")
    print()
    print("The pooled variance doesn't represent either group well!")


demonstrate_pooling_problem()

The Critical Interaction: Variance × Sample Size

def variance_samplesize_interaction(n_sims=10000):
    """
    Show how the variance-sample size interaction affects Type I error.
    """
    np.random.seed(42)

    scenarios = {
        'Equal var, equal n': {
            'n': [30, 30], 'sd': [10, 10]
        },
        'Unequal var, equal n': {
            'n': [30, 30], 'sd': [10, 30]
        },
        'Larger var in SMALLER group': {
            'n': [15, 45], 'sd': [30, 10]  # Type I error INFLATES
        },
        'Larger var in LARGER group': {
            'n': [45, 15], 'sd': [30, 10]  # Test becomes CONSERVATIVE
        }
    }

    results = {}
    for name, params in scenarios.items():
        student_reject = 0
        welch_reject = 0

        for _ in range(n_sims):
            # Both groups have same mean (null is true)
            g1 = np.random.normal(50, params['sd'][0], params['n'][0])
            g2 = np.random.normal(50, params['sd'][1], params['n'][1])

            _, p_student = stats.ttest_ind(g1, g2, equal_var=True)
            _, p_welch = stats.ttest_ind(g1, g2, equal_var=False)

            if p_student < 0.05:
                student_reject += 1
            if p_welch < 0.05:
                welch_reject += 1

        results[name] = {
            'Student': student_reject / n_sims,
            'Welch': welch_reject / n_sims,
            'var_ratio': max(params['sd'])**2 / min(params['sd'])**2,
            'n_ratio': max(params['n']) / min(params['n'])
        }

    print("Type I Error Rates (nominal α = 0.05)")
    print("=" * 65)
    print(f"{'Scenario':<35} {'Student':>10} {'Welch':>10} {'Var Ratio':>8}")
    print("-" * 65)
    for name, res in results.items():
        marker = " ⚠️" if abs(res['Student'] - 0.05) > 0.02 else ""
        print(f"{name:<35} {res['Student']:>10.3f} {res['Welch']:>10.3f} "
              f"{res['var_ratio']:>8.1f}{marker}")

    return results


variance_samplesize_interaction()
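
The direction of the bias can also be read off without simulation. The following is a minimal sketch, assuming the population SDs are known: it compares the standard error that pooling implies (using the expected pooled variance) with the true standard error of the mean difference. The helper name `pooled_to_true_se_ratio` is hypothetical, introduced only for this illustration.

```python
import numpy as np

def pooled_to_true_se_ratio(n1, n2, sd1, sd2):
    """Ratio of the SE implied by the expected pooled variance to the true SE.

    Below 1: Student's t underestimates the SE (Type I error inflates).
    Above 1: Student's t overestimates the SE (test is conservative).
    """
    # True variance of the difference in sample means
    true_var = sd1**2 / n1 + sd2**2 / n2
    # Expected value of the pooled variance estimate
    expected_pooled = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2)
    # Variance of the difference that Student's t works with
    pooled_var = expected_pooled * (1 / n1 + 1 / n2)
    return np.sqrt(pooled_var / true_var)

scenarios = {
    "Unequal var, equal n": (30, 30, 10, 30),
    "Larger var in SMALLER group": (15, 45, 30, 10),
    "Larger var in LARGER group": (45, 15, 30, 10),
}
for label, args in scenarios.items():
    print(f"{label:<30} SE ratio = {pooled_to_true_se_ratio(*args):.3f}")
```

With equal group sizes the ratio is exactly 1, which is why unequal variance alone does less damage. With the larger variance in the smaller group the ratio falls to roughly 0.65, and in the reverse case it rises to roughly 1.53.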

Detecting Unequal Variance

Visual Inspection

import matplotlib.pyplot as plt

def plot_variance_comparison(groups, group_names):
    """
    Visual comparison of group variances.
    """
    fig, axes = plt.subplots(1, 3, figsize=(15, 5))

    # Box plots
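    axes[0].boxplot(groups)
    # Set tick labels explicitly so the code does not depend on the
    # boxplot `labels` keyword, which newer Matplotlib versions deprecate
    axes[0].set_xticklabels(group_names)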
    axes[0].set_ylabel('Value')
    axes[0].set_title('Box Plots (compare spread)')

    # Variance bar chart
    variances = [np.var(g, ddof=1) for g in groups]
    axes[1].bar(group_names, variances, color=['steelblue', 'darkorange'])
    axes[1].set_ylabel('Variance')
    axes[1].set_title(f'Variances (ratio = {max(variances)/min(variances):.2f})')

    # Sample sizes
    ns = [len(g) for g in groups]
    axes[2].bar(group_names, ns, color=['steelblue', 'darkorange'])
    axes[2].set_ylabel('Sample Size')
    axes[2].set_title('Sample Sizes')

    plt.tight_layout()
    return fig

Levene's Test

from scipy.stats import levene

def test_equal_variance(group1, group2, alpha=0.05):
    """
    Test for equal variances with practical interpretation.
    """
    # Levene's test (robust to non-normality)
    stat, p = levene(group1, group2, center='median')

    # Variance ratio
    var1, var2 = np.var(group1, ddof=1), np.var(group2, ddof=1)
    var_ratio = max(var1, var2) / min(var1, var2)

    # Sample sizes
    n1, n2 = len(group1), len(group2)
    n_ratio = max(n1, n2) / min(n1, n2)

    # Assess severity
    if var_ratio < 2:
        severity = 'Low'
        recommendation = 'Standard t-test OK, but Welch is safe'
    elif var_ratio < 3:
        severity = 'Moderate'
        recommendation = 'Use Welch\'s t-test'
    elif var_ratio < 4:
        severity = 'Substantial'
        recommendation = 'Definitely use Welch\'s t-test'
    else:
        severity = 'Severe'
        recommendation = 'Use Welch; consider transformation or rank test'

    # Extra warning for dangerous combination
    if var_ratio > 2 and n_ratio > 1.5:
        # Determine which group is smaller with larger variance
        smaller_group = 1 if n1 < n2 else 2
        larger_var_group = 1 if var1 > var2 else 2
        if smaller_group == larger_var_group:
            recommendation += '\n  ⚠️  WARNING: Smaller group has larger variance - Type I error inflated!'

    return {
        'levene_stat': stat,
        'levene_p': p,
        'var_ratio': var_ratio,
        'variances': (var1, var2),
        'sample_sizes': (n1, n2),
        'n_ratio': n_ratio,
        'severity': severity,
        'recommendation': recommendation
    }


# Example
np.random.seed(42)
g1 = np.random.normal(50, 5, 20)   # Smaller n, smaller variance
g2 = np.random.normal(50, 15, 50)  # Larger n, larger variance

result = test_equal_variance(g1, g2)
print("Variance Equality Assessment:")
print("-" * 40)
print(f"Variances: {result['variances'][0]:.2f}, {result['variances'][1]:.2f}")
print(f"Variance ratio: {result['var_ratio']:.2f}")
print(f"Sample sizes: {result['sample_sizes']}")
print(f"Levene's test: p = {result['levene_p']:.4f}")
print(f"Severity: {result['severity']}")
print(f"Recommendation: {result['recommendation']}")
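
One caveat when reading Levene's p-value is that its verdict depends heavily on sample size. The following rough simulation sketch estimates how often Levene's test rejects at two group sizes when the true SD ratio is fixed. The function name `levene_rejection_rate`, the SD ratio of 1.5, and the number of simulations are all illustrative assumptions.

```python
import numpy as np
from scipy.stats import levene

def levene_rejection_rate(n, sd_ratio=1.5, n_sims=500, alpha=0.05, seed=0):
    """Share of simulations in which Levene's test rejects equal variances.

    Both groups have n observations; group b has sd_ratio times the SD of a.
    """
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(n_sims):
        a = rng.normal(0, 1, n)
        b = rng.normal(0, sd_ratio, n)
        _, p = levene(a, b, center='median')
        rejections += int(p < alpha)
    return rejections / n_sims

small = levene_rejection_rate(n=10)
large = levene_rejection_rate(n=200)
print(f"n = 10 per group:  rejects {small:.1%} of the time")
print(f"n = 200 per group: rejects {large:.1%} of the time")
```

The same true variance ratio is usually missed with small groups and almost always flagged with large ones, which is one reason to look at the variance ratio itself rather than the p-value alone.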

Rule of Thumb Table

| Variance Ratio | With Equal n | With Unequal n |
|---|---|---|
| < 2 | Fine | Usually fine |
| 2-3 | Fine | Consider Welch |
| 3-4 | Consider Welch | Definitely Welch |
| > 4 | Use Welch | Welch + caution |

Welch's T-Test: The Solution

How It Works

Welch's t-test doesn't pool variances. It uses separate variance estimates:

$$t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$$

With Welch-Satterthwaite degrees of freedom:

$$df = \frac{\left(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\right)^2}{\frac{(s_1^2/n_1)^2}{n_1-1} + \frac{(s_2^2/n_2)^2}{n_2-1}}$$

def compare_tests(group1, group2):
    """
    Compare Student's and Welch's t-test.
    """
    # Student's t-test
    t_student, p_student = stats.ttest_ind(group1, group2, equal_var=True)
    df_student = len(group1) + len(group2) - 2

    # Welch's t-test
    t_welch, p_welch = stats.ttest_ind(group1, group2, equal_var=False)

    # Calculate Welch df
    var1, var2 = np.var(group1, ddof=1), np.var(group2, ddof=1)
    n1, n2 = len(group1), len(group2)
    num = (var1/n1 + var2/n2)**2
    denom = (var1/n1)**2/(n1-1) + (var2/n2)**2/(n2-1)
    df_welch = num / denom

    print("Test Comparison:")
    print("-" * 50)
    print(f"{'':20} {'Student':>12} {'Welch':>12}")
    print("-" * 50)
    print(f"{'t-statistic':20} {t_student:>12.4f} {t_welch:>12.4f}")
    print(f"{'df':20} {df_student:>12.1f} {df_welch:>12.1f}")
    print(f"{'p-value':20} {p_student:>12.4f} {p_welch:>12.4f}")

    # Variance info
    print(f"\nVariance ratio: {max(var1, var2)/min(var1, var2):.2f}")
    print(f"Sample sizes: {n1}, {n2}")


# Example with very unequal variances
np.random.seed(42)
g1 = np.random.normal(50, 5, 25)
g2 = np.random.normal(55, 20, 40)

compare_tests(g1, g2)
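
As a sanity check on the two formulas, the following minimal sketch computes Welch's statistic, degrees of freedom, and two-sided p-value by hand and compares them with SciPy's result. The helper name `welch_by_hand` and the simulated data are assumptions made for illustration.

```python
import numpy as np
from scipy import stats

def welch_by_hand(x, y):
    """Welch's t, Welch-Satterthwaite df, and two-sided p-value."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    # Per-group squared standard errors: s_i^2 / n_i
    v1 = x.var(ddof=1) / len(x)
    v2 = y.var(ddof=1) / len(y)
    t = (x.mean() - y.mean()) / np.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1**2 / (len(x) - 1) + v2**2 / (len(y) - 1))
    p = 2 * stats.t.sf(abs(t), df)
    return t, df, p

rng = np.random.default_rng(0)
a = rng.normal(50, 5, 25)
b = rng.normal(55, 20, 40)

t_manual, df_manual, p_manual = welch_by_hand(a, b)
t_scipy, p_scipy = stats.ttest_ind(a, b, equal_var=False)

print(f"manual: t = {t_manual:.4f}, df = {df_manual:.1f}, p = {p_manual:.4f}")
print(f"scipy:  t = {t_scipy:.4f}, p = {p_scipy:.4f}")
```

The Welch-Satterthwaite df always lies between min(n1, n2) - 1 and n1 + n2 - 2. The more unequal the per-group standard errors, the closer it sits to the lower bound, which is how the test pays for not pooling.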

Why Welch Should Be Default

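def welch_as_default_justification(n_sims=10000):
    """
    Compare rejection rates of Student's and Welch's tests under a true effect.

    Caveat: where Student's Type I error is inflated (larger variance in
    the smaller group), its higher rejection rate is not a genuine power
    advantage, so read those rows alongside the Type I error table.
    """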
    np.random.seed(42)

    scenarios = [
        # (n1, n2, sd1, sd2, true_diff)
        (30, 30, 10, 10, 5),   # Equal var, equal n
        (30, 30, 10, 30, 5),   # Unequal var, equal n
        (15, 45, 30, 10, 5),   # Smaller group, larger var
        (45, 15, 30, 10, 5),   # Larger group, larger var
    ]

    results = []
    for n1, n2, sd1, sd2, diff in scenarios:
        student_power = 0
        welch_power = 0

        for _ in range(n_sims):
            g1 = np.random.normal(50, sd1, n1)
            g2 = np.random.normal(50 + diff, sd2, n2)

            _, p_student = stats.ttest_ind(g1, g2, equal_var=True)
            _, p_welch = stats.ttest_ind(g1, g2, equal_var=False)

            if p_student < 0.05:
                student_power += 1
            if p_welch < 0.05:
                welch_power += 1

        results.append({
            'scenario': f'n=({n1},{n2}), SD=({sd1},{sd2})',
            'Student_power': student_power / n_sims,
            'Welch_power': welch_power / n_sims,
            'var_ratio': max(sd1, sd2)**2 / min(sd1, sd2)**2
        })

    print("Power Comparison (true effect = 5):")
    print("=" * 70)
    print(f"{'Scenario':<30} {'Student':>12} {'Welch':>12} {'Var Ratio':>10}")
    print("-" * 70)
    for r in results:
        diff = r['Welch_power'] - r['Student_power']
        marker = f" ({diff:+.1%})" if abs(diff) > 0.02 else ""
        print(f"{r['scenario']:<30} {r['Student_power']:>12.1%} "
              f"{r['Welch_power']:>12.1%} {r['var_ratio']:>10.1f}{marker}")


welch_as_default_justification()

Multiple Groups: Welch's ANOVA

The Extension

For more than two groups, use a test that drops the equal-variance assumption, such as Welch's ANOVA, instead of standard one-way ANOVA. The comparison below uses SciPy's alexandergovern, which implements the Alexander-Govern test: a different procedure from Welch's ANOVA, but one that likewise does not assume equal variances.

from scipy.stats import f_oneway, alexandergovern

def compare_anova_methods(*groups):
    """
    Compare standard ANOVA with a heteroscedasticity-robust alternative.
    """
    # Standard one-way ANOVA (assumes equal variances)
    f_stat, p_standard = f_oneway(*groups)

    # Alexander-Govern test (does not assume equal variances;
    # related to, but not the same as, Welch's ANOVA)
    result = alexandergovern(*groups)

    # Variance info
    variances = [np.var(g, ddof=1) for g in groups]
    var_ratio = max(variances) / min(variances)

    print("ANOVA Comparison:")
    print("-" * 40)
    print(f"Standard ANOVA: F = {f_stat:.3f}, p = {p_standard:.4f}")
    print(f"Alexander-Govern: stat = {result.statistic:.3f}, p = {result.pvalue:.4f}")
    print(f"\nVariance ratio: {var_ratio:.2f}")
    print(f"Variances: {[f'{v:.1f}' for v in variances]}")


# Example
np.random.seed(42)
g1 = np.random.normal(50, 5, 30)
g2 = np.random.normal(52, 15, 25)
g3 = np.random.normal(55, 10, 35)

compare_anova_methods(g1, g2, g3)
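
For Welch's ANOVA itself, a from-scratch sketch using the standard textbook formula is given here. Each group mean is weighted by n_i / s_i^2, and the denominator degrees of freedom are adjusted in the same spirit as Welch-Satterthwaite. The function name `welch_anova` and the simulated data are choices made for this illustration, not part of SciPy's API used above.

```python
import numpy as np
from scipy import stats

def welch_anova(*groups):
    """Welch's one-way ANOVA. Returns (F, df1, df2, p)."""
    k = len(groups)
    n = np.array([len(g) for g in groups], dtype=float)
    means = np.array([np.mean(g) for g in groups])
    variances = np.array([np.var(g, ddof=1) for g in groups])

    # Precision weights: groups with noisy means count for less
    w = n / variances
    w_total = w.sum()
    weighted_mean = np.sum(w * means) / w_total

    # Correction term shared by the F statistic and df2
    lam = np.sum((1 - w / w_total) ** 2 / (n - 1))

    numerator = np.sum(w * (means - weighted_mean) ** 2) / (k - 1)
    denominator = 1 + 2 * (k - 2) / (k**2 - 1) * lam
    f_stat = numerator / denominator

    df1 = k - 1
    df2 = (k**2 - 1) / (3 * lam)
    p = stats.f.sf(f_stat, df1, df2)
    return f_stat, df1, df2, p

rng = np.random.default_rng(0)
a = rng.normal(50, 5, 30)
b = rng.normal(52, 15, 25)
c = rng.normal(55, 10, 35)

f_stat, df1, df2, p = welch_anova(a, b, c)
print(f"Welch's ANOVA: F({df1}, {df2:.1f}) = {f_stat:.3f}, p = {p:.4f}")
```

With exactly two groups this reduces to Welch's t-test: F equals t squared and df2 equals the Welch-Satterthwaite df, which makes a convenient correctness check.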

Post-Hoc: Games-Howell

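When using Welch's ANOVA, follow up with post-hoc comparisons that also avoid the equal-variance assumption. Games-Howell is the usual choice. The function below is a simpler stand-in, not Games-Howell itself: it runs pairwise Welch t-tests and applies a Holm correction.

def pairwise_welch_holm(groups, names):
    """
    Pairwise Welch t-tests with Holm correction (no equal variance assumption).

    Note: this is not the Games-Howell procedure, which refers the
    Welch-type statistic to the studentized range distribution.
    """
    import scikit_posthocs as sp

    all_data = np.concatenate(groups)
    labels = np.repeat(names, [len(g) for g in groups])
    df = pd.DataFrame({'value': all_data, 'group': labels})

    # Welch t-tests (equal_var=False) with Holm-adjusted p-values
    result = sp.posthoc_ttest(
        df, val_col='value', group_col='group',
        equal_var=False, p_adjust='holm'
    )

    return result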
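
For Games-Howell as it is usually defined (Welch-type standard errors and degrees of freedom, referred to the studentized range distribution), a minimal sketch follows. It assumes SciPy 1.7 or later, which provides `scipy.stats.studentized_range`. The function name `games_howell` and the tuple layout of its output are illustrative choices.

```python
from itertools import combinations

import numpy as np
from scipy import stats

def games_howell(groups, names):
    """Games-Howell pairwise comparisons.

    Returns a list of (name_i, name_j, mean_diff, df, p_value) tuples.
    """
    k = len(groups)
    rows = []
    for i, j in combinations(range(k), 2):
        a = np.asarray(groups[i], dtype=float)
        b = np.asarray(groups[j], dtype=float)
        # Per-group squared standard errors, never pooled
        va = a.var(ddof=1) / len(a)
        vb = b.var(ddof=1) / len(b)
        diff = a.mean() - b.mean()
        # Welch-Satterthwaite df for this pair
        df = (va + vb) ** 2 / (va**2 / (len(a) - 1) + vb**2 / (len(b) - 1))
        # Studentized range statistic: q = |t| * sqrt(2)
        q = abs(diff) / np.sqrt(va + vb) * np.sqrt(2)
        p = float(stats.studentized_range.sf(q, k, df))
        rows.append((names[i], names[j], diff, df, p))
    return rows

rng = np.random.default_rng(0)
g_a = rng.normal(50, 5, 30)
g_b = rng.normal(52, 15, 25)
g_c = rng.normal(55, 10, 35)

results = games_howell([g_a, g_b, g_c], ['A', 'B', 'C'])
for name_i, name_j, diff, df, p in results:
    print(f"{name_i} vs {name_j}: diff = {diff:+.2f}, df = {df:.1f}, p = {p:.4f}")
```

The p-values are already adjusted for the number of groups through the studentized range distribution, so no further Holm or Bonferroni step is applied.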

Practical Workflow

def complete_variance_workflow(group1, group2, name1='Group 1', name2='Group 2'):
    """
    Complete workflow for handling variance assumption.
    """
    print("=" * 60)
    print("VARIANCE ASSUMPTION WORKFLOW")
    print("=" * 60)

    # 1. Descriptives
    print("\n1. DESCRIPTIVE STATISTICS")
    print("-" * 40)
    for name, g in [(name1, group1), (name2, group2)]:
        print(f"\n{name}:")
        print(f"  n = {len(g)}")
        print(f"  Mean = {np.mean(g):.3f}")
        print(f"  SD = {np.std(g, ddof=1):.3f}")
        print(f"  Variance = {np.var(g, ddof=1):.3f}")

    # 2. Variance assessment
    print("\n\n2. VARIANCE ASSESSMENT")
    print("-" * 40)
    var1 = np.var(group1, ddof=1)
    var2 = np.var(group2, ddof=1)
    var_ratio = max(var1, var2) / min(var1, var2)

    stat, p_levene = levene(group1, group2, center='median')

    print(f"Variance ratio: {var_ratio:.2f}")
    print(f"Levene's test: p = {p_levene:.4f}")

    if var_ratio < 2:
        print("Assessment: Variances are reasonably similar")
    elif var_ratio < 4:
        print("Assessment: Moderate variance inequality")
    else:
        print("Assessment: Substantial variance inequality")

    # 3. Both tests
    print("\n\n3. ANALYSIS RESULTS")
    print("-" * 40)

    t_student, p_student = stats.ttest_ind(group1, group2, equal_var=True)
    t_welch, p_welch = stats.ttest_ind(group1, group2, equal_var=False)

    print(f"Student's t: t = {t_student:.3f}, p = {p_student:.4f}")
    print(f"Welch's t:   t = {t_welch:.3f}, p = {p_welch:.4f}")

    # 4. Recommendation
    print("\n\n4. RECOMMENDATION")
    print("-" * 40)

    if var_ratio > 2:
        print(f"Use Welch's result: p = {p_welch:.4f}")
        print("(Variance ratio > 2 suggests unequal variances)")
    elif abs(p_student - p_welch) > 0.01:
        print(f"Use Welch's result: p = {p_welch:.4f}")
        print("(Results differ; Welch is more robust)")
    else:
        print(f"Either result is fine: p ≈ {p_welch:.4f}")
        print("(Welch recommended as safe default)")

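    # 5. Effect size
    # Note: Cohen's d standardizes by the pooled SD, so it inherits the
    # equal-variance assumption; interpret it cautiously when var_ratio is large.
    print("\n\n5. EFFECT SIZE")
    print("-" * 40)
    pooled_std = np.sqrt(((len(group1)-1)*var1 + (len(group2)-1)*var2) /
                         (len(group1) + len(group2) - 2))
    cohens_d = (np.mean(group1) - np.mean(group2)) / pooled_std
    print(f"Cohen's d = {cohens_d:.3f}")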

    print("\n" + "=" * 60)


# Example
np.random.seed(42)
control = np.random.normal(50, 8, 35)
treatment = np.random.normal(55, 18, 25)

complete_variance_workflow(control, treatment, 'Control', 'Treatment')

R Implementation

variance_workflow <- function(group1, group2) {
  cat("VARIANCE ASSUMPTION WORKFLOW\n")
  # strrep() builds one string; cat(rep("=", 50)) would print "= = = ..." with spaces
  cat(strrep("=", 50), "\n\n")

  # Descriptives
  cat("1. DESCRIPTIVES\n")
  cat("Group 1: n =", length(group1), ", SD =", round(sd(group1), 3), "\n")
  cat("Group 2: n =", length(group2), ", SD =", round(sd(group2), 3), "\n")

  # Variance ratio
  var_ratio <- max(var(group1), var(group2)) / min(var(group1), var(group2))
  cat("\nVariance ratio:", round(var_ratio, 2), "\n")

  # Levene's test
  library(car)
  df <- data.frame(
    value = c(group1, group2),
    group = factor(rep(c("G1", "G2"), c(length(group1), length(group2))))
  )
  lev <- leveneTest(value ~ group, data = df, center = median)
  cat("Levene's test p-value:", round(lev$`Pr(>F)`[1], 4), "\n")

  # Both tests
  cat("\n2. T-TEST RESULTS\n")
  cat("Student's:", round(t.test(group1, group2, var.equal = TRUE)$p.value, 4), "\n")
  cat("Welch's:", round(t.test(group1, group2, var.equal = FALSE)$p.value, 4), "\n")

  # Recommendation
  cat("\n3. RECOMMENDATION\n")
  if (var_ratio > 2) {
    cat("Use Welch's t-test (variance ratio > 2)\n")
  } else {
    cat("Either test OK, but Welch is safe default\n")
  }
}

# Usage
# group1 <- rnorm(30, 50, 5)
# group2 <- rnorm(25, 52, 15)
# variance_workflow(group1, group2)


Key Takeaway

The equal variance assumption matters more than normality for t-tests. When it is violated alongside unequal sample sizes, standard tests can be badly wrong. Welch's t-test handles this correctly at almost no cost when variances are actually equal. Make Welch your default: it is robust to violations of the assumption and performs nearly identically when the assumption holds.


Frequently Asked Questions

Should I always use Welch's t-test?
Yes, it's a safe default. When variances are equal, Welch's is nearly identical to Student's. When they're unequal, Welch's is correct and Student's can be badly wrong.

How do I know if variances are unequal?
Calculate the variance ratio (larger/smaller). Ratios above 2 suggest potential problems; above 3-4 definitely require correction. Levene's test can help, but like normality tests it has low power in small samples and flags trivial differences in large ones, so don't rely on its p-value alone.

Does sample size matter for this assumption?
Yes, critically. Unequal variance with equal sample sizes is less problematic than unequal variance with unequal sample sizes. The worst case: the smaller group has the larger variance.
