Multi-Group Comparisons

Heteroskedastic Groups: When Variances Differ and What to Do About It

How to handle multi-group comparisons when variances are unequal. Covers Welch's ANOVA, Games-Howell post-hoc comparisons, and why unequal variances matter more than non-normality.


Quick Hits

  • Unequal variances matter more than non-normality for ANOVA validity
  • Welch's ANOVA handles heteroskedasticity in the omnibus test
  • Games-Howell provides robust post-hoc comparisons without equal variance assumption
  • When smaller groups have larger variance, standard ANOVA inflates Type I error

TL;DR

When comparing multiple groups, unequal variances (heteroskedasticity) cause more problems than non-normality. Standard ANOVA assumes equal variances; violations inflate or deflate the Type I error rate depending on whether the larger variances fall in the smaller or the larger groups. The solution: use Welch's ANOVA for the omnibus test and Games-Howell for post-hoc comparisons. Neither assumes equal variances, and both perform well either way.


Why Heteroskedasticity Matters

The Problem

Standard ANOVA pools variance across groups: $$MS_{within} = \frac{\sum_{i=1}^{k} (n_i - 1)s_i^2}{\sum_{i=1}^{k} (n_i - 1)}$$

This weighted average is appropriate only when all $s_i^2$ estimate the same population variance.
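
To see the weighting concretely, here is the pooled estimate computed by hand for three hypothetical groups (made-up numbers, for illustration only):

ns = [15, 30, 45]          # group sizes
s2 = [4.0, 1.0, 1.0]       # sample variances: the small group is the noisy one

ms_within = sum((n - 1) * v for n, v in zip(ns, s2)) / sum(n - 1 for n in ns)
print(ms_within)           # ~1.48, far below the noisy group's variance of 4

Because each $s_i^2$ is weighted by its degrees of freedom, the small, high-variance group is underrepresented in $MS_{within}$, so the F-test's denominator comes out too small.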

What Goes Wrong

Larger variance in smaller groups: Type I error inflates (more false positives)

Larger variance in larger groups: Test becomes conservative (less power)

import numpy as np
from scipy import stats

def simulate_heteroskedasticity_impact(n_sims=10000):
    """
    Simulate ANOVA under null with unequal variances.
    """
    scenarios = {
        'Equal variances': ([30, 30, 30], [1, 1, 1]),
        'Larger var in smaller group': ([15, 30, 45], [4, 1, 1]),
        'Larger var in larger group': ([45, 30, 15], [4, 1, 1]),
    }

    results = {}
    for name, (ns, vars_) in scenarios.items():
        rejections = 0
        for _ in range(n_sims):
            groups = [np.random.normal(0, np.sqrt(v), n)
                     for n, v in zip(ns, vars_)]
            _, p = stats.f_oneway(*groups)
            if p < 0.05:
                rejections += 1
        results[name] = rejections / n_sims

    return results


results = simulate_heteroskedasticity_impact()
for scenario, rate in results.items():
    print(f"{scenario}: Type I error = {rate:.3f}")

Detecting Heteroskedasticity

Visual Inspection

import matplotlib.pyplot as plt

def plot_variance_comparison(groups, group_names):
    """Visualize variance differences across groups."""
    fig, axes = plt.subplots(1, 2, figsize=(12, 5))

    # Boxplots
    axes[0].boxplot(groups, labels=group_names)
    axes[0].set_ylabel('Value')
    axes[0].set_title('Boxplots by Group')

    # Variance bar chart
    variances = [np.var(g, ddof=1) for g in groups]
    axes[1].bar(group_names, variances)
    axes[1].set_ylabel('Variance')
    axes[1].set_title('Variance by Group')

    plt.tight_layout()
    return fig
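
A quick usage sketch with simulated data (arbitrary parameters, for illustration):

rng = np.random.default_rng(0)
demo_groups = [rng.normal(0, scale, 40) for scale in (1, 2, 4)]
plot_variance_comparison(demo_groups, ['A', 'B', 'C'])
plt.show()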

Levene's Test

from scipy.stats import levene

def check_homogeneity(groups):
    """
    Test for equal variances using Levene's test.
    """
    # Levene's test (robust to non-normality)
    stat, p = levene(*groups, center='median')

    variances = [np.var(g, ddof=1) for g in groups]
    var_ratio = max(variances) / min(variances)

    return {
        'levene_stat': stat,
        'p_value': p,
        'variance_ratio': var_ratio,
        'variances': variances,
        'conclusion': 'Unequal variances' if p < 0.05 else 'Equal variances (cannot reject)'
    }


# Example
group1 = np.random.normal(0, 1, 30)
group2 = np.random.normal(0, 3, 30)  # Much higher variance
group3 = np.random.normal(0, 2, 30)

result = check_homogeneity([group1, group2, group3])
print(f"Levene's test: p = {result['p_value']:.4f}")
print(f"Variance ratio: {result['variance_ratio']:.2f}")
print(f"Conclusion: {result['conclusion']}")

Rule of Thumb

Variance Ratio    Assessment
< 2               Usually fine
2-3               Borderline; consider robust methods
3-4               Problematic; use robust methods
> 4               Definitely use robust methods

Solution 1: Welch's ANOVA

Welch's ANOVA drops the equal-variance assumption. Instead of pooling variance, it weights each group by the precision of its mean and uses a modified F statistic with adjusted degrees of freedom.
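
In symbols, with weights $w_i = n_i / s_i^2$, $W = \sum_{i=1}^{k} w_i$, and weighted grand mean $\bar{x}' = \sum_i w_i \bar{x}_i / W$: $$F^* = \frac{\frac{1}{k-1}\sum_{i=1}^{k} w_i(\bar{x}_i - \bar{x}')^2}{1 + \frac{2(k-2)}{k^2 - 1}\sum_{i=1}^{k}\frac{(1 - w_i/W)^2}{n_i - 1}}$$

The statistic is referred to an F distribution with $k - 1$ and $\nu = (k^2 - 1)\big/\big(3\sum_i (1 - w_i/W)^2/(n_i - 1)\big)$ degrees of freedom. Because the weights are $n_i/s_i^2$, a noisy group contributes less, which is exactly what pooling gets wrong.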

Python Implementation

from scipy.stats import alexandergovern

def welch_anova(groups):
    """
    Welch's ANOVA for unequal variances.
    Uses Alexander-Govern approximation.
    """
    result = alexandergovern(*groups)

    return {
        'statistic': result.statistic,
        'p_value': result.pvalue
    }


# Compare standard and Welch's ANOVA
standard = stats.f_oneway(group1, group2, group3)
welch = welch_anova([group1, group2, group3])

print("With unequal variances:")
print(f"  Standard ANOVA: F = {standard.statistic:.2f}, p = {standard.pvalue:.4f}")
print(f"  Welch's ANOVA: stat = {welch['statistic']:.2f}, p = {welch['p_value']:.4f}")

R Implementation

# Welch's ANOVA
oneway.test(value ~ group, data = df, var.equal = FALSE)

When to Use

  • Levene's test significant (p < 0.05)
  • Variance ratio > 3
  • Unequal sample sizes (compounds the problem)
  • As a sensible default (it gives up very little power when variances are actually equal)

Solution 2: Games-Howell Post-Hoc

For pairwise comparisons after Welch's ANOVA, Games-Howell doesn't assume equal variances.

How It Works

Games-Howell is essentially Tukey's HSD but:

  • Uses separate variance estimates for each pairwise comparison
  • Adjusts degrees of freedom per pair (Welch-Satterthwaite; see the formulas below)
  • Behaves like running multiple Welch's t-tests with family-wise error control
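
Concretely, each pair $(i, j)$ gets its own standard error and degrees of freedom: $$t_{ij} = \frac{\bar{x}_i - \bar{x}_j}{\sqrt{s_i^2/n_i + s_j^2/n_j}}, \qquad \nu_{ij} = \frac{\left(s_i^2/n_i + s_j^2/n_j\right)^2}{\dfrac{(s_i^2/n_i)^2}{n_i - 1} + \dfrac{(s_j^2/n_j)^2}{n_j - 1}}$$

and $|t_{ij}|\sqrt{2}$ is compared against the studentized range critical value $q_{\alpha, k, \nu_{ij}}$.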

Python Implementation

import scikit_posthocs as sp
import pandas as pd

def games_howell_test(groups, group_names):
    """
    Games-Howell test for pairwise comparisons with unequal variances.
    """
    all_data = np.concatenate(groups)
    labels = np.repeat(group_names, [len(g) for g in groups])
    df = pd.DataFrame({'value': all_data, 'group': labels})

    # Games-Howell (Welch t-tests with Holm correction)
    result = sp.posthoc_ttest(
        df, val_col='value', group_col='group',
        equal_var=False, p_adjust='holm'
    )

    return result


result = games_howell_test([group1, group2, group3], ['A', 'B', 'C'])
print("Games-Howell Results (adjusted p-values):")
print(result)
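
If an exact Games-Howell is preferred, pingouin provides one directly; a minimal usage sketch, assuming pingouin is installed and reusing group1-group3 from above:

import numpy as np
import pandas as pd
import pingouin as pg

gh_df = pd.DataFrame({
    'value': np.concatenate([group1, group2, group3]),
    'group': np.repeat(['A', 'B', 'C'], 30),
})
print(pg.pairwise_gameshowell(data=gh_df, dv='value', between='group'))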

R Implementation

library(rstatix)
games_howell_test(df, value ~ group)

# Or
library(PMCMRplus)
gamesHowellTest(value ~ group, data = df)

Complete Robust Workflow

def robust_multigroup_comparison(groups, group_names, alpha=0.05):
    """
    Complete workflow for comparing groups with potential heteroskedasticity.
    """
    print("=" * 60)
    print("ROBUST MULTI-GROUP COMPARISON")
    print("=" * 60)

    # 1. Descriptive statistics
    print("\n1. DESCRIPTIVE STATISTICS")
    print("-" * 40)
    for name, g in zip(group_names, groups):
        print(f"  {name}: n={len(g)}, M={np.mean(g):.2f}, "
              f"SD={np.std(g, ddof=1):.2f}, Var={np.var(g, ddof=1):.2f}")

    # 2. Check variance homogeneity
    print("\n2. VARIANCE HOMOGENEITY CHECK")
    print("-" * 40)
    homo = check_homogeneity(groups)
    print(f"  Levene's test: p = {homo['p_value']:.4f}")
    print(f"  Variance ratio: {homo['variance_ratio']:.2f}")

    # 3. Omnibus test (always use Welch's for robustness)
    print("\n3. OMNIBUS TEST (Welch's ANOVA)")
    print("-" * 40)
    welch = welch_anova(groups)
    print(f"  Statistic = {welch['statistic']:.2f}, p = {welch['p_value']:.4f}")

    if welch['p_value'] >= alpha:
        print(f"\n  Result: Not significant (p >= {alpha})")
        print("  No post-hoc comparisons warranted.")
        return

    # 4. Post-hoc (Games-Howell)
    print(f"\n4. POST-HOC COMPARISONS (Games-Howell)")
    print("-" * 40)
    print("  Adjusted p-values:")
    gh_result = games_howell_test(groups, group_names)
    print(gh_result.to_string())

    # 5. Summary
    print("\n5. SUMMARY")
    print("-" * 40)
    significant_pairs = []
    for i, name_i in enumerate(group_names):
        for j, name_j in enumerate(group_names):
            if i < j:
                p = gh_result.iloc[i, j]
                if p < alpha:
                    significant_pairs.append(f"{name_i} vs {name_j}")

    if significant_pairs:
        print(f"  Significant differences: {', '.join(significant_pairs)}")
    else:
        print("  No significant pairwise differences found.")


# Example usage
np.random.seed(42)
control = np.random.normal(50, 5, 25)      # Low variance
treatment_a = np.random.normal(55, 15, 25)  # High variance
treatment_b = np.random.normal(52, 10, 25)  # Medium variance

robust_multigroup_comparison(
    [control, treatment_a, treatment_b],
    ['Control', 'Treatment A', 'Treatment B']
)

Comparison: Standard vs. Robust Methods

def compare_methods(groups, group_names):
    """Compare standard and robust approaches."""
    print("Method Comparison:")
    print("-" * 50)

    # Standard ANOVA + Tukey's
    f_stat, anova_p = stats.f_oneway(*groups)
    print(f"Standard ANOVA: F = {f_stat:.2f}, p = {anova_p:.4f}")

    # Welch's ANOVA
    welch = alexandergovern(*groups)
    print(f"Welch's ANOVA: stat = {welch.statistic:.2f}, p = {welch.pvalue:.4f}")

    # Post-hoc comparison (if significant)
    if anova_p < 0.05 or welch.pvalue < 0.05:
        all_data = np.concatenate(groups)
        labels = np.repeat(group_names, [len(g) for g in groups])

        # Tukey's
        from statsmodels.stats.multicomp import pairwise_tukeyhsd
        tukey = pairwise_tukeyhsd(all_data, labels)

        # Games-Howell
        df = pd.DataFrame({'value': all_data, 'group': labels})
        gh = sp.posthoc_ttest(df, val_col='value', group_col='group',
                              equal_var=False, p_adjust='holm')

        print("\nPost-hoc comparison (first pair):")
        print(f"  Tukey's HSD: p = {tukey.pvalues[0]:.4f}")
        print(f"  Games-Howell: p = {gh.iloc[0, 1]:.4f}")


Key Takeaway

Heteroskedasticity (unequal variances) is more damaging to ANOVA validity than non-normality. When variances differ across groups, use Welch's ANOVA for the omnibus test and Games-Howell for post-hoc comparisons. These robust methods perform well even when variances happen to be equal, so they can be used as defaults.


References

  1. https://www.jstor.org/stable/2529310
  2. https://www.jstor.org/stable/2684452
  3. Welch, B. L. (1951). On the comparison of several mean values: an alternative approach. *Biometrika*, 38(3/4), 330-336.
  4. Games, P. A., & Howell, J. F. (1976). Pairwise multiple comparison procedures with unequal n's and/or variances. *Journal of Educational Statistics*, 1(2), 113-125.
  5. Tomarken, A. J., & Serlin, R. C. (1986). Comparison of ANOVA alternatives under variance heterogeneity and specific noncentrality structures. *Psychological Bulletin*, 99(1), 90-99.

Frequently Asked Questions

How much variance difference is problematic?
A variance ratio (largest/smallest) above 3 warrants concern. Above 4, definitely use Welch's ANOVA and Games-Howell.
Does Games-Howell require equal sample sizes?
No. Games-Howell handles both unequal variances and unequal sample sizes, making it quite robust.
Should I always use Welch's ANOVA?
It's a reasonable default. Welch's ANOVA performs nearly identically to standard ANOVA when variances are equal, and it remains valid when they are not.
