Multi-Group Comparisons

Heteroskedastic Groups: When Variances Differ and What to Do About It

How to handle multi-group comparisons when variances are unequal. Covers Welch's ANOVA, Games-Howell post-hoc comparisons, and why unequal variances matter more than non-normality.


Quick Hits

  • Unequal variances matter more than non-normality for ANOVA validity
  • Welch's ANOVA handles heteroskedasticity in the omnibus test
  • Games-Howell provides robust post-hoc comparisons without equal variance assumption
  • When smaller groups have larger variance, standard ANOVA inflates Type I error

TL;DR

When comparing multiple groups, unequal variances (heteroskedasticity) cause more problems than non-normality. Standard ANOVA assumes equal variances; violations inflate or deflate the Type I error rate depending on whether the larger variances fall in the smaller or the larger groups. The solution: use Welch's ANOVA for the omnibus test and Games-Howell for post-hoc comparisons. Neither assumes equal variances, and both perform well either way.


Why Heteroskedasticity Matters

The Problem

Standard ANOVA pools variance across groups: $$MS_{within} = \frac{\sum_{i=1}^{k} (n_i - 1)s_i^2}{\sum_{i=1}^{k} (n_i - 1)}$$

This weighted average is appropriate only when all $s_i^2$ estimate the same population variance.
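
To see the weighting concretely, here is the pooled estimate computed by hand for three hypothetical groups (made-up numbers, for illustration only):

ns = [15, 30, 45]          # group sizes
s2 = [4.0, 1.0, 1.0]       # sample variances: the small group is the noisy one

ms_within = sum((n - 1) * v for n, v in zip(ns, s2)) / sum(n - 1 for n in ns)
print(ms_within)           # ~1.48, far below the noisy group's variance of 4

Because each $s_i^2$ is weighted by its degrees of freedom, the small, high-variance group is underrepresented in $MS_{within}$, so the F-test's denominator comes out too small.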

What Goes Wrong

Larger variance in smaller groups: Type I error inflates (more false positives)

Larger variance in larger groups: Test becomes conservative (less power)

import numpy as np
from scipy import stats

def simulate_heteroskedasticity_impact(n_sims=10000):
    """
    Simulate ANOVA under null with unequal variances.
    """
    scenarios = {
        'Equal variances': ([30, 30, 30], [1, 1, 1]),
        'Larger var in smaller group': ([15, 30, 45], [4, 1, 1]),
        'Larger var in larger group': ([45, 30, 15], [4, 1, 1]),
    }

    results = {}
    for name, (ns, vars_) in scenarios.items():
        rejections = 0
        for _ in range(n_sims):
            groups = [np.random.normal(0, np.sqrt(v), n)
                     for n, v in zip(ns, vars_)]
            _, p = stats.f_oneway(*groups)
            if p < 0.05:
                rejections += 1
        results[name] = rejections / n_sims

    return results


results = simulate_heteroskedasticity_impact()
for scenario, rate in results.items():
    print(f"{scenario}: Type I error = {rate:.3f}")

Detecting Heteroskedasticity

Visual Inspection

import matplotlib.pyplot as plt

def plot_variance_comparison(groups, group_names):
    """Visualize variance differences across groups."""
    fig, axes = plt.subplots(1, 2, figsize=(12, 5))

    # Boxplots
    axes[0].boxplot(groups, labels=group_names)
    axes[0].set_ylabel('Value')
    axes[0].set_title('Boxplots by Group')

    # Variance bar chart
    variances = [np.var(g, ddof=1) for g in groups]
    axes[1].bar(group_names, variances)
    axes[1].set_ylabel('Variance')
    axes[1].set_title('Variance by Group')

    plt.tight_layout()
    return fig
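
A quick usage sketch with simulated data (arbitrary parameters, for illustration):

rng = np.random.default_rng(0)
demo_groups = [rng.normal(0, scale, 40) for scale in (1, 2, 4)]
plot_variance_comparison(demo_groups, ['A', 'B', 'C'])
plt.show()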

Levene's Test

from scipy.stats import levene

def check_homogeneity(groups):
    """
    Test for equal variances using Levene's test.
    """
    # Levene's test (robust to non-normality)
    stat, p = levene(*groups, center='median')

    variances = [np.var(g, ddof=1) for g in groups]
    var_ratio = max(variances) / min(variances)

    return {
        'levene_stat': stat,
        'p_value': p,
        'variance_ratio': var_ratio,
        'variances': variances,
        'conclusion': 'Unequal variances' if p < 0.05 else 'Equal variances (cannot reject)'
    }


# Example
group1 = np.random.normal(0, 1, 30)
group2 = np.random.normal(0, 3, 30)  # Much higher variance
group3 = np.random.normal(0, 2, 30)

result = check_homogeneity([group1, group2, group3])
print(f"Levene's test: p = {result['p_value']:.4f}")
print(f"Variance ratio: {result['variance_ratio']:.2f}")
print(f"Conclusion: {result['conclusion']}")

Rule of Thumb

Variance Ratio    Assessment
< 2               Usually fine
2-3               Borderline; consider robust methods
3-4               Problematic; use robust methods
> 4               Definitely use robust methods

Solution 1: Welch's ANOVA

Welch's ANOVA drops the equal-variance assumption. Instead of pooling variance, it weights each group by the precision of its mean and uses a modified F statistic with adjusted degrees of freedom.
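
In symbols, with weights $w_i = n_i / s_i^2$, $W = \sum_{i=1}^{k} w_i$, and weighted grand mean $\bar{x}' = \sum_i w_i \bar{x}_i / W$: $$F^* = \frac{\frac{1}{k-1}\sum_{i=1}^{k} w_i(\bar{x}_i - \bar{x}')^2}{1 + \frac{2(k-2)}{k^2 - 1}\sum_{i=1}^{k}\frac{(1 - w_i/W)^2}{n_i - 1}}$$

The statistic is referred to an F distribution with $k - 1$ and $\nu = (k^2 - 1)\big/\big(3\sum_i (1 - w_i/W)^2/(n_i - 1)\big)$ degrees of freedom. Because the weights are $n_i/s_i^2$, a noisy group contributes less, which is exactly what pooling gets wrong.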

Python Implementation

from scipy.stats import alexandergovern

def welch_anova(groups):
    """
    Welch's ANOVA for unequal variances.
    Uses Alexander-Govern approximation.
    """
    result = alexandergovern(*groups)

    return {
        'statistic': result.statistic,
        'p_value': result.pvalue
    }


# Compare standard and Welch's ANOVA
standard = stats.f_oneway(group1, group2, group3)
welch = welch_anova([group1, group2, group3])

print("With unequal variances:")
print(f"  Standard ANOVA: F = {standard.statistic:.2f}, p = {standard.pvalue:.4f}")
print(f"  Welch's ANOVA: stat = {welch['statistic']:.2f}, p = {welch['p_value']:.4f}")

R Implementation

# Welch's ANOVA
oneway.test(value ~ group, data = df, var.equal = FALSE)

When to Use

  • Levene's test significant (p < 0.05)
  • Variance ratio > 3
  • Unequal sample sizes (compounds the problem)
  • As a sensible default (it gives up very little power when variances are actually equal)

Solution 2: Games-Howell Post-Hoc

For pairwise comparisons after Welch's ANOVA, Games-Howell doesn't assume equal variances.

How It Works

Games-Howell is essentially Tukey's HSD but:

  • Uses separate variance estimates for each pairwise comparison
  • Adjusts degrees of freedom per pair (Welch-Satterthwaite; see the formulas below)
  • Behaves like running multiple Welch's t-tests with family-wise error control
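
Concretely, each pair $(i, j)$ gets its own standard error and degrees of freedom: $$t_{ij} = \frac{\bar{x}_i - \bar{x}_j}{\sqrt{s_i^2/n_i + s_j^2/n_j}}, \qquad \nu_{ij} = \frac{\left(s_i^2/n_i + s_j^2/n_j\right)^2}{\dfrac{(s_i^2/n_i)^2}{n_i - 1} + \dfrac{(s_j^2/n_j)^2}{n_j - 1}}$$

and $|t_{ij}|\sqrt{2}$ is compared against the studentized range critical value $q_{\alpha, k, \nu_{ij}}$.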

Python Implementation

import scikit_posthocs as sp
import pandas as pd

def games_howell_test(groups, group_names):
    """
    Games-Howell test for pairwise comparisons with unequal variances.
    """
    all_data = np.concatenate(groups)
    labels = np.repeat(group_names, [len(g) for g in groups])
    df = pd.DataFrame({'value': all_data, 'group': labels})

    # Games-Howell (Welch t-tests with Holm correction)
    result = sp.posthoc_ttest(
        df, val_col='value', group_col='group',
        equal_var=False, p_adjust='holm'
    )

    return result


result = games_howell_test([group1, group2, group3], ['A', 'B', 'C'])
print("Games-Howell Results (adjusted p-values):")
print(result)
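
If an exact Games-Howell is preferred, pingouin provides one directly; a minimal usage sketch, assuming pingouin is installed and reusing group1-group3 from above:

import numpy as np
import pandas as pd
import pingouin as pg

gh_df = pd.DataFrame({
    'value': np.concatenate([group1, group2, group3]),
    'group': np.repeat(['A', 'B', 'C'], 30),
})
print(pg.pairwise_gameshowell(data=gh_df, dv='value', between='group'))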

R Implementation

library(rstatix)
games_howell_test(df, value ~ group)

# Or
library(PMCMRplus)
gamesHowellTest(value ~ group, data = df)

Complete Robust Workflow

def robust_multigroup_comparison(groups, group_names, alpha=0.05):
    """
    Complete workflow for comparing groups with potential heteroskedasticity.
    """
    print("=" * 60)
    print("ROBUST MULTI-GROUP COMPARISON")
    print("=" * 60)

    # 1. Descriptive statistics
    print("\n1. DESCRIPTIVE STATISTICS")
    print("-" * 40)
    for name, g in zip(group_names, groups):
        print(f"  {name}: n={len(g)}, M={np.mean(g):.2f}, "
              f"SD={np.std(g, ddof=1):.2f}, Var={np.var(g, ddof=1):.2f}")

    # 2. Check variance homogeneity
    print("\n2. VARIANCE HOMOGENEITY CHECK")
    print("-" * 40)
    homo = check_homogeneity(groups)
    print(f"  Levene's test: p = {homo['p_value']:.4f}")
    print(f"  Variance ratio: {homo['variance_ratio']:.2f}")

    # 3. Omnibus test (always use Welch's for robustness)
    print("\n3. OMNIBUS TEST (Welch's ANOVA)")
    print("-" * 40)
    welch = welch_anova(groups)
    print(f"  Statistic = {welch['statistic']:.2f}, p = {welch['p_value']:.4f}")

    if welch['p_value'] >= alpha:
        print(f"\n  Result: Not significant (p >= {alpha})")
        print("  No post-hoc comparisons warranted.")
        return

    # 4. Post-hoc (Games-Howell)
    print(f"\n4. POST-HOC COMPARISONS (Games-Howell)")
    print("-" * 40)
    print("  Adjusted p-values:")
    gh_result = games_howell_test(groups, group_names)
    print(gh_result.to_string())

    # 5. Summary
    print("\n5. SUMMARY")
    print("-" * 40)
    significant_pairs = []
    for i, name_i in enumerate(group_names):
        for j, name_j in enumerate(group_names):
            if i < j:
                p = gh_result.iloc[i, j]
                if p < alpha:
                    significant_pairs.append(f"{name_i} vs {name_j}")

    if significant_pairs:
        print(f"  Significant differences: {', '.join(significant_pairs)}")
    else:
        print("  No significant pairwise differences found.")


# Example usage
np.random.seed(42)
control = np.random.normal(50, 5, 25)      # Low variance
treatment_a = np.random.normal(55, 15, 25)  # High variance
treatment_b = np.random.normal(52, 10, 25)  # Medium variance

robust_multigroup_comparison(
    [control, treatment_a, treatment_b],
    ['Control', 'Treatment A', 'Treatment B']
)

Comparison: Standard vs. Robust Methods

def compare_methods(groups, group_names):
    """Compare standard and robust approaches."""
    print("Method Comparison:")
    print("-" * 50)

    # Standard ANOVA + Tukey's
    f_stat, anova_p = stats.f_oneway(*groups)
    print(f"Standard ANOVA: F = {f_stat:.2f}, p = {anova_p:.4f}")

    # Welch's ANOVA
    welch = alexandergovern(*groups)
    print(f"Welch's ANOVA: stat = {welch.statistic:.2f}, p = {welch.pvalue:.4f}")

    # Post-hoc comparison (if significant)
    if anova_p < 0.05 or welch.pvalue < 0.05:
        all_data = np.concatenate(groups)
        labels = np.repeat(group_names, [len(g) for g in groups])

        # Tukey's
        from statsmodels.stats.multicomp import pairwise_tukeyhsd
        tukey = pairwise_tukeyhsd(all_data, labels)

        # Games-Howell
        df = pd.DataFrame({'value': all_data, 'group': labels})
        gh = sp.posthoc_ttest(df, val_col='value', group_col='group',
                              equal_var=False, p_adjust='holm')

        print("\nPost-hoc comparison (first pair):")
        print(f"  Tukey's HSD: p = {tukey.pvalues[0]:.4f}")
        print(f"  Games-Howell: p = {gh.iloc[0, 1]:.4f}")


Key Takeaway

Heteroskedasticity (unequal variances) is more damaging to ANOVA validity than non-normality. When variances differ across groups, use Welch's ANOVA for the omnibus test and Games-Howell for post-hoc comparisons. These robust methods perform well even when variances happen to be equal, so they can be used as defaults.


References

  1. https://www.jstor.org/stable/2529310
  2. https://www.jstor.org/stable/2684452
  3. Welch, B. L. (1951). On the comparison of several mean values: an alternative approach. *Biometrika*, 38(3/4), 330-336.
  4. Games, P. A., & Howell, J. F. (1976). Pairwise multiple comparison procedures with unequal n's and/or variances. *Journal of Educational Statistics*, 1(2), 113-125.
  5. Tomarken, A. J., & Serlin, R. C. (1986). Comparison of ANOVA alternatives under variance heterogeneity and specific noncentrality structures. *Psychological Bulletin*, 99(1), 90-99.

Frequently Asked Questions

How much variance difference is problematic?
A variance ratio (largest/smallest) above 3 warrants concern. Above 4, definitely use Welch's ANOVA and Games-Howell.
Does Games-Howell require equal sample sizes?
No. Games-Howell handles both unequal variances and unequal sample sizes, making it quite robust.
Should I always use Welch's ANOVA?
It's a reasonable default. Welch's ANOVA performs nearly identically to standard ANOVA when variances are equal, and it remains valid when they are not.
