Heteroskedastic Groups: When Variances Differ and What to Do About It
How to handle multi-group comparisons when variances are unequal. Covers Welch's ANOVA, Games-Howell post-hoc, and why this matters more than non-normality.
Quick Hits
- Unequal variances matter more than non-normality for ANOVA validity
- Welch's ANOVA handles heteroskedasticity in the omnibus test
- Games-Howell provides robust post-hoc comparisons without equal variance assumption
- When smaller groups have larger variance, standard ANOVA inflates Type I error
TL;DR
When comparing multiple groups, unequal variances (heteroskedasticity) cause more problems than non-normality. Standard ANOVA assumes equal variances; violations can inflate or deflate the Type I error rate, depending on whether the high-variance groups are the smaller or the larger ones. The solution: use Welch's ANOVA for the omnibus test and Games-Howell for post-hoc comparisons. Neither method assumes equal variances, and both perform well whether or not the variances are actually equal.
Why Heteroskedasticity Matters
The Problem
Standard ANOVA pools variance across groups: $$MS_{within} = \frac{\sum_{i=1}^{k} (n_i - 1)s_i^2}{\sum_{i=1}^{k} (n_i - 1)}$$
This weighted average is appropriate only when all $s_i^2$ estimate the same population variance.
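To make the weighting concrete, here is a minimal sketch that evaluates the pooled formula for hypothetical inputs: one group with variance 4 and two with variance 1, mirroring the simulation scenarios in the next subsection. The helper name `pooled_variance` and the numbers are illustrative assumptions, not part of any library.

```python
import numpy as np

def pooled_variance(sample_variances, sample_sizes):
    """Pooled within-group variance: sum((n_i - 1) * s_i^2) / sum(n_i - 1)."""
    s2 = np.asarray(sample_variances, dtype=float)
    n = np.asarray(sample_sizes, dtype=float)
    return np.sum((n - 1) * s2) / np.sum(n - 1)

# Hypothetical inputs: one noisy group (s^2 = 4) and two quiet groups (s^2 = 1)
variances = [4.0, 1.0, 1.0]
small_noisy = pooled_variance(variances, [15, 30, 45])  # noisy group is the smallest
large_noisy = pooled_variance(variances, [45, 30, 15])  # noisy group is the largest

print(f"Noisy group smallest: MS_within = {small_noisy:.3f}")
print(f"Noisy group largest:  MS_within = {large_noisy:.3f}")
```

When the noisy group is the smallest, the pooled value (about 1.48) sits far below that group's variance of 4, so the test understates the noise in its mean. When the noisy group is the largest, the pooled value (about 2.52) overstates the noise in the two quiet groups.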
What Goes Wrong
Larger variance in smaller groups: Type I error inflates (more false positives)
Larger variance in larger groups: Test becomes conservative (less power)
import numpy as np
from scipy import stats

def simulate_heteroskedasticity_impact(n_sims=10000):
    """
    Simulate ANOVA under null with unequal variances.
    """
    scenarios = {
        'Equal variances': ([30, 30, 30], [1, 1, 1]),
        'Larger var in smaller group': ([15, 30, 45], [4, 1, 1]),
        'Larger var in larger group': ([45, 30, 15], [4, 1, 1]),
    }
    results = {}
    for name, (ns, vars_) in scenarios.items():
        rejections = 0
        for _ in range(n_sims):
            groups = [np.random.normal(0, np.sqrt(v), n)
                      for n, v in zip(ns, vars_)]
            _, p = stats.f_oneway(*groups)
            if p < 0.05:
                rejections += 1
        results[name] = rejections / n_sims
    return results

results = simulate_heteroskedasticity_impact()
for scenario, rate in results.items():
    print(f"{scenario}: Type I error = {rate:.3f}")
Detecting Heteroskedasticity
Visual Inspection
import matplotlib.pyplot as plt

def plot_variance_comparison(groups, group_names):
    """Visualize variance differences across groups."""
    fig, axes = plt.subplots(1, 2, figsize=(12, 5))
    # Boxplots (tick labels are set separately because boxplot's `labels`
    # keyword is deprecated in recent Matplotlib releases)
    axes[0].boxplot(groups)
    axes[0].set_xticklabels(group_names)
    axes[0].set_ylabel('Value')
    axes[0].set_title('Boxplots by Group')
    # Variance bar chart (sample variance, ddof=1)
    variances = [np.var(g, ddof=1) for g in groups]
    axes[1].bar(group_names, variances)
    axes[1].set_ylabel('Variance')
    axes[1].set_title('Variance by Group')
    plt.tight_layout()
    return fig
Levene's Test
from scipy.stats import levene

def check_homogeneity(groups):
    """
    Test for equal variances using Levene's test.
    """
    # Levene's test (robust to non-normality)
    stat, p = levene(*groups, center='median')
    variances = [np.var(g, ddof=1) for g in groups]
    var_ratio = max(variances) / min(variances)
    return {
        'levene_stat': stat,
        'p_value': p,
        'variance_ratio': var_ratio,
        'variances': variances,
        'conclusion': 'Unequal variances' if p < 0.05 else 'Equal variances (cannot reject)'
    }

# Example
group1 = np.random.normal(0, 1, 30)
group2 = np.random.normal(0, 3, 30)  # Much higher variance
group3 = np.random.normal(0, 2, 30)
result = check_homogeneity([group1, group2, group3])
print(f"Levene's test: p = {result['p_value']:.4f}")
print(f"Variance ratio: {result['variance_ratio']:.2f}")
print(f"Conclusion: {result['conclusion']}")
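For intuition about what the median-centered Levene test computes, the sketch below checks that it matches an ordinary one-way ANOVA run on each observation's absolute deviation from its own group median. The data and seed are arbitrary assumptions chosen only for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only
samples = [rng.normal(0, sd, 30) for sd in (1, 3, 2)]  # hypothetical data

# Absolute deviation of each observation from its own group's median
abs_dev = [np.abs(g - np.median(g)) for g in samples]

# One-way ANOVA on the deviations vs. SciPy's median-centered Levene test
f_on_dev = stats.f_oneway(*abs_dev)
lev = stats.levene(*samples, center='median')

print(f"ANOVA on |x - median|: F = {f_on_dev.statistic:.4f}, p = {f_on_dev.pvalue:.4f}")
print(f"scipy.stats.levene:    W = {lev.statistic:.4f}, p = {lev.pvalue:.4f}")
```

The two lines should agree, which shows that the test asks whether the typical spread around the center differs between groups.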
Rule of Thumb
| Variance Ratio | Assessment |
|---|---|
| < 2 | Usually fine |
| 2-3 | Borderline—consider robust methods |
| 3-4 | Problematic—use robust methods |
| > 4 | Definitely use robust methods |
Solution 1: Welch's ANOVA
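Welch's ANOVA doesn't assume equal variances. Instead of pooling, it weights each group mean by $n_i / s_i^2$ and refers a modified F statistic to an F distribution with adjusted denominator degrees of freedom.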
Python Implementation
from scipy.stats import alexandergovern

def welch_anova(groups):
    """
    Heteroskedasticity-robust omnibus test.

    Note: scipy.stats.alexandergovern runs the Alexander-Govern test,
    a robust alternative to Welch's ANOVA rather than Welch's F-test
    itself, so its statistic is not directly comparable to an F value.
    """
    result = alexandergovern(*groups)
    return {
        'statistic': result.statistic,
        'p_value': result.pvalue
    }

# Compare standard ANOVA and the robust test
standard = stats.f_oneway(group1, group2, group3)
welch = welch_anova([group1, group2, group3])
print("With unequal variances:")
print(f" Standard ANOVA: F = {standard.statistic:.2f}, p = {standard.pvalue:.4f}")
print(f" Robust test (Alexander-Govern): stat = {welch['statistic']:.2f}, p = {welch['p_value']:.4f}")
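Because `scipy.stats.alexandergovern` implements the Alexander-Govern test, which is a different procedure from Welch's (1951) F-test, it can help to see the Welch statistic written out. The following is a minimal sketch of the textbook formula. The function name `welch_anova_manual` and the example data are assumptions for illustration, and for real analyses a maintained implementation is preferable.

```python
import numpy as np
from scipy import stats

def welch_anova_manual(groups):
    """Welch's heteroskedasticity-robust one-way ANOVA (illustrative sketch)."""
    k = len(groups)
    n = np.array([len(g) for g in groups], dtype=float)
    means = np.array([np.mean(g) for g in groups])
    variances = np.array([np.var(g, ddof=1) for g in groups])

    w = n / variances                          # precision weights n_i / s_i^2
    grand_mean = np.sum(w * means) / np.sum(w)  # variance-weighted grand mean
    lam = np.sum((1 - w / np.sum(w)) ** 2 / (n - 1))

    numerator = np.sum(w * (means - grand_mean) ** 2) / (k - 1)
    denominator = 1 + 2 * (k - 2) / (k ** 2 - 1) * lam
    f_stat = numerator / denominator

    df_num = k - 1
    df_den = (k ** 2 - 1) / (3 * lam)          # adjusted denominator df
    p_value = stats.f.sf(f_stat, df_num, df_den)
    return {'statistic': f_stat, 'df_num': df_num,
            'df_den': df_den, 'p_value': p_value}

# Hypothetical data with clearly unequal spreads
rng = np.random.default_rng(42)  # arbitrary seed
data = [rng.normal(0, 1, 30), rng.normal(0, 3, 30), rng.normal(0, 2, 30)]
welch_result = welch_anova_manual(data)
print(f"Welch F({welch_result['df_num']}, {welch_result['df_den']:.1f}) = "
      f"{welch_result['statistic']:.2f}, p = {welch_result['p_value']:.4f}")
```

With exactly two groups, this statistic reduces to the square of Welch's t statistic, which is a convenient sanity check on the formula.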
R Implementation
# Welch's ANOVA
oneway.test(value ~ group, data = df, var.equal = FALSE)
When to Use
- Levene's test significant (p < 0.05)
- Variance ratio > 3
- Unequal sample sizes (compounds the problem)
- As a default (it gives up little power when variances are equal and protects the Type I error rate when they are not)
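The case for a robust default can be checked with a small simulation under the null hypothesis. This sketch reuses the "larger variance in smaller group" scenario and uses `scipy.stats.alexandergovern` as the robust omnibus test, as the Python code in this section does. The function name `type1_rates`, the seed, and the reduced simulation count are assumptions chosen to keep the run short.

```python
import numpy as np
from scipy import stats

def type1_rates(ns, variances, n_sims=2000, alpha=0.05, seed=0):
    """Empirical rejection rates under the null for standard and robust tests."""
    rng = np.random.default_rng(seed)
    reject_std = 0
    reject_robust = 0
    for _ in range(n_sims):
        # All group means are 0, so every rejection is a false positive
        groups = [rng.normal(0, np.sqrt(v), n) for n, v in zip(ns, variances)]
        reject_std += int(stats.f_oneway(*groups).pvalue < alpha)
        reject_robust += int(stats.alexandergovern(*groups).pvalue < alpha)
    return reject_std / n_sims, reject_robust / n_sims

# Smallest group (n = 15) has the largest variance (4)
std_rate, robust_rate = type1_rates([15, 30, 45], [4, 1, 1])
print(f"Standard ANOVA: Type I error = {std_rate:.3f}")
print(f"Robust test:    Type I error = {robust_rate:.3f}")
```

The robust test should stay near the nominal 5% level while standard ANOVA rejects noticeably more often; exact figures vary with the seed and number of simulations.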
Solution 2: Games-Howell Post-Hoc
For pairwise comparisons after Welch's ANOVA, Games-Howell doesn't assume equal variances.
How It Works
Games-Howell is essentially Tukey's HSD but:
- Uses separate variance estimates for each comparison
- Adjusts degrees of freedom (Welch-Satterthwaite)
- Resembles running pairwise Welch's t-tests, with the family-wise correction supplied by the studentized range distribution
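The three points above can be sketched directly with NumPy and `scipy.stats.studentized_range` (available in SciPy 1.7 and later). The function name `games_howell_manual` and the example data are illustrative assumptions. This is a sketch of the standard textbook procedure, not the implementation used by any particular package.

```python
from itertools import combinations

import numpy as np
from scipy import stats

def games_howell_manual(groups, group_names):
    """Games-Howell pairwise comparisons (illustrative sketch)."""
    k = len(groups)
    rows = []
    for (i, gi), (j, gj) in combinations(enumerate(groups), 2):
        ni, nj = len(gi), len(gj)
        # Separate squared standard errors for each group in the pair
        vi = np.var(gi, ddof=1) / ni
        vj = np.var(gj, ddof=1) / nj
        diff = np.mean(gi) - np.mean(gj)
        se = np.sqrt(vi + vj)
        # Welch-Satterthwaite degrees of freedom for this pair
        df = (vi + vj) ** 2 / (vi ** 2 / (ni - 1) + vj ** 2 / (nj - 1))
        # Convert the Welch t statistic to a studentized range statistic
        q = np.abs(diff) / se * np.sqrt(2)
        p_value = stats.studentized_range.sf(q, k, df)
        rows.append({'pair': (group_names[i], group_names[j]),
                     'mean_diff': diff, 'df': df, 'p_value': p_value})
    return rows

# Hypothetical data with unequal spreads
rng = np.random.default_rng(42)  # arbitrary seed
data = [rng.normal(50, 5, 25), rng.normal(55, 15, 25), rng.normal(52, 10, 25)]
gh_rows = games_howell_manual(data, ['A', 'B', 'C'])
for row in gh_rows:
    print(f"{row['pair'][0]} vs {row['pair'][1]}: diff = {row['mean_diff']:.2f}, "
          f"df = {row['df']:.1f}, p = {row['p_value']:.4f}")
```

With only two groups, the studentized range p-value coincides with the two-sided Welch t-test p-value, so the family-wise adjustment only starts to bite from three groups upward.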
Python Implementation
import scikit_posthocs as sp
import pandas as pd

def games_howell_test(groups, group_names):
    """
    Pairwise comparisons with unequal variances.

    Note: this runs pairwise Welch t-tests with a Holm correction as a
    stand-in for Games-Howell; the classical Games-Howell procedure uses
    the studentized range distribution rather than Holm.
    """
    all_data = np.concatenate(groups)
    labels = np.repeat(group_names, [len(g) for g in groups])
    df = pd.DataFrame({'value': all_data, 'group': labels})
    # Pairwise Welch t-tests with Holm correction
    result = sp.posthoc_ttest(
        df, val_col='value', group_col='group',
        equal_var=False, p_adjust='holm'
    )
    return result

result = games_howell_test([group1, group2, group3], ['A', 'B', 'C'])
print("Pairwise Welch t-test results (Holm-adjusted p-values):")
print(result)
R Implementation
library(rstatix)
games_howell_test(df, value ~ group)
# Or
library(PMCMRplus)
gamesHowellTest(value ~ group, data = df)
Complete Robust Workflow
def robust_multigroup_comparison(groups, group_names, alpha=0.05):
    """
    Complete workflow for comparing groups with potential heteroskedasticity.
    """
    print("=" * 60)
    print("ROBUST MULTI-GROUP COMPARISON")
    print("=" * 60)
    # 1. Descriptive statistics
    print("\n1. DESCRIPTIVE STATISTICS")
    print("-" * 40)
    for name, g in zip(group_names, groups):
        print(f" {name}: n={len(g)}, M={np.mean(g):.2f}, "
              f"SD={np.std(g, ddof=1):.2f}, Var={np.var(g, ddof=1):.2f}")
    # 2. Check variance homogeneity
    print("\n2. VARIANCE HOMOGENEITY CHECK")
    print("-" * 40)
    homo = check_homogeneity(groups)
    print(f" Levene's test: p = {homo['p_value']:.4f}")
    print(f" Variance ratio: {homo['variance_ratio']:.2f}")
    # 3. Omnibus test (always use a robust test; welch_anova is defined above)
    print("\n3. OMNIBUS TEST (robust to unequal variances)")
    print("-" * 40)
    welch = welch_anova(groups)
    print(f" Statistic = {welch['statistic']:.2f}, p = {welch['p_value']:.4f}")
    if welch['p_value'] >= alpha:
        print(f"\n Result: Not significant (p >= {alpha})")
        print(" No post-hoc comparisons warranted.")
        return
    # 4. Post-hoc (pairwise Welch t-tests with Holm correction)
    print("\n4. POST-HOC COMPARISONS (unequal variances)")
    print("-" * 40)
    print(" Adjusted p-values:")
    gh_result = games_howell_test(groups, group_names)
    print(gh_result.to_string())
    # 5. Summary
    print("\n5. SUMMARY")
    print("-" * 40)
    significant_pairs = []
    for i, name_i in enumerate(group_names):
        for j, name_j in enumerate(group_names):
            if i < j:
                # Look up by label so the result does not depend on row order
                p = gh_result.loc[name_i, name_j]
                if p < alpha:
                    significant_pairs.append(f"{name_i} vs {name_j}")
    if significant_pairs:
        print(f" Significant differences: {', '.join(significant_pairs)}")
    else:
        print(" No significant pairwise differences found.")

# Example usage
np.random.seed(42)
control = np.random.normal(50, 5, 25)       # Low variance
treatment_a = np.random.normal(55, 15, 25)  # High variance
treatment_b = np.random.normal(52, 10, 25)  # Medium variance
robust_multigroup_comparison(
    [control, treatment_a, treatment_b],
    ['Control', 'Treatment A', 'Treatment B']
)
Comparison: Standard vs. Robust Methods
def compare_methods(groups, group_names):
    """Compare standard and robust approaches."""
    print("Method Comparison:")
    print("-" * 50)
    # Standard ANOVA + Tukey's
    f_stat, anova_p = stats.f_oneway(*groups)
    print(f"Standard ANOVA: F = {f_stat:.2f}, p = {anova_p:.4f}")
    # Robust omnibus test (Alexander-Govern, not Welch's F itself)
    welch = alexandergovern(*groups)
    print(f"Robust test (Alexander-Govern): stat = {welch.statistic:.2f}, p = {welch.pvalue:.4f}")
    # Post-hoc comparison (if significant)
    if anova_p < 0.05 or welch.pvalue < 0.05:
        all_data = np.concatenate(groups)
        labels = np.repeat(group_names, [len(g) for g in groups])
        # Tukey's
        from statsmodels.stats.multicomp import pairwise_tukeyhsd
        tukey = pairwise_tukeyhsd(all_data, labels)
        # Pairwise Welch t-tests with Holm correction
        df = pd.DataFrame({'value': all_data, 'group': labels})
        gh = sp.posthoc_ttest(df, val_col='value', group_col='group',
                              equal_var=False, p_adjust='holm')
        # Tukey's output orders groups alphabetically, so pick the same pair by label
        first, second = sorted(set(group_names))[:2]
        print(f"\nPost-hoc comparison ({first} vs {second}):")
        print(f" Tukey's HSD: p = {tukey.pvalues[0]:.4f}")
        print(f" Welch t-test (Holm): p = {gh.loc[first, second]:.4f}")
Related Methods
- Comparing More Than Two Groups — The pillar guide
- Post-Hoc Tests Decision Tree — Choosing the right test
- Comparing Variances: Levene, Bartlett, F-Test — Testing for equal variances
Key Takeaway
Heteroskedasticity (unequal variances) is more damaging to ANOVA validity than non-normality. When variances differ across groups, use Welch's ANOVA for the omnibus test and Games-Howell for post-hoc comparisons. These robust methods perform well even when variances happen to be equal, so they can be used as defaults.
References
- https://www.jstor.org/stable/2529310
- https://www.jstor.org/stable/2684452
- Welch, B. L. (1951). On the comparison of several mean values: an alternative approach. *Biometrika*, 38(3/4), 330-336.
- Games, P. A., & Howell, J. F. (1976). Pairwise multiple comparison procedures with unequal n's and/or variances. *Journal of Educational Statistics*, 1(2), 113-125.
- Tomarken, A. J., & Serlin, R. C. (1986). Comparison of ANOVA alternatives under variance heterogeneity and specific noncentrality structures. *Psychological Bulletin*, 99(1), 90-99.
Frequently Asked Questions
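How much variance difference is problematic?
As a rule of thumb, a largest-to-smallest variance ratio above about 3 is problematic and a ratio between 2 and 3 is borderline. The damage is worse when group sizes are also unequal, especially when the smaller groups have the larger variances.
Does Games-Howell require equal sample sizes?
No. It estimates a separate variance and separate degrees of freedom for each pair, so it accommodates unequal group sizes as well as unequal variances.
Should I always use Welch's ANOVA?
It is a reasonable default: it performs well even when variances happen to be equal and protects the Type I error rate when they are not.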