Equal Variance and Welch's T-Test: When It Actually Matters
A deep dive into the equal variance assumption for t-tests and ANOVA. Learn when violations are problematic, how to detect them, and why Welch's correction should be your default.
Quick Hits
- Unequal variance matters more than non-normality for t-tests and ANOVA
- The problem is worst when smaller groups have larger variance
- Welch's t-test performs well whether variances are equal or not
- Variance ratios above 3 warrant concern; above 4, definitely use Welch
TL;DR
Equal variance (homoscedasticity) matters more than normality for t-tests and ANOVA. When variances differ and sample sizes are unequal, the standard t-test's Type I error rate can be inflated or deflated depending on which group has larger variance. Welch's t-test corrects this and performs almost identically to Student's t-test when variances are equal. Make Welch your default.
Why Equal Variance Matters
The Pooled Variance Problem
The standard t-test pools variance from both groups:
$$s_p^2 = \frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1 + n_2 - 2}$$
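Pooling assumes both groups share a single population variance ($\sigma_1^2 = \sigma_2^2$), so that $s_1^2$ and $s_2^2$ are two estimates of the same quantity. When the true variances differ, the pooled estimate is a sample-size-weighted average that describes neither group, and the standard error built from it is miscalibrated whenever the group sizes differ.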
import numpy as np
from scipy import stats
import pandas as pd

def demonstrate_pooling_problem():
    """
    Show how pooled variance goes wrong with unequal variances.
    """
    np.random.seed(42)
    # Two groups with very different variances
    group1 = np.random.normal(50, 5, 30)   # SD = 5
    group2 = np.random.normal(50, 20, 30)  # SD = 20
    var1, var2 = np.var(group1, ddof=1), np.var(group2, ddof=1)
    # Pooled variance
    n1, n2 = len(group1), len(group2)
    pooled_var = ((n1-1)*var1 + (n2-1)*var2) / (n1 + n2 - 2)
    print("Variance comparison:")
    print(f" Group 1 variance: {var1:.2f}")
    print(f" Group 2 variance: {var2:.2f}")
    print(f" Variance ratio: {max(var1, var2)/min(var1, var2):.2f}")
    print(f" Pooled variance: {pooled_var:.2f}")
    print()
    print("The pooled variance doesn't represent either group well!")

demonstrate_pooling_problem()
The Critical Interaction: Variance × Sample Size
def variance_samplesize_interaction(n_sims=10000):
    """
    Show how the variance-sample size interaction affects Type I error.
    """
    np.random.seed(42)
    scenarios = {
        'Equal var, equal n': {
            'n': [30, 30], 'sd': [10, 10]
        },
        'Unequal var, equal n': {
            'n': [30, 30], 'sd': [10, 30]
        },
        'Larger var in SMALLER group': {
            'n': [15, 45], 'sd': [30, 10]  # Type I error INFLATES
        },
        'Larger var in LARGER group': {
            'n': [45, 15], 'sd': [30, 10]  # Test becomes CONSERVATIVE
        }
    }
    results = {}
    for name, params in scenarios.items():
        student_reject = 0
        welch_reject = 0
        for _ in range(n_sims):
            # Both groups have same mean (null is true)
            g1 = np.random.normal(50, params['sd'][0], params['n'][0])
            g2 = np.random.normal(50, params['sd'][1], params['n'][1])
            _, p_student = stats.ttest_ind(g1, g2, equal_var=True)
            _, p_welch = stats.ttest_ind(g1, g2, equal_var=False)
            if p_student < 0.05:
                student_reject += 1
            if p_welch < 0.05:
                welch_reject += 1
        results[name] = {
            'Student': student_reject / n_sims,
            'Welch': welch_reject / n_sims,
            'var_ratio': max(params['sd'])**2 / min(params['sd'])**2,
            'n_ratio': max(params['n']) / min(params['n'])
        }
    print("Type I Error Rates (nominal α = 0.05)")
    print("=" * 65)
    print(f"{'Scenario':<35} {'Student':>10} {'Welch':>10} {'Var Ratio':>8}")
    print("-" * 65)
    for name, res in results.items():
        # Flag scenarios where Student's error rate is far from nominal
        marker = " ⚠️" if abs(res['Student'] - 0.05) > 0.02 else ""
        print(f"{name:<35} {res['Student']:>10.3f} {res['Welch']:>10.3f} "
              f"{res['var_ratio']:>8.1f}{marker}")
    return results

variance_samplesize_interaction()
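One way to see why the direction of the error flips is to compare the two standard errors directly. The sketch below reuses the pooled-variance formula from earlier and the group sizes and SDs from the two unequal-n scenarios, treating the SDs as known population values (an assumption made here to keep the arithmetic deterministic). The helper names `pooled_se` and `unpooled_se` are illustrative, not part of any library.

```python
import math

def pooled_se(n1, sd1, n2, sd2):
    """Standard error of the mean difference using the pooled variance (Student)."""
    pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2)
    return math.sqrt(pooled_var * (1 / n1 + 1 / n2))

def unpooled_se(n1, sd1, n2, sd2):
    """Standard error using each group's own variance (Welch)."""
    return math.sqrt(sd1**2 / n1 + sd2**2 / n2)

# (n1, sd1, n2, sd2) taken from the simulation scenarios
settings = {
    "Larger var in SMALLER group": (15, 30, 45, 10),
    "Larger var in LARGER group": (45, 30, 15, 10),
}
for label, args in settings.items():
    print(f"{label:<30} pooled SE = {pooled_se(*args):.2f}, "
          f"unpooled SE = {unpooled_se(*args):.2f}")
```

When the smaller group is the noisier one, pooling gives a standard error of roughly 5.1 against an unpooled value of roughly 7.9, so the t-statistic is too large and the test rejects too often. With the sizes swapped, the pooled value is the larger of the two and the test becomes conservative.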
Detecting Unequal Variance
Visual Inspection
import matplotlib.pyplot as plt

def plot_variance_comparison(groups, group_names):
    """
    Visual comparison of group variances.
    """
    fig, axes = plt.subplots(1, 3, figsize=(15, 5))
    # Box plots (tick labels are set separately because the boxplot
    # `labels` keyword is deprecated in recent Matplotlib releases)
    axes[0].boxplot(groups)
    axes[0].set_xticklabels(group_names)
    axes[0].set_ylabel('Value')
    axes[0].set_title('Box Plots (compare spread)')
    # Variance bar chart
    variances = [np.var(g, ddof=1) for g in groups]
    axes[1].bar(group_names, variances, color=['steelblue', 'darkorange'])
    axes[1].set_ylabel('Variance')
    axes[1].set_title(f'Variances (ratio = {max(variances)/min(variances):.2f})')
    # Sample sizes
    ns = [len(g) for g in groups]
    axes[2].bar(group_names, ns, color=['steelblue', 'darkorange'])
    axes[2].set_ylabel('Sample Size')
    axes[2].set_title('Sample Sizes')
    plt.tight_layout()
    return fig
Levene's Test
from scipy.stats import levene

def test_equal_variance(group1, group2, alpha=0.05):
    """
    Test for equal variances with practical interpretation.
    """
    # Levene's test (robust to non-normality)
    stat, p = levene(group1, group2, center='median')
    # Variance ratio
    var1, var2 = np.var(group1, ddof=1), np.var(group2, ddof=1)
    var_ratio = max(var1, var2) / min(var1, var2)
    # Sample sizes
    n1, n2 = len(group1), len(group2)
    n_ratio = max(n1, n2) / min(n1, n2)
    # Assess severity
    if var_ratio < 2:
        severity = 'Low'
        recommendation = 'Standard t-test OK, but Welch is safe'
    elif var_ratio < 3:
        severity = 'Moderate'
        recommendation = 'Use Welch\'s t-test'
    elif var_ratio < 4:
        severity = 'Substantial'
        recommendation = 'Definitely use Welch\'s t-test'
    else:
        severity = 'Severe'
        recommendation = 'Use Welch; consider transformation or rank test'
    # Extra warning for dangerous combination
    if var_ratio > 2 and n_ratio > 1.5:
        # Determine which group is smaller with larger variance
        smaller_group = 1 if n1 < n2 else 2
        larger_var_group = 1 if var1 > var2 else 2
        if smaller_group == larger_var_group:
            recommendation += '\n ⚠️ WARNING: Smaller group has larger variance - Type I error inflated!'
    return {
        'levene_stat': stat,
        'levene_p': p,
        'var_ratio': var_ratio,
        'variances': (var1, var2),
        'sample_sizes': (n1, n2),
        'n_ratio': n_ratio,
        'severity': severity,
        'recommendation': recommendation
    }

# Example
np.random.seed(42)
g1 = np.random.normal(50, 5, 20)   # Smaller n, smaller variance
g2 = np.random.normal(50, 15, 50)  # Larger n, larger variance
result = test_equal_variance(g1, g2)
print("Variance Equality Assessment:")
print("-" * 40)
print(f"Variances: {result['variances'][0]:.2f}, {result['variances'][1]:.2f}")
print(f"Variance ratio: {result['var_ratio']:.2f}")
print(f"Sample sizes: {result['sample_sizes']}")
print(f"Levene's test: p = {result['levene_p']:.4f}")
print(f"Severity: {result['severity']}")
print(f"Recommendation: {result['recommendation']}")
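It may help to see what `levene(..., center='median')` is doing under the hood. The median-centered version (often called the Brown-Forsythe variant) is a one-way ANOVA on each observation's absolute deviation from its own group median. The sketch below checks that equivalence on simulated data; the seed and group parameters are arbitrary choices for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
a = rng.normal(50, 5, 20)
b = rng.normal(50, 15, 50)

# Absolute deviations from each group's own median
dev_a = np.abs(a - np.median(a))
dev_b = np.abs(b - np.median(b))

# One-way ANOVA on the deviations vs. SciPy's median-centered Levene test
f_stat, f_p = stats.f_oneway(dev_a, dev_b)
lev_stat, lev_p = stats.levene(a, b, center='median')

print(f"ANOVA on |x - median|: F = {f_stat:.4f}, p = {f_p:.4f}")
print(f"Levene (median):       W = {lev_stat:.4f}, p = {lev_p:.4f}")
```

Because the test compares mean absolute deviations rather than variances directly, it is less sensitive to heavy tails than a ratio-of-variances F-test.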
Rule of Thumb Table
| Variance Ratio | With Equal n | With Unequal n |
|---|---|---|
| < 2 | Fine | Usually fine |
| 2-3 | Fine | Consider Welch |
| 3-4 | Consider Welch | Definitely Welch |
| > 4 | Use Welch | Welch + caution |
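The table can be encoded as a small lookup, sketched below. Two details are assumptions rather than part of the table: values falling exactly on a boundary (2, 3, 4) are sent to the higher row, and "equal n" is taken to mean a size ratio of at most `n_tolerance`, a hypothetical cutoff introduced only for this example.

```python
def rule_of_thumb(var_ratio, n1, n2, n_tolerance=1.1):
    """Return the table's advice for a variance ratio and two sample sizes.

    n_tolerance is a hypothetical cutoff for treating sample sizes as equal.
    """
    equal_n = max(n1, n2) / min(n1, n2) <= n_tolerance
    # (upper bound of the variance ratio, advice with equal n, advice with unequal n)
    rows = [
        (2, "Fine", "Usually fine"),
        (3, "Fine", "Consider Welch"),
        (4, "Consider Welch", "Definitely Welch"),
    ]
    for upper, equal_advice, unequal_advice in rows:
        if var_ratio < upper:
            return equal_advice if equal_n else unequal_advice
    # Variance ratio of 4 or more
    return "Use Welch" if equal_n else "Welch + caution"

print(rule_of_thumb(2.5, 30, 30))  # equal n, ratio in the 2-3 row
print(rule_of_thumb(9.0, 15, 45))  # unequal n, ratio above 4
```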
Welch's T-Test: The Solution
How It Works
Welch's t-test doesn't pool variances. It uses separate variance estimates:
$$t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$$
With Welch-Satterthwaite degrees of freedom:
$$df = \frac{\left(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\right)^2}{\frac{(s_1^2/n_1)^2}{n_1-1} + \frac{(s_2^2/n_2)^2}{n_2-1}}$$
def compare_tests(group1, group2):
    """
    Compare Student's and Welch's t-test.
    """
    # Student's t-test
    t_student, p_student = stats.ttest_ind(group1, group2, equal_var=True)
    df_student = len(group1) + len(group2) - 2
    # Welch's t-test
    t_welch, p_welch = stats.ttest_ind(group1, group2, equal_var=False)
    # Calculate Welch df
    var1, var2 = np.var(group1, ddof=1), np.var(group2, ddof=1)
    n1, n2 = len(group1), len(group2)
    num = (var1/n1 + var2/n2)**2
    denom = (var1/n1)**2/(n1-1) + (var2/n2)**2/(n2-1)
    df_welch = num / denom
    print("Test Comparison:")
    print("-" * 50)
    print(f"{'':20} {'Student':>12} {'Welch':>12}")
    print("-" * 50)
    print(f"{'t-statistic':20} {t_student:>12.4f} {t_welch:>12.4f}")
    print(f"{'df':20} {df_student:>12.1f} {df_welch:>12.1f}")
    print(f"{'p-value':20} {p_student:>12.4f} {p_welch:>12.4f}")
    # Variance info
    print(f"\nVariance ratio: {max(var1, var2)/min(var1, var2):.2f}")
    print(f"Sample sizes: {n1}, {n2}")

# Example with very unequal variances
np.random.seed(42)
g1 = np.random.normal(50, 5, 25)
g2 = np.random.normal(55, 20, 40)
compare_tests(g1, g2)
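As a check on the two formulas above, the statistic, degrees of freedom and p-value can be computed by hand and compared with SciPy. This is a minimal sketch: `welch_by_hand` is an illustrative name, and the simulated data use an arbitrary seed.

```python
import numpy as np
from scipy import stats

def welch_by_hand(x, y):
    """Welch's t, Welch-Satterthwaite df, and two-sided p-value from the formulas."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n1, n2 = len(x), len(y)
    v1, v2 = x.var(ddof=1) / n1, y.var(ddof=1) / n2  # s_i^2 / n_i
    t = (x.mean() - y.mean()) / np.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))
    p = 2 * stats.t.sf(abs(t), df)
    return t, df, p

rng = np.random.default_rng(1)
x = rng.normal(50, 5, 25)
y = rng.normal(55, 20, 40)

t, df, p = welch_by_hand(x, y)
t_ref, p_ref = stats.ttest_ind(x, y, equal_var=False)
print(f"by hand: t = {t:.4f}, df = {df:.1f}, p = {p:.4f}")
print(f"scipy:   t = {t_ref:.4f}, p = {p_ref:.4f}")
```

The Welch-Satterthwaite df always lies between `min(n1, n2) - 1` and `n1 + n2 - 2`. It approaches the lower end when one group's term dominates the standard error and reaches the upper end when the two `s_i^2 / n_i` terms are in proportion to `n_i - 1`.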
Why Welch Should Be Default
def welch_as_default_justification(n_sims=10000):
    """
    Show that Welch never does much worse and often does better.
    """
    np.random.seed(42)
    scenarios = [
        # (n1, n2, sd1, sd2, true_diff)
        (30, 30, 10, 10, 5),  # Equal var, equal n
        (30, 30, 10, 30, 5),  # Unequal var, equal n
        (15, 45, 30, 10, 5),  # Smaller group, larger var
        (45, 15, 30, 10, 5),  # Larger group, larger var
    ]
    results = []
    for n1, n2, sd1, sd2, diff in scenarios:
        student_power = 0
        welch_power = 0
        for _ in range(n_sims):
            g1 = np.random.normal(50, sd1, n1)
            g2 = np.random.normal(50 + diff, sd2, n2)
            _, p_student = stats.ttest_ind(g1, g2, equal_var=True)
            _, p_welch = stats.ttest_ind(g1, g2, equal_var=False)
            if p_student < 0.05:
                student_power += 1
            if p_welch < 0.05:
                welch_power += 1
        results.append({
            'scenario': f'n=({n1},{n2}), SD=({sd1},{sd2})',
            'Student_power': student_power / n_sims,
            'Welch_power': welch_power / n_sims,
            'var_ratio': max(sd1, sd2)**2 / min(sd1, sd2)**2
        })
    print("Power Comparison (true effect = 5):")
    print("=" * 70)
    print(f"{'Scenario':<30} {'Student':>12} {'Welch':>12} {'Var Ratio':>10}")
    print("-" * 70)
    for r in results:
        diff = r['Welch_power'] - r['Student_power']
        marker = f" ({diff:+.1%})" if abs(diff) > 0.02 else ""
        print(f"{r['scenario']:<30} {r['Student_power']:>12.1%} "
              f"{r['Welch_power']:>12.1%} {r['var_ratio']:>10.1f}{marker}")

welch_as_default_justification()
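Part of the reason Welch costs so little is that with equal group sizes the two tests compute exactly the same t-statistic: the pooled standard error reduces algebraically to the unpooled one when n1 = n2. Only the degrees of freedom differ. The sketch below illustrates this on simulated data matching the "unequal var, equal n" setting; the seed is arbitrary.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
g1 = rng.normal(50, 10, 30)
g2 = rng.normal(55, 30, 30)  # same n, three times the SD

t_student, p_student = stats.ttest_ind(g1, g2, equal_var=True)
t_welch, p_welch = stats.ttest_ind(g1, g2, equal_var=False)

print(f"Student: t = {t_student:.4f}, p = {p_student:.4f}")
print(f"Welch:   t = {t_welch:.4f}, p = {p_welch:.4f}")
```

The t-statistics agree to floating-point precision. Welch's p-value is slightly larger because its degrees of freedom are at most n1 + n2 - 2, which is the small price paid for not assuming equal variances.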
Multiple Groups: Welch's ANOVA
The Extension
For more than two groups, use Welch's ANOVA instead of standard one-way ANOVA.
from scipy.stats import f_oneway, alexandergovern

def compare_anova_methods(*groups):
    """
    Compare standard ANOVA with a heteroscedasticity-robust alternative.
    """
    # Standard one-way ANOVA
    f_stat, p_standard = f_oneway(*groups)
    # Alexander-Govern test: SciPy's heteroscedasticity-robust test of equal means.
    # It is closely related to Welch's ANOVA but is not the same procedure.
    result = alexandergovern(*groups)
    # Variance info
    variances = [np.var(g, ddof=1) for g in groups]
    var_ratio = max(variances) / min(variances)
    print("ANOVA Comparison:")
    print("-" * 40)
    print(f"Standard ANOVA: F = {f_stat:.3f}, p = {p_standard:.4f}")
    print(f"Alexander-Govern: stat = {result.statistic:.3f}, p = {result.pvalue:.4f}")
    print(f"\nVariance ratio: {var_ratio:.2f}")
    print(f"Variances: {[f'{v:.1f}' for v in variances]}")

# Example
np.random.seed(42)
g1 = np.random.normal(50, 5, 30)
g2 = np.random.normal(52, 15, 25)
g3 = np.random.normal(55, 10, 35)
compare_anova_methods(g1, g2, g3)
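For readers who want Welch's ANOVA itself, the statistic is short enough to compute directly. The sketch below follows the standard textbook formula (precision weights w_i = n_i / s_i^2, a weighted grand mean, and an adjusted denominator df); `welch_anova` is an illustrative helper written for this example, not a SciPy function.

```python
import numpy as np
from scipy import stats

def welch_anova(*groups):
    """Welch's one-way ANOVA: returns (F, df1, df2, p)."""
    k = len(groups)
    n = np.array([len(g) for g in groups], dtype=float)
    means = np.array([np.mean(g) for g in groups])
    variances = np.array([np.var(g, ddof=1) for g in groups])
    w = n / variances                        # precision weights
    grand_mean = np.sum(w * means) / np.sum(w)
    lam = np.sum((1 - w / np.sum(w)) ** 2 / (n - 1))
    numerator = np.sum(w * (means - grand_mean) ** 2) / (k - 1)
    denominator = 1 + 2 * (k - 2) / (k**2 - 1) * lam
    f_stat = numerator / denominator
    df1, df2 = k - 1, (k**2 - 1) / (3 * lam)
    return f_stat, df1, df2, stats.f.sf(f_stat, df1, df2)

rng = np.random.default_rng(3)
a = rng.normal(50, 5, 30)
b = rng.normal(52, 15, 25)
c = rng.normal(55, 10, 35)

f_stat, df1, df2, p = welch_anova(a, b, c)
print(f"Welch's ANOVA: F({df1}, {df2:.1f}) = {f_stat:.3f}, p = {p:.4f}")
```

A useful sanity check is that with exactly two groups the result collapses to Welch's t-test: F equals t squared and the p-values match.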
Post-Hoc: Games-Howell
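When using Welch's ANOVA, use Games-Howell for post-hoc comparisons. The function below is a close stand-in rather than Games-Howell itself: it runs pairwise Welch t-tests and applies a Holm correction for multiple comparisons.
def games_howell_comparison(groups, names):
    """
    Pairwise comparisons without equal variance assumption.

    Note: this uses Holm-adjusted pairwise Welch t-tests, which approximate
    but are not identical to the Games-Howell procedure.
    """
    import scikit_posthocs as sp
    all_data = np.concatenate(groups)
    labels = np.repeat(names, [len(g) for g in groups])
    df = pd.DataFrame({'value': all_data, 'group': labels})
    # Welch t-tests with correction
    result = sp.posthoc_ttest(
        df, val_col='value', group_col='group',
        equal_var=False, p_adjust='holm'
    )
    return result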
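Games-Howell proper uses the same Welch standard error and Welch-Satterthwaite df for each pair, but refers the statistic to the studentized range distribution instead of adjusting t-test p-values. A minimal sketch follows, assuming SciPy 1.7 or later for `scipy.stats.studentized_range`; the function name `games_howell` and the tuple layout of its output are choices made for this example.

```python
import itertools
import numpy as np
from scipy import stats

def games_howell(groups, names):
    """Return (name_i, name_j, mean difference, df, p) for every pair of groups."""
    k = len(groups)
    rows = []
    for (i, a), (j, b) in itertools.combinations(enumerate(groups), 2):
        a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
        va, vb = a.var(ddof=1) / len(a), b.var(ddof=1) / len(b)
        diff = a.mean() - b.mean()
        # Welch-Satterthwaite df for this pair
        df = (va + vb) ** 2 / (va**2 / (len(a) - 1) + vb**2 / (len(b) - 1))
        # Studentized range statistic: sqrt(2) times the absolute Welch t
        q = np.sqrt(2) * abs(diff) / np.sqrt(va + vb)
        p = float(stats.studentized_range.sf(q, k, df))
        rows.append((names[i], names[j], diff, df, p))
    return rows

rng = np.random.default_rng(4)
samples = [rng.normal(50, 5, 30), rng.normal(52, 15, 25), rng.normal(55, 10, 35)]
for name_i, name_j, diff, df, p in games_howell(samples, ["A", "B", "C"]):
    print(f"{name_i} vs {name_j}: diff = {diff:+.2f}, df = {df:.1f}, p = {p:.4f}")
```

With only two groups the procedure reduces to Welch's t-test, which gives a quick way to check the implementation.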
Practical Workflow
def complete_variance_workflow(group1, group2, name1='Group 1', name2='Group 2'):
    """
    Complete workflow for handling variance assumption.
    """
    print("=" * 60)
    print("VARIANCE ASSUMPTION WORKFLOW")
    print("=" * 60)
    # 1. Descriptives
    print("\n1. DESCRIPTIVE STATISTICS")
    print("-" * 40)
    for name, g in [(name1, group1), (name2, group2)]:
        print(f"\n{name}:")
        print(f" n = {len(g)}")
        print(f" Mean = {np.mean(g):.3f}")
        print(f" SD = {np.std(g, ddof=1):.3f}")
        print(f" Variance = {np.var(g, ddof=1):.3f}")
    # 2. Variance assessment
    print("\n\n2. VARIANCE ASSESSMENT")
    print("-" * 40)
    var1 = np.var(group1, ddof=1)
    var2 = np.var(group2, ddof=1)
    var_ratio = max(var1, var2) / min(var1, var2)
    stat, p_levene = levene(group1, group2, center='median')
    print(f"Variance ratio: {var_ratio:.2f}")
    print(f"Levene's test: p = {p_levene:.4f}")
    if var_ratio < 2:
        print("Assessment: Variances are reasonably similar")
    elif var_ratio < 4:
        print("Assessment: Moderate variance inequality")
    else:
        print("Assessment: Substantial variance inequality")
    # 3. Both tests
    print("\n\n3. ANALYSIS RESULTS")
    print("-" * 40)
    t_student, p_student = stats.ttest_ind(group1, group2, equal_var=True)
    t_welch, p_welch = stats.ttest_ind(group1, group2, equal_var=False)
    print(f"Student's t: t = {t_student:.3f}, p = {p_student:.4f}")
    print(f"Welch's t: t = {t_welch:.3f}, p = {p_welch:.4f}")
    # 4. Recommendation
    print("\n\n4. RECOMMENDATION")
    print("-" * 40)
    if var_ratio > 2:
        print(f"Use Welch's result: p = {p_welch:.4f}")
        print("(Variance ratio > 2 suggests unequal variances)")
    elif abs(p_student - p_welch) > 0.01:
        print(f"Use Welch's result: p = {p_welch:.4f}")
        print("(Results differ; Welch is more robust)")
    else:
        print(f"Either result is fine: p ≈ {p_welch:.4f}")
        print("(Welch recommended as safe default)")
    # 5. Effect size
    print("\n\n5. EFFECT SIZE")
    print("-" * 40)
    pooled_std = np.sqrt(((len(group1)-1)*var1 + (len(group2)-1)*var2) /
                         (len(group1) + len(group2) - 2))
    cohens_d = (np.mean(group1) - np.mean(group2)) / pooled_std
    print(f"Cohen's d = {cohens_d:.3f}")
    print("\n" + "=" * 60)

# Example
np.random.seed(42)
control = np.random.normal(50, 8, 35)
treatment = np.random.normal(55, 18, 25)
complete_variance_workflow(control, treatment, 'Control', 'Treatment')
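The effect-size step in the workflow standardizes by a pooled SD, which inherits the pooling problem when variances differ. One complement that avoids pooling is a confidence interval for the raw mean difference built from the Welch standard error and Welch-Satterthwaite df. The sketch below is one way to compute it; `welch_ci` is an illustrative helper, and the data mimic the control/treatment example with an arbitrary seed.

```python
import numpy as np
from scipy import stats

def welch_ci(x, y, confidence=0.95):
    """Confidence interval for mean(x) - mean(y) without assuming equal variances."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    v1, v2 = x.var(ddof=1) / len(x), y.var(ddof=1) / len(y)
    diff = x.mean() - y.mean()
    se = np.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1**2 / (len(x) - 1) + v2**2 / (len(y) - 1))
    t_crit = stats.t.ppf(1 - (1 - confidence) / 2, df)
    return diff, diff - t_crit * se, diff + t_crit * se

rng = np.random.default_rng(5)
control_sample = rng.normal(50, 8, 35)
treatment_sample = rng.normal(55, 18, 25)

diff, lower, upper = welch_ci(control_sample, treatment_sample)
print(f"Mean difference = {diff:.2f}, 95% CI [{lower:.2f}, {upper:.2f}]")
```

The interval excludes zero exactly when Welch's two-sided p-value is below 0.05, so it reports the same decision as the test while also showing the plausible size of the difference in the original units.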
R Implementation
variance_workflow <- function(group1, group2) {
  cat("VARIANCE ASSUMPTION WORKFLOW\n")
  # strrep() builds one unbroken rule; cat(rep("=", 50)) would insert spaces
  cat(strrep("=", 50), "\n\n")

  # Descriptives
  cat("1. DESCRIPTIVES\n")
  cat("Group 1: n =", length(group1), ", SD =", round(sd(group1), 3), "\n")
  cat("Group 2: n =", length(group2), ", SD =", round(sd(group2), 3), "\n")

  # Variance ratio
  var_ratio <- max(var(group1), var(group2)) / min(var(group1), var(group2))
  cat("\nVariance ratio:", round(var_ratio, 2), "\n")

  # Levene's test
  library(car)
  df <- data.frame(
    value = c(group1, group2),
    group = factor(rep(c("G1", "G2"), c(length(group1), length(group2))))
  )
  lev <- leveneTest(value ~ group, data = df, center = median)
  cat("Levene's test p-value:", round(lev$`Pr(>F)`[1], 4), "\n")

  # Both tests
  cat("\n2. T-TEST RESULTS\n")
  cat("Student's:", round(t.test(group1, group2, var.equal = TRUE)$p.value, 4), "\n")
  cat("Welch's:", round(t.test(group1, group2, var.equal = FALSE)$p.value, 4), "\n")

  # Recommendation
  cat("\n3. RECOMMENDATION\n")
  if (var_ratio > 2) {
    cat("Use Welch's t-test (variance ratio > 2)\n")
  } else {
    cat("Either test OK, but Welch is safe default\n")
  }
}

# Usage
# group1 <- rnorm(30, 50, 5)
# group2 <- rnorm(25, 52, 15)
# variance_workflow(group1, group2)
Related Methods
- Assumption Checks Master Guide — The pillar article
- Welch's vs. Student's T-Test — Detailed comparison
- Heteroskedastic Groups — Multi-group solutions
- Comparing Variances — Testing variance equality
Key Takeaway
The equal variance assumption matters more than normality for t-tests and ANOVA. When variances are unequal and sample sizes differ, the standard tests drift away from their nominal error rate: Type I error is inflated when the smaller group has the larger variance, and the test turns conservative and loses power in the opposite case. Welch's t-test handles both situations correctly and costs almost nothing when variances are actually equal, so make it your default.