Comparing More Than Two Groups: A Complete Guide
How to compare means, medians, and distributions across three or more groups. Covers ANOVA, Kruskal-Wallis, post-hoc tests, and when each method is appropriate.
Quick Hits
- ANOVA tests whether any group differs; it doesn't tell you which groups differ
- Always follow a significant ANOVA with post-hoc tests to identify specific differences
- Kruskal-Wallis is the non-parametric alternative, testing stochastic dominance, not means
- Welch's ANOVA handles unequal variances; use it as your default
TL;DR
Comparing three or more groups requires a two-stage approach: first test whether any differences exist (omnibus test), then identify which specific groups differ (post-hoc tests). ANOVA handles the first stage for means; Kruskal-Wallis for ranks. For post-hoc comparisons, Tukey's HSD handles all pairwise comparisons, Dunnett's compares to a control, and Games-Howell handles unequal variances.
Why Not Just Run Multiple T-Tests?
With 5 groups, you'd need 10 pairwise t-tests. Each test at α = 0.05 has a 5% false positive rate. Across 10 independent tests (assuming all nulls true):
$$P(\text{at least one false positive}) = 1 - (1-0.05)^{10} \approx 0.40$$
You have a 40% chance of finding at least one "significant" difference by chance alone.
ANOVA provides a single omnibus test controlling the overall Type I error rate.
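As a quick sanity check, that familywise error rate can be computed directly; a minimal sketch (pure arithmetic, no statistics library needed):

from math import comb

k = 5                        # number of groups
m = comb(k, 2)               # 10 pairwise t-tests
alpha = 0.05
fwer = 1 - (1 - alpha) ** m  # assumes the tests are independent
print(f"{m} tests -> familywise error rate = {fwer:.2f}")  # ~0.40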
The Two-Stage Framework
Stage 1: Omnibus Test
Question: "Is there any difference among groups?"
- ANOVA for means
- Kruskal-Wallis for ranks
If p > α: Stop. No evidence of differences. If p < α: Proceed to post-hoc.
Stage 2: Post-Hoc Comparisons
Question: "Which specific groups differ?"
- Tukey's HSD for all pairwise comparisons
- Dunnett's for comparing to control
- Games-Howell for unequal variances
One-Way ANOVA
Tests whether group means differ by comparing between-group variance to within-group variance.
The F-Test
$$F = \frac{\text{Between-group variance}}{\text{Within-group variance}} = \frac{MS_{between}}{MS_{within}}$$
Large F indicates groups differ more than expected from random variation.
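For intuition about the formula, F can be computed straight from its definition before reaching for a library; a minimal sketch (the `f_statistic` helper name is ours, not a library function):

import numpy as np

def f_statistic(groups):
    """One-way ANOVA F-statistic computed from its definition."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = np.mean(np.concatenate(groups))
    # Between-group sum of squares: group means around the grand mean
    ss_between = sum(len(g) * (np.mean(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: observations around their group mean
    ss_within = sum(np.sum((g - np.mean(g)) ** 2) for g in groups)
    ms_between = ss_between / (k - 1)  # df_between = k - 1
    ms_within = ss_within / (n - k)    # df_within = n - k
    return ms_between / ms_within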
Python Implementation
import numpy as np
import pandas as pd
from scipy import stats

def one_way_anova(groups):
    """
    One-way ANOVA for comparing group means.

    groups: list of arrays, one per group
    """
    # F-test
    f_stat, p_value = stats.f_oneway(*groups)

    # Effect size (eta-squared): share of total variance explained by group
    grand_mean = np.mean(np.concatenate(groups))
    ss_between = sum(len(g) * (np.mean(g) - grand_mean)**2 for g in groups)
    ss_total = sum(np.sum((g - grand_mean)**2) for g in groups)
    eta_squared = ss_between / ss_total

    # Per-group descriptive statistics
    group_stats = []
    for i, g in enumerate(groups):
        group_stats.append({
            'group': i + 1,
            'n': len(g),
            'mean': np.mean(g),
            'std': np.std(g, ddof=1)
        })

    return {
        'f_statistic': f_stat,
        'p_value': p_value,
        'eta_squared': eta_squared,
        'group_stats': pd.DataFrame(group_stats)
    }
# Example: comparing 4 treatment groups
np.random.seed(42)
group1 = np.random.normal(10, 2, 30)
group2 = np.random.normal(11, 2, 30)
group3 = np.random.normal(10, 2, 30)
group4 = np.random.normal(13, 2, 30)
result = one_way_anova([group1, group2, group3, group4])
print(f"F-statistic: {result['f_statistic']:.2f}")
print(f"P-value: {result['p_value']:.4f}")
print(f"Eta-squared: {result['eta_squared']:.3f}")
print(result['group_stats'])
R Implementation
# Create data frame
df <- data.frame(
  value = c(group1, group2, group3, group4),
  group = factor(rep(1:4, each = 30))
)
# One-way ANOVA
model <- aov(value ~ group, data = df)
summary(model)
# Effect size
library(effectsize)
eta_squared(model)
Welch's ANOVA
Standard ANOVA assumes equal variances. Welch's ANOVA relaxes this assumption—use it as your default.
def welch_anova(groups):
    """
    Heteroscedasticity-robust omnibus test.

    SciPy has no direct Welch's ANOVA; its alexandergovern() is a
    closely related test that also drops the equal-variance
    assumption and behaves similarly in practice.
    """
    result = stats.alexandergovern(*groups)
    return {
        'statistic': result.statistic,
        'p_value': result.pvalue
    }
# With unequal variances
group1 = np.random.normal(10, 1, 30)  # SD = 1
group2 = np.random.normal(11, 3, 30)  # SD = 3 (different!)
group3 = np.random.normal(10, 2, 30)  # SD = 2

standard = one_way_anova([group1, group2, group3])
robust = welch_anova([group1, group2, group3])

print(f"Standard ANOVA p: {standard['p_value']:.4f}")
print(f"Robust (Alexander-Govern) p: {robust['p_value']:.4f}")
R Implementation
# Welch's ANOVA (var.equal = FALSE is oneway.test's default)
oneway.test(value ~ group, data = df, var.equal = FALSE)
Kruskal-Wallis Test
Non-parametric alternative when normality is questionable or data is ordinal.
def kruskal_wallis(groups):
    """
    Kruskal-Wallis rank-based test for comparing distributions.
    """
    stat, p_value = stats.kruskal(*groups)
    return {
        'h_statistic': stat,
        'p_value': p_value
    }
result = kruskal_wallis([group1, group2, group3, group4])
print(f"H-statistic: {result['h_statistic']:.2f}")
print(f"P-value: {result['p_value']:.4f}")
What It Actually Tests
Like Mann-Whitney, Kruskal-Wallis tests stochastic dominance across groups—whether groups tend to produce systematically higher or lower values—not whether means differ.
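A small sketch (with made-up distributions) of what that distinction means in practice: the skewed group below has the same population mean as the symmetric one but a lower median, so the rank-based test can flag a difference the mean-based F-test misses.

# Contrived example: equal population means, different shapes
rng = np.random.default_rng(0)
symmetric = rng.normal(0, 1, 1000)
skewed = rng.exponential(1, 1000) - 1  # mean 0, median ~ -0.31

# ANOVA compares means (both ~0); Kruskal-Wallis compares ranks,
# so it typically detects the downward median shift here
print("ANOVA:         ", stats.f_oneway(symmetric, skewed))
print("Kruskal-Wallis:", stats.kruskal(symmetric, skewed))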
Post-Hoc Tests
After rejecting the ANOVA null, identify which groups differ.
Tukey's HSD (Honestly Significant Difference)
For all pairwise comparisons with equal variances.
from statsmodels.stats.multicomp import pairwise_tukeyhsd

def tukey_hsd(data, group_labels):
    """
    Tukey's HSD for all pairwise comparisons.

    data: 1D array of all values
    group_labels: 1D array of group membership
    """
    result = pairwise_tukeyhsd(data, group_labels)
    return result
# Create data for Tukey's
all_data = np.concatenate([group1, group2, group3, group4])
labels = np.repeat(['A', 'B', 'C', 'D'], 30)
tukey_result = tukey_hsd(all_data, labels)
print(tukey_result)
R Implementation
# Tukey's HSD
TukeyHSD(model)
Dunnett's Test
For comparing all groups to a single control.
from scipy.stats import dunnett  # requires SciPy >= 1.11

def dunnett_test(control, treatments):
    """
    Dunnett's test: compare each treatment group to a control.
    """
    result = dunnett(*treatments, control=control)
    return result
# Group 1 is control
control = group1
treatments = [group2, group3, group4]
result = dunnett_test(control, treatments)
print(f"Dunnett's test statistics: {result.statistic}")
print(f"P-values: {result.pvalue}")
Games-Howell
For pairwise comparisons with unequal variances.
import scikit_posthocs as sp

def games_howell_approx(data, group_labels):
    """
    Pairwise Welch t-tests with Holm adjustment.

    scikit-posthocs has no Games-Howell routine, so this is a close
    substitute for unequal variances; for Games-Howell proper, see
    pingouin.pairwise_gameshowell.
    """
    df = pd.DataFrame({'value': data, 'group': group_labels})
    result = sp.posthoc_ttest(df, val_col='value', group_col='group',
                              equal_var=False, p_adjust='holm')
    return result

# Using scikit-posthocs
result = games_howell_approx(all_data, labels)
print(result)
Post-Hoc Decision Guide
| Situation | Recommended Test |
|---|---|
| All pairwise, equal variances | Tukey's HSD |
| All pairwise, unequal variances | Games-Howell |
| Compare to control only | Dunnett's |
| Ordered groups (dose-response) | Linear contrast or Jonckheere-Terpstra |
| Non-parametric follow-up | Dunn's test (sketch below) |
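For the table's last row, Dunn's test is available in scikit-posthocs; a minimal sketch reusing `all_data` and `labels` from the Tukey example (the `dunn_df` name is ours):

import scikit_posthocs as sp

dunn_df = pd.DataFrame({'value': all_data, 'group': labels})
# Matrix of pairwise p-values, Holm-adjusted for multiple comparisons
print(sp.posthoc_dunn(dunn_df, val_col='value', group_col='group',
                      p_adjust='holm'))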
Effect Sizes
Always report effect sizes alongside p-values.
Eta-Squared (η²)
Proportion of variance explained by group membership.
$$\eta^2 = \frac{SS_{between}}{SS_{total}}$$
| η² | Interpretation |
|---|---|
| 0.01 | Small |
| 0.06 | Medium |
| 0.14 | Large |
Omega-Squared (ω²)
Less biased estimate of population effect size.
$$\omega^2 = \frac{SS_{between} - (k-1)MS_{within}}{SS_{total} + MS_{within}}$$
def omega_squared(groups):
    """Calculate omega-squared effect size (less biased than eta-squared)."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = np.mean(np.concatenate(groups))
    ss_between = sum(len(g) * (np.mean(g) - grand_mean)**2 for g in groups)
    ss_within = sum(np.sum((g - np.mean(g))**2) for g in groups)
    ss_total = ss_between + ss_within
    ms_within = ss_within / (n_total - k)
    omega_sq = (ss_between - (k - 1) * ms_within) / (ss_total + ms_within)
    return omega_sq
omega = omega_squared([group1, group2, group3, group4])
print(f"Omega-squared: {omega:.3f}")
Assumptions and Diagnostics
ANOVA Assumptions
- Independence: Observations are independent
- Normality: Data within each group is approximately normal
- Homogeneity of variance: Equal variances across groups
Checking Assumptions
from scipy.stats import levene, shapiro

def anova_diagnostics(groups):
    """Check ANOVA assumptions: equal variances and per-group normality."""
    # Levene's test for equal variances
    levene_stat, levene_p = levene(*groups)

    # Shapiro-Wilk for normality (valid for 3 to 5000 observations per group)
    normality_tests = []
    for i, g in enumerate(groups):
        if 3 <= len(g) <= 5000:
            stat, p = shapiro(g)
            normality_tests.append({'group': i + 1, 'statistic': stat, 'p_value': p})

    return {
        'levene': {'statistic': levene_stat, 'p_value': levene_p},
        'normality': pd.DataFrame(normality_tests)
    }
diagnostics = anova_diagnostics([group1, group2, group3, group4])
print(f"Levene's test p-value: {diagnostics['levene']['p_value']:.4f}")
print("Normality tests:")
print(diagnostics['normality'])
When Assumptions Fail
| Violation | Solution |
|---|---|
| Unequal variances | Use Welch's ANOVA + Games-Howell |
| Non-normality | Kruskal-Wallis + Dunn's test, or a permutation/bootstrap approach (sketch below) |
| Outliers | Robust ANOVA, trimmed means |
| Non-independence | Mixed models, cluster adjustments |
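For the non-normality row, here is one resampling option: a permutation test on the F-statistic, close in spirit to the bootstrap mentioned in the table. A minimal sketch; the `permutation_anova` name is ours:

def permutation_anova(groups, n_perm=5000, seed=0):
    """Permutation p-value for the ANOVA F-statistic.

    Shuffles group membership to build the null distribution of F,
    so no normality assumption is needed.
    """
    rng = np.random.default_rng(seed)
    split_points = np.cumsum([len(g) for g in groups])[:-1]
    pooled = np.concatenate(groups)
    f_obs, _ = stats.f_oneway(*groups)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        f_perm, _ = stats.f_oneway(*np.split(pooled, split_points))
        if f_perm >= f_obs:
            exceed += 1
    return f_obs, (exceed + 1) / (n_perm + 1)

f_obs, p_perm = permutation_anova([group1, group2, group3, group4])
print(f"Permutation p-value: {p_perm:.4f}")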
Practical Workflow
def full_group_comparison(groups, group_names=None, alpha=0.05):
    """
    Complete workflow for comparing multiple groups.
    """
    if group_names is None:
        group_names = [f'Group{i+1}' for i in range(len(groups))]

    # 1. Descriptive statistics
    desc = pd.DataFrame({
        'group': group_names,
        'n': [len(g) for g in groups],
        'mean': [np.mean(g) for g in groups],
        'std': [np.std(g, ddof=1) for g in groups],
        'median': [np.median(g) for g in groups]
    })

    # 2. Check variance homogeneity (alternatively, skip the check and
    # default to the robust test, as recommended above)
    levene_stat, levene_p = levene(*groups)
    equal_var = levene_p > alpha

    # 3. Omnibus test
    if equal_var:
        f_stat, anova_p = stats.f_oneway(*groups)
        test_used = 'Standard ANOVA'
    else:
        welch_result = stats.alexandergovern(*groups)
        f_stat, anova_p = welch_result.statistic, welch_result.pvalue
        test_used = "Welch-style ANOVA (Alexander-Govern)"

    # 4. Post-hoc (if significant); Tukey is shown for simplicity,
    # but prefer Games-Howell when variances are unequal
    posthoc_results = None
    if anova_p < alpha:
        all_data = np.concatenate(groups)
        labels = np.repeat(group_names, [len(g) for g in groups])
        posthoc_results = pairwise_tukeyhsd(all_data, labels)

    # 5. Effect size (omega-squared)
    omega_sq = omega_squared(groups)

    return {
        'descriptives': desc,
        'test_used': test_used,
        'equal_variance': equal_var,
        'f_statistic': f_stat,
        'p_value': anova_p,
        'significant': anova_p < alpha,
        'effect_size': omega_sq,
        'posthoc': posthoc_results
    }
result = full_group_comparison(
[group1, group2, group3, group4],
['Control', 'Low Dose', 'Medium Dose', 'High Dose']
)
print(f"Test used: {result['test_used']}")
print(f"F = {result['f_statistic']:.2f}, p = {result['p_value']:.4f}")
print(f"Effect size (omega²): {result['effect_size']:.3f}")
if result['posthoc'] is not None:
print("\nPost-hoc comparisons:")
print(result['posthoc'])
Related Articles
- One-Way ANOVA: Assumptions, Effect Sizes, and Reporting
- Kruskal-Wallis: When It's Appropriate and Post-Hoc Strategy
- Post-Hoc Tests: Tukey, Dunnett, Games-Howell Decision Tree
- Heteroskedastic Groups: Games-Howell
- Two-Way ANOVA vs. Regression: Interactions in Product
- Trend Across Ordered Groups: Jonckheere-Terpstra
- Controlling Covariates: ANCOVA vs. Regression
- Visual Diagnostics for Group Comparisons
Frequently Asked Questions
Q: Should I run post-hoc tests even if the omnibus test is non-significant? A: Generally no: without a significant omnibus test you shouldn't proceed to post-hoc tests. However, you might report descriptive statistics and consider whether the study was adequately powered.
Q: Can I skip the omnibus test and go directly to post-hoc? A: Some argue this is acceptable if you use appropriate corrections (like Tukey's). The omnibus test adds a layer of protection against false positives, but isn't strictly required.
Q: My sample sizes are very unequal. Does that matter? A: Unequal sample sizes make ANOVA more sensitive to variance heterogeneity. Use Welch's ANOVA and Games-Howell post-hoc to be safe.
Q: Can ANOVA compare medians? A: No. ANOVA compares means. For medians, use Kruskal-Wallis (though it tests stochastic dominance, not medians specifically) or quantile regression.
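If medians themselves are the target, SciPy's stats.median_test (Mood's median test, an option we're adding here, not one named in the answer above) is one direct route; a minimal sketch:

# Mood's median test: counts observations above/below the grand
# median in each group and applies a chi-squared test
res = stats.median_test(group1, group2, group3, group4)
print(f"Grand median: {res.median:.2f}, p-value: {res.pvalue:.4f}")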
Key Takeaway
When comparing more than two groups, start with an omnibus test (ANOVA or Kruskal-Wallis) to see if any differences exist. If significant, use appropriate post-hoc tests to identify which specific groups differ. Choose your post-hoc method based on your question (all pairwise comparisons vs. comparisons to a control) and your assumptions (equal variances or not). Always report effect sizes alongside p-values.
References
- https://www.jstor.org/stable/2684445
- https://www.jstor.org/stable/3001968
- Fisher, R. A. (1925). *Statistical Methods for Research Workers*. Oliver and Boyd.
- Tukey, J. W. (1953). The problem of multiple comparisons. Unpublished manuscript, Princeton University.
- Games, P. A., & Howell, J. F. (1976). Pairwise multiple comparison procedures with unequal n's and/or variances: a Monte Carlo study. *Journal of Educational Statistics*, 1(2), 113-125.