Multi-Group Comparisons

Comparing More Than Two Groups: A Complete Guide

How to compare means, medians, and distributions across three or more groups. Covers ANOVA, Kruskal-Wallis, post-hoc tests, and when each method is appropriate.

Quick Hits

  • ANOVA tests whether any group differs—it doesn't tell you which groups differ
  • Always follow significant ANOVA with post-hoc tests to identify specific differences
  • Kruskal-Wallis is the non-parametric alternative, testing stochastic dominance rather than means
  • Welch's ANOVA handles unequal variances—use it as your default

TL;DR

Comparing three or more groups requires a two-stage approach: first test whether any differences exist (omnibus test), then identify which specific groups differ (post-hoc tests). ANOVA handles the first stage for means; Kruskal-Wallis for ranks. For post-hoc comparisons, Tukey's HSD handles all pairwise comparisons, Dunnett's compares to a control, and Games-Howell handles unequal variances.


Why Not Just Run Multiple T-Tests?

With 5 groups, you'd need 10 pairwise t-tests. Each test at α = 0.05 has a 5% false positive rate. Across 10 independent tests (assuming all nulls true):

$$P(\text{at least one false positive}) = 1 - (1-0.05)^{10} = 0.40$$

You have a 40% chance of finding at least one "significant" difference by chance alone.

ANOVA provides a single omnibus test controlling the overall Type I error rate.
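
A quick simulation, a minimal sketch assuming all five groups share the same distribution, reproduces the inflation:

import numpy as np
from scipy import stats
from itertools import combinations

rng = np.random.default_rng(42)
n_sims, n_groups, n_per_group = 2000, 5, 30

runs_with_false_positive = 0
for _ in range(n_sims):
    # All groups drawn from one distribution, so every rejection is false
    groups = [rng.normal(10, 2, n_per_group) for _ in range(n_groups)]
    p_values = [stats.ttest_ind(a, b).pvalue
                for a, b in combinations(groups, 2)]
    if min(p_values) < 0.05:
        runs_with_false_positive += 1

print(f"Familywise error rate: {runs_with_false_positive / n_sims:.2f}")
# Well above 0.05, somewhat below 0.40: the 10 tests share groups,
# so they are positively correlated rather than fully independent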


The Two-Stage Framework

Stage 1: Omnibus Test

Question: "Is there any difference among groups?"

  • ANOVA for means
  • Kruskal-Wallis for ranks

If p > α: Stop. No evidence of differences. If p < α: Proceed to post-hoc.

Stage 2: Post-Hoc Comparisons

Question: "Which specific groups differ?"

  • Tukey's HSD for all pairwise comparisons
  • Dunnett's for comparing to control
  • Games-Howell for unequal variances

One-Way ANOVA

Tests whether group means differ by comparing between-group variance to within-group variance.

The F-Test

$$F = \frac{\text{Between-group variance}}{\text{Within-group variance}} = \frac{MS_{between}}{MS_{within}}$$

Large F indicates groups differ more than expected from random variation.
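
As a sanity check, the F statistic can be assembled by hand from its sums of squares. A minimal sketch (the helper name f_by_hand is illustrative); it should agree with stats.f_oneway below to floating-point precision:

import numpy as np
from scipy import stats

def f_by_hand(groups):
    """One-way F statistic computed directly from sums of squares."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = np.mean(np.concatenate(groups))

    ss_between = sum(len(g) * (np.mean(g) - grand_mean)**2 for g in groups)
    ss_within = sum(np.sum((g - np.mean(g))**2) for g in groups)

    ms_between = ss_between / (k - 1)      # df_between = k - 1
    ms_within = ss_within / (n_total - k)  # df_within = N - k
    f = ms_between / ms_within
    p = stats.f.sf(f, k - 1, n_total - k)  # right tail of F(k-1, N-k)
    return f, p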

Python Implementation

import numpy as np
from scipy import stats
import pandas as pd

def one_way_anova(groups):
    """
    One-way ANOVA for comparing group means.

    groups: list of arrays, one per group
    """
    # F-test
    f_stat, p_value = stats.f_oneway(*groups)

    # Effect size (eta-squared)
    grand_mean = np.mean(np.concatenate(groups))
    ss_between = sum(len(g) * (np.mean(g) - grand_mean)**2 for g in groups)
    ss_total = sum(np.sum((g - grand_mean)**2) for g in groups)
    eta_squared = ss_between / ss_total

    # Group statistics
    group_stats = []
    for i, g in enumerate(groups):
        group_stats.append({
            'group': i + 1,
            'n': len(g),
            'mean': np.mean(g),
            'std': np.std(g, ddof=1)
        })

    return {
        'f_statistic': f_stat,
        'p_value': p_value,
        'eta_squared': eta_squared,
        'group_stats': pd.DataFrame(group_stats)
    }


# Example: comparing 4 treatment groups
np.random.seed(42)
group1 = np.random.normal(10, 2, 30)
group2 = np.random.normal(11, 2, 30)
group3 = np.random.normal(10, 2, 30)
group4 = np.random.normal(13, 2, 30)

result = one_way_anova([group1, group2, group3, group4])
print(f"F-statistic: {result['f_statistic']:.2f}")
print(f"P-value: {result['p_value']:.4f}")
print(f"Eta-squared: {result['eta_squared']:.3f}")
print(result['group_stats'])

R Implementation

# Create data frame
df <- data.frame(
  value = c(group1, group2, group3, group4),
  group = factor(rep(1:4, each = 30))
)

# One-way ANOVA
model <- aov(value ~ group, data = df)
summary(model)

# Effect size
library(effectsize)
eta_squared(model)

Welch's ANOVA

Standard ANOVA assumes equal variances. Welch's ANOVA relaxes this assumption—use it as your default.

def welch_anova(groups):
    """
    Welch-style ANOVA for unequal variances.

    scipy has no direct Welch's ANOVA; the Alexander-Govern
    test is its closest built-in equivalent.
    """
    result = stats.alexandergovern(*groups)

    return {
        'statistic': result.statistic,
        'p_value': result.pvalue
    }


# With unequal variances (fresh arrays, so group1-group4 from the
# earlier example stay intact for the sections below)
uneq1 = np.random.normal(10, 1, 30)  # SD = 1
uneq2 = np.random.normal(11, 3, 30)  # SD = 3 (different!)
uneq3 = np.random.normal(10, 2, 30)  # SD = 2

standard = one_way_anova([uneq1, uneq2, uneq3])
welch = welch_anova([uneq1, uneq2, uneq3])

print(f"Standard ANOVA p: {standard['p_value']:.4f}")
print(f"Welch-style ANOVA p: {welch['p_value']:.4f}")

R Implementation

# Welch's ANOVA (oneway.test defaults to var.equal = FALSE)
oneway.test(value ~ group, data = df, var.equal = FALSE)

Kruskal-Wallis Test

Non-parametric alternative when normality is questionable or the data are ordinal.

def kruskal_wallis(groups):
    """
    Kruskal-Wallis test for comparing distributions.
    """
    stat, p_value = stats.kruskal(*groups)

    return {
        'h_statistic': stat,
        'p_value': p_value
    }


result = kruskal_wallis([group1, group2, group3, group4])
print(f"H-statistic: {result['h_statistic']:.2f}")
print(f"P-value: {result['p_value']:.4f}")

What It Actually Tests

Like Mann-Whitney, Kruskal-Wallis tests stochastic dominance across groups—whether groups tend to produce systematically higher or lower values—not whether means differ.
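
A toy demonstration, a sketch with simulated data (not the groups above): two samples built to share the same population mean but differ in shape can still produce a significant Kruskal-Wallis result, because their rank distributions differ.

rng = np.random.default_rng(7)

skewed = rng.exponential(1.0, 2000) - 1.0  # population mean 0, median ~ -0.31
symmetric = rng.normal(0.0, 1.0, 2000)     # population mean 0, median 0

stat, p = stats.kruskal(skewed, symmetric)
print(f"Sample means: {skewed.mean():.3f} vs {symmetric.mean():.3f}")
print(f"Kruskal-Wallis p: {p:.4f}")  # typically tiny: ranks shift, means don't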


Post-Hoc Tests

After rejecting the ANOVA null, identify which groups differ.

Tukey's HSD (Honestly Significant Difference)

For all pairwise comparisons with equal variances.

from statsmodels.stats.multicomp import pairwise_tukeyhsd
import pandas as pd

def tukey_hsd(data, group_labels):
    """
    Tukey's HSD for all pairwise comparisons.

    data: 1D array of all values
    group_labels: 1D array of group membership
    """
    result = pairwise_tukeyhsd(data, group_labels)
    return result


# Create data for Tukey's
all_data = np.concatenate([group1, group2, group3, group4])
labels = np.repeat(['A', 'B', 'C', 'D'], 30)

tukey_result = tukey_hsd(all_data, labels)
print(tukey_result)

R Implementation

# Tukey's HSD
TukeyHSD(model)

Dunnett's Test

For comparing all groups to a single control.

def dunnett_test(control, treatments):
    """
    Dunnett's test: compare each treatment to control.
    Uses scipy.stats.dunnett (added in scipy 1.11).
    """
    result = stats.dunnett(*treatments, control=control)
    return result


# Group 1 is control
control = group1
treatments = [group2, group3, group4]

result = dunnett_test(control, treatments)
print(f"Dunnett's test statistics: {result.statistic}")
print(f"P-values: {result.pvalue}")

Games-Howell

For pairwise comparisons with unequal variances.

import scikit_posthocs as sp

def games_howell(data, group_labels):
    """
    Games-Howell-style comparisons, approximated here with pairwise
    Welch t-tests plus Holm correction. scikit-posthocs has no exact
    Games-Howell; pingouin's pairwise_gameshowell implements the
    real procedure if you need it.
    """
    df = pd.DataFrame({'value': data, 'group': group_labels})
    result = sp.posthoc_ttest(df, val_col='value', group_col='group',
                              equal_var=False, p_adjust='holm')
    return result


# Using scikit-posthocs
result = games_howell(all_data, labels)
print(result)

Post-Hoc Decision Guide

Situation                          Recommended Test
---------------------------------  --------------------------------------
All pairwise, equal variances      Tukey's HSD
All pairwise, unequal variances    Games-Howell
Compare to control only            Dunnett's
Ordered groups (dose-response)     Linear contrast or Jonckheere-Terpstra
Non-parametric follow-up           Dunn's test (sketched below)
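
The last row names Dunn's test as the rank-based follow-up to Kruskal-Wallis; a minimal sketch via scikit-posthocs (the same package used for the Games-Howell approximation above), reusing all_data and labels from the Tukey example:

import scikit_posthocs as sp

dunn_df = pd.DataFrame({'value': all_data, 'group': labels})
dunn_result = sp.posthoc_dunn(dunn_df, val_col='value', group_col='group',
                              p_adjust='holm')
print(dunn_result)  # matrix of Holm-adjusted pairwise p-values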

Effect Sizes

Always report effect sizes alongside p-values.

Eta-Squared (η²)

Proportion of variance explained by group membership.

$$\eta^2 = \frac{SS_{between}}{SS_{total}}$$

η²      Interpretation
------  --------------
0.01    Small
0.06    Medium
0.14    Large

Omega-Squared (ω²)

Less biased estimate of population effect size.

$$\omega^2 = \frac{SS_{between} - (k-1)MS_{within}}{SS_{total} + MS_{within}}$$

def omega_squared(groups):
    """Calculate omega-squared effect size."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)

    grand_mean = np.mean(np.concatenate(groups))
    ss_between = sum(len(g) * (np.mean(g) - grand_mean)**2 for g in groups)
    ss_within = sum(np.sum((g - np.mean(g))**2) for g in groups)
    ss_total = ss_between + ss_within

    ms_within = ss_within / (n_total - k)

    omega_sq = (ss_between - (k - 1) * ms_within) / (ss_total + ms_within)

    return omega_sq


omega = omega_squared([group1, group2, group3, group4])
print(f"Omega-squared: {omega:.3f}")

Assumptions and Diagnostics

ANOVA Assumptions

  1. Independence: Observations are independent
  2. Normality: Data within each group is approximately normal
  3. Homogeneity of variance: Equal variances across groups

Checking Assumptions

from scipy.stats import levene, shapiro

def anova_diagnostics(groups):
    """Check ANOVA assumptions."""
    # Levene's test for equal variances
    levene_stat, levene_p = levene(*groups)

    # Shapiro-Wilk for normality (per group)
    normality_tests = []
    for i, g in enumerate(groups):
        if len(g) >= 3 and len(g) <= 5000:
            stat, p = shapiro(g)
            normality_tests.append({'group': i+1, 'statistic': stat, 'p_value': p})

    return {
        'levene': {'statistic': levene_stat, 'p_value': levene_p},
        'normality': pd.DataFrame(normality_tests)
    }


diagnostics = anova_diagnostics([group1, group2, group3, group4])
print(f"Levene's test p-value: {diagnostics['levene']['p_value']:.4f}")
print("Normality tests:")
print(diagnostics['normality'])

When Assumptions Fail

Violation            Solution
-------------------  ----------------------------------------------------
Unequal variances    Welch's ANOVA + Games-Howell
Non-normality        Kruskal-Wallis + Dunn's test, or bootstrap (see sketch below)
Outliers             Robust ANOVA, trimmed means
Non-independence     Mixed models, cluster adjustments
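
For the non-normality row, one resampling route is a permutation version of the F test: the reference distribution comes from shuffling group labels instead of from normal theory. This is a permutation test rather than a bootstrap proper, but it fills the same role; a sketch using scipy.stats.permutation_test (available in recent scipy):

def f_statistic(*groups):
    return stats.f_oneway(*groups).statistic

# Null distribution built by reassigning observations to groups at random
perm_result = stats.permutation_test(
    (group1, group2, group3, group4),
    f_statistic,
    permutation_type='independent',
    alternative='greater',
    n_resamples=9999,
)
print(f"Permutation p-value: {perm_result.pvalue:.4f}")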

Practical Workflow

def full_group_comparison(groups, group_names=None, alpha=0.05):
    """
    Complete workflow for comparing multiple groups.
    """
    if group_names is None:
        group_names = [f'Group{i+1}' for i in range(len(groups))]

    # 1. Descriptive statistics
    desc = pd.DataFrame({
        'group': group_names,
        'n': [len(g) for g in groups],
        'mean': [np.mean(g) for g in groups],
        'std': [np.std(g, ddof=1) for g in groups],
        'median': [np.median(g) for g in groups]
    })

    # 2. Check assumptions
    levene_stat, levene_p = levene(*groups)
    equal_var = levene_p > alpha

    # 3. Omnibus test
    if equal_var:
        f_stat, anova_p = stats.f_oneway(*groups)
        test_used = 'Standard ANOVA'
    else:
        welch_result = stats.alexandergovern(*groups)
        f_stat, anova_p = welch_result.statistic, welch_result.pvalue
        test_used = "Welch's ANOVA"

    # 4. Post-hoc (if significant). Tukey's HSD is shown for simplicity;
    # with unequal variances, Games-Howell would be the better follow-up.
    posthoc_results = None
    if anova_p < alpha:
        all_data = np.concatenate(groups)
        labels = np.repeat(group_names, [len(g) for g in groups])
        posthoc_results = pairwise_tukeyhsd(all_data, labels)

    # 5. Effect size (omega-squared, the less biased estimate)
    omega_sq = omega_squared(groups)

    return {
        'descriptives': desc,
        'test_used': test_used,
        'equal_variance': equal_var,
        'f_statistic': f_stat,
        'p_value': anova_p,
        'significant': anova_p < alpha,
        'effect_size': omega_sq,
        'posthoc': posthoc_results
    }


result = full_group_comparison(
    [group1, group2, group3, group4],
    ['Control', 'Low Dose', 'Medium Dose', 'High Dose']
)

print(f"Test used: {result['test_used']}")
print(f"F = {result['f_statistic']:.2f}, p = {result['p_value']:.4f}")
print(f"Effect size (omega²): {result['effect_size']:.3f}")
if result['posthoc'] is not None:
    print("\nPost-hoc comparisons:")
    print(result['posthoc'])


Frequently Asked Questions

Q: Should I run post-hoc tests even if the omnibus test is non-significant? A: Generally no: a non-significant omnibus test means there is no evidence of any difference to localize. You might still report descriptive statistics and consider whether the study was adequately powered.

Q: Can I skip the omnibus test and go directly to post-hoc? A: Some argue this is acceptable if you use appropriate corrections (like Tukey's). The omnibus test adds a layer of protection against false positives, but isn't strictly required.

Q: My sample sizes are very unequal. Does that matter? A: Unequal sample sizes make ANOVA more sensitive to variance heterogeneity. Use Welch's ANOVA and Games-Howell post-hoc to be safe.

Q: Can ANOVA compare medians? A: No. ANOVA compares means. For medians, use Kruskal-Wallis (though it tests stochastic dominance, not medians specifically) or quantile regression.
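
If a direct test of medians is wanted, one option beyond those named above is Mood's median test, available as scipy.stats.median_test; a minimal sketch on the example groups:

# Mood's median test: classifies observations as above/below the grand
# median and runs a chi-squared test on the resulting contingency table
stat, p, grand_median, table = stats.median_test(group1, group2, group3, group4)
print(f"Grand median: {grand_median:.2f}")
print(f"Mood's median test p-value: {p:.4f}")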


Key Takeaway

When comparing more than two groups, start with an omnibus test (ANOVA or Kruskal-Wallis) to see if any differences exist. If significant, use appropriate post-hoc tests to identify which specific groups differ. Choose your post-hoc method based on your question (all pairwise comparisons vs. comparisons to a control) and your assumptions (equal variances or not). Always report effect sizes alongside p-values.


References

  1. https://www.jstor.org/stable/2684445
  2. https://www.jstor.org/stable/3001968
  3. Fisher, R. A. (1925). *Statistical Methods for Research Workers*. Oliver and Boyd.
  4. Tukey, J. W. (1953). The problem of multiple comparisons. Unpublished manuscript, Princeton University.
  5. Games, P. A., & Howell, J. F. (1976). Pairwise multiple comparison procedures with unequal n's and/or variances: a Monte Carlo study. *Journal of Educational Statistics*, 1(2), 113-125.
