Multi-Group Comparisons

Post-Hoc Tests: Tukey, Dunnett, and Games-Howell Decision Tree

How to choose the right post-hoc test after ANOVA. Covers Tukey's HSD, Dunnett's test, Games-Howell, and Scheffé, and provides a clear decision tree for choosing among them.

Quick Hits

  • Tukey's HSD is the default for all pairwise comparisons with equal variances
  • Dunnett's is more powerful when you only care about comparisons to control
  • Games-Howell handles unequal variances and unequal sample sizes
  • Scheffé controls for ANY comparison, including complex contrasts—very conservative

TL;DR

After a significant ANOVA, post-hoc tests identify which groups differ. Tukey's HSD handles all pairwise comparisons when variances are equal. Dunnett's is more powerful when comparing treatments to a single control. Games-Howell handles unequal variances. Scheffé allows any comparison but is conservative. Choose based on your specific question and data characteristics.


The Decision Tree

Significant ANOVA → Which comparisons do you need?
│
├── Only comparing to control
│   └── Use DUNNETT'S TEST
│
├── All pairwise comparisons
│   │
│   ├── Equal variances?
│   │   ├── Yes → TUKEY'S HSD
│   │   └── No → GAMES-HOWELL
│   │
│   └── Very unequal sample sizes?
│       └── GAMES-HOWELL (safer)
│
├── Complex contrasts (not just pairs)
│   └── SCHEFFÉ
│
└── Pre-planned specific comparisons
    └── BONFERRONI or HOLM (on t-tests)
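
For the tree's last branch, pre-planned comparisons, here is a minimal sketch of Holm-corrected Welch t-tests; the planned_comparisons helper and its argument format are illustrative, not a library API:

from scipy.stats import ttest_ind
from statsmodels.stats.multitest import multipletests

def planned_comparisons(pairs, alpha=0.05):
    """Holm-adjusted p-values for a small set of pre-planned pairs."""
    p_raw = [ttest_ind(a, b, equal_var=False).pvalue for a, b in pairs]
    # multipletests returns (reject flags, adjusted p-values, ...)
    reject, p_adj, _, _ = multipletests(p_raw, alpha=alpha, method='holm')
    return p_adj, reject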

Tukey's HSD (Honestly Significant Difference)

The standard choice for all pairwise comparisons with equal variances.

How It Works

Compares each pair of means using a studentized range distribution, controlling the family-wise error rate across all comparisons.
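
Before the library call, it helps to see the statistic for a single pair. A minimal sketch of one Tukey-Kramer comparison (the tukey_pair_pvalue helper is ours; scipy.stats.studentized_range requires scipy >= 1.7):

import numpy as np
from scipy.stats import studentized_range

def tukey_pair_pvalue(g1, g2, groups):
    """Tukey p-value for one pair, using the pooled ANOVA error term."""
    k = len(groups)
    N = sum(len(g) for g in groups)
    # Mean square error: pooled within-group variance
    mse = sum((len(g) - 1) * np.var(g, ddof=1) for g in groups) / (N - k)
    se = np.sqrt(mse / 2 * (1 / len(g1) + 1 / len(g2)))  # Tukey-Kramer SE
    q = abs(np.mean(g1) - np.mean(g2)) / se
    return studentized_range.sf(q, k, N - k)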

Python Implementation

from statsmodels.stats.multicomp import pairwise_tukeyhsd
import numpy as np
import pandas as pd

def tukey_hsd(groups, group_names=None, alpha=0.05):
    """
    Tukey's HSD for all pairwise comparisons.
    """
    if group_names is None:
        group_names = [f'Group{i+1}' for i in range(len(groups))]

    all_data = np.concatenate(groups)
    labels = np.repeat(group_names, [len(g) for g in groups])

    result = pairwise_tukeyhsd(all_data, labels, alpha=alpha)

    # Convert the summary table to a dataframe for easier reading
    # (the table data is a list of rows; row 0 holds the headers)
    table = result._results_table.data
    summary_df = pd.DataFrame(table[1:], columns=table[0])

    return result, summary_df


# Example
np.random.seed(42)
control = np.random.normal(50, 10, 25)
treatment_a = np.random.normal(55, 10, 25)
treatment_b = np.random.normal(52, 10, 25)
treatment_c = np.random.normal(58, 10, 25)

result, df = tukey_hsd(
    [control, treatment_a, treatment_b, treatment_c],
    ['Control', 'Treat_A', 'Treat_B', 'Treat_C']
)

print("Tukey's HSD Results:")
print(df.to_string(index=False))

R Implementation

# After ANOVA
model <- aov(value ~ group, data = df)
TukeyHSD(model)

# Visualize
plot(TukeyHSD(model))

When to Use

  • All pairwise comparisons needed
  • Equal (or nearly equal) variances
  • Reasonably balanced sample sizes

Dunnett's Test

Compares all treatments to a single control—more powerful than Tukey's for this specific situation.

Why More Powerful?

With k groups, Tukey's makes k(k-1)/2 comparisons. Dunnett's makes only k-1 comparisons (each treatment vs. control). Fewer comparisons = less severe correction = more power.
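
To make the difference concrete for k = 4 groups (a Bonferroni-style split, used here only for illustration; Dunnett's actual adjustment is sharper):

k = 4
tukey_pairs = k * (k - 1) // 2   # 6 pairwise comparisons
dunnett_pairs = k - 1            # 3 treatment-vs-control comparisons
print(f"Per-test alpha, all pairs:    {0.05 / tukey_pairs:.4f}")   # 0.0083
print(f"Per-test alpha, control only: {0.05 / dunnett_pairs:.4f}") # 0.0167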

Python Implementation

from scipy.stats import dunnett  # requires scipy >= 1.11

def dunnett_test(control, treatments, treatment_names=None):
    """
    Dunnett's test: compare treatments to control.
    """
    if treatment_names is None:
        treatment_names = [f'Treatment{i+1}' for i in range(len(treatments))]

    result = dunnett(*treatments, control=control)

    summary = pd.DataFrame({
        'Treatment': treatment_names,
        'Statistic': result.statistic,
        'p-value': result.pvalue,
        'Significant': result.pvalue < 0.05
    })

    return result, summary


# Control vs treatments
control = np.random.normal(50, 10, 25)
treatments = [
    np.random.normal(55, 10, 25),
    np.random.normal(52, 10, 25),
    np.random.normal(58, 10, 25)
]

result, df = dunnett_test(control, treatments, ['A', 'B', 'C'])
print("Dunnett's Test Results:")
print(df.to_string(index=False))

R Implementation

library(multcomp)
model <- aov(value ~ group, data = df)
dunnett <- glht(model, linfct = mcp(group = "Dunnett"))
summary(dunnett)

When to Use

  • Comparing multiple treatments to one control
  • Don't need treatment-to-treatment comparisons
  • Want maximum power for your specific question

Games-Howell

For unequal variances and/or unequal sample sizes.

How It Works

Like Tukey's but doesn't assume equal variances—uses separate variance estimates for each comparison (similar to Welch's t-test).

Python Implementation

import scikit_posthocs as sp

def games_howell(groups, group_names=None):
    """
    Games-Howell test for unequal variances.
    """
    if group_names is None:
        group_names = [f'Group{i+1}' for i in range(len(groups))]

    all_data = np.concatenate(groups)
    labels = np.repeat(group_names, [len(g) for g in groups])
    df = pd.DataFrame({'value': all_data, 'group': labels})

    result = sp.posthoc_ttest(df, val_col='value', group_col='group',
                               equal_var=False, p_adjust='holm')

    return result


# Groups with unequal variances
group1 = np.random.normal(50, 5, 20)   # SD = 5
group2 = np.random.normal(55, 15, 25)  # SD = 15 (larger!)
group3 = np.random.normal(52, 10, 30)  # SD = 10

result = games_howell([group1, group2, group3], ['A', 'B', 'C'])
print("Games-Howell Results (p-values):")
print(result)
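
The call above stands in for Games-Howell using Welch t-tests plus Holm. For the exact Games-Howell statistic (per-pair Welch degrees of freedom with studentized range critical values), the pingouin package provides an implementation, assuming it is installed:

import pingouin as pg

gh_df = pd.DataFrame({
    'value': np.concatenate([group1, group2, group3]),
    'group': np.repeat(['A', 'B', 'C'],
                       [len(group1), len(group2), len(group3)])
})
print(pg.pairwise_gameshowell(data=gh_df, dv='value', between='group'))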

R Implementation

library(rstatix)
games_howell_test(df, value ~ group)

# Or using PMCMRplus
library(PMCMRplus)
gamesHowellTest(value ~ group, data = df)

When to Use

  • Levene's test shows unequal variances
  • Substantially different sample sizes
  • As a robust default when uncertain about variance equality

Scheffé's Test

The most conservative test—controls for ANY possible comparison, including complex contrasts.

When to Use

  • You might want to test contrasts you didn't pre-specify
  • Exploratory analysis where any comparison is possible
  • You need maximum protection against false positives

Python Implementation

from itertools import combinations
from scipy import stats

def scheffe_test(groups, group_names=None):
    """
    Scheffé's test for all pairwise comparisons, computed directly:
    each squared t-statistic (on the pooled ANOVA error term) is
    divided by k-1 and referred to the F(k-1, N-k) distribution.
    """
    if group_names is None:
        group_names = [f'Group{i+1}' for i in range(len(groups))]

    k, ns = len(groups), [len(g) for g in groups]
    N = sum(ns)
    means = [np.mean(g) for g in groups]
    # Pooled within-group variance (the ANOVA mean square error)
    mse = sum((n - 1) * np.var(g, ddof=1) for g, n in zip(groups, ns)) / (N - k)

    rows = []
    for i, j in combinations(range(k), 2):
        diff = means[i] - means[j]
        se = np.sqrt(mse * (1 / ns[i] + 1 / ns[j]))
        f = (diff / se) ** 2 / (k - 1)  # Scheffé's criterion
        rows.append((group_names[i], group_names[j], diff,
                     stats.f.sf(f, k - 1, N - k)))

    return pd.DataFrame(rows, columns=['Group 1', 'Group 2',
                                       'Mean Diff', 'p-value'])


# Note: Scheffé is very conservative for pairwise comparisons
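
Scheffé's real strength is arbitrary contrasts, not pairs. A minimal sketch of testing "control vs. the average of the three treatments" under the Scheffé criterion (the scheffe_contrast helper is ours, reusing the groups from the Tukey example):

def scheffe_contrast(groups, coeffs):
    """Scheffé test of an arbitrary contrast sum(c_i * mean_i)."""
    k, ns = len(groups), [len(g) for g in groups]
    N = sum(ns)
    mse = sum((n - 1) * np.var(g, ddof=1) for g, n in zip(groups, ns)) / (N - k)
    est = sum(c * np.mean(g) for c, g in zip(coeffs, groups))
    se2 = mse * sum(c**2 / n for c, n in zip(coeffs, ns))
    f = est**2 / se2 / (k - 1)  # Scheffé refers this to F(k-1, N-k)
    return est, stats.f.sf(f, k - 1, N - k)

# Control vs. the average of the three treatments
est, p = scheffe_contrast(
    [control, treatment_a, treatment_b, treatment_c],
    [1, -1/3, -1/3, -1/3]
)
print(f"Contrast estimate = {est:.2f}, Scheffé p = {p:.4f}")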

R Implementation

library(DescTools)
ScheffeTest(model)

Warning

Scheffé is designed for post-hoc exploration of any contrast. For planned pairwise comparisons, it's overly conservative—use Tukey's instead.


Comparison Table

Test          Best For       Assumes Equal Variance   Power                       Type I Error Control
Tukey's HSD   All pairwise   Yes                      High                        FWER for all pairs
Dunnett's     vs. Control    Yes                      Highest (for its purpose)   FWER for control comparisons
Games-Howell  All pairwise   No                       Moderate                    FWER for all pairs
Scheffé       Any contrast   Yes                      Low                         FWER for ANY comparison
Bonferroni    Pre-planned    Either                   Varies                      FWER

Power Comparison

Same data, different tests:

def compare_posthoc_methods(groups, group_names):
    """Compare adjusted p-values from different methods for the same pair."""
    all_data = np.concatenate(groups)
    labels = np.repeat(group_names, [len(g) for g in groups])

    # pairwise_tukeyhsd sorts group labels, so comparison 0 is between
    # the first two names in alphabetical order
    g1, g2 = sorted(group_names)[:2]
    i1, i2 = group_names.index(g1), group_names.index(g2)
    print(f"Method Comparison (adjusted p-values for {g1} vs {g2}):")

    # Tukey's HSD
    tukey = pairwise_tukeyhsd(all_data, labels)
    print(f"  Tukey's HSD: p = {tukey.pvalues[0]:.4f}")

    # Bonferroni (manual), on the same pair
    from scipy.stats import ttest_ind
    _, p_raw = ttest_ind(groups[i1], groups[i2])
    n_comparisons = len(groups) * (len(groups) - 1) // 2
    p_bonf = min(1, p_raw * n_comparisons)
    print(f"  Bonferroni: p = {p_bonf:.4f}")

    # Games-Howell approximation (Welch t-tests + Holm, as above)
    df = pd.DataFrame({'value': all_data, 'group': labels})
    gh = sp.posthoc_ttest(df, val_col='value', group_col='group',
                          equal_var=False, p_adjust='holm')
    print(f"  Games-Howell approx. (Holm): p = {gh.loc[g1, g2]:.4f}")


compare_posthoc_methods(
    [control, treatment_a, treatment_b, treatment_c],
    ['Control', 'A', 'B', 'C']
)

Common Mistakes

Using Bonferroni for All Pairwise

Bonferroni is more conservative than Tukey's for pairwise comparisons. Use Tukey's instead.

Ignoring Variance Heterogeneity

If Levene's test is significant, Tukey's HSD may have inflated Type I error. Use Games-Howell.

Post-Hoc Without Significant ANOVA

Technically, if ANOVA is non-significant, you shouldn't interpret post-hoc results (though some argue for pre-planned comparisons regardless).

Using Scheffé for Simple Pairwise

Scheffé is for any possible contrast. For pairwise comparisons, it's unnecessarily conservative.


Practical Workflow

def posthoc_workflow(groups, group_names, alpha=0.05):
    """
    Complete post-hoc analysis workflow.
    """
    from scipy.stats import levene, f_oneway

    # 1. Check ANOVA significance
    f_stat, anova_p = f_oneway(*groups)
    print(f"Step 1: ANOVA F = {f_stat:.2f}, p = {anova_p:.4f}")

    if anova_p >= alpha:
        print("  ANOVA not significant. Post-hoc not warranted.")
        return None

    # 2. Check variance homogeneity
    levene_stat, levene_p = levene(*groups)
    equal_var = levene_p > alpha
    print(f"\nStep 2: Levene's test p = {levene_p:.4f}")
    print(f"  Variances appear {'equal' if equal_var else 'unequal'}")

    # 3. Choose and run post-hoc
    all_data = np.concatenate(groups)
    labels = np.repeat(group_names, [len(g) for g in groups])

    if equal_var:
        print("\nStep 3: Using Tukey's HSD")
        result = pairwise_tukeyhsd(all_data, labels, alpha=alpha)
        print(result)
    else:
        print("\nStep 3: Using Games-Howell")
        df = pd.DataFrame({'value': all_data, 'group': labels})
        result = sp.posthoc_ttest(df, val_col='value', group_col='group',
                                   equal_var=False, p_adjust='holm')
        print(result)

    return result


posthoc_workflow(
    [control, treatment_a, treatment_b, treatment_c],
    ['Control', 'A', 'B', 'C']
)


Key Takeaway

Choose your post-hoc test based on your question: Dunnett's for comparisons to control (most powerful for that purpose), Tukey's for all pairwise comparisons with equal variances (the default), Games-Howell for unequal variances. Don't use Bonferroni for pairwise comparisons—Tukey's is specifically designed for that and is more powerful.


References

  1. https://www.jstor.org/stable/2685182
  2. https://www.jstor.org/stable/2684452
  3. Tukey, J. W. (1953). The problem of multiple comparisons. Unpublished manuscript, Princeton University.
  4. Dunnett, C. W. (1955). A multiple comparison procedure for comparing several treatments with a control. *Journal of the American Statistical Association*, 50(272), 1096-1121.
  5. Games, P. A., & Howell, J. F. (1976). Pairwise multiple comparison procedures with unequal n's and/or variances. *Journal of Educational Statistics*, 1(2), 113-125.

Frequently Asked Questions

Do I need post-hoc tests if ANOVA is not significant?
Generally no. A non-significant ANOVA means you cannot reject the hypothesis that all group means are equal. However, some argue that pre-planned comparisons can proceed regardless.
What's the difference between Tukey's HSD and Bonferroni?
Tukey's HSD is designed specifically for all pairwise comparisons and is more powerful for that purpose. Bonferroni is more general but more conservative for pairwise comparisons.
Which test is most powerful?
Dunnett's (when comparing to control only) > Tukey's (all pairs) > Scheffé (any contrast). More specific tests are more powerful.
