Multiple Comparisons: When Bonferroni Is Too Conservative
A practical guide to controlling false positives when testing multiple hypotheses: when Bonferroni over-corrects, which alternatives (Holm, FDR) work better, and when to skip correction entirely.
Quick Hits
- Bonferroni is simple but often too conservative, reducing power substantially
- Holm's method is always at least as powerful as Bonferroni, so there is no reason not to use it
- FDR controls the expected proportion of false positives among discoveries, not the chance of any false positive
- Not every multiple-testing situation requires correction
TL;DR
Testing multiple hypotheses inflates false positive rates—with 20 tests at α=0.05, you expect one false positive by chance. Bonferroni correction (α/k) controls this but is often too conservative, slashing power. Better alternatives: Holm's step-down method (always preferred over Bonferroni), FDR for exploratory work, or no correction when tests are truly independent decisions.
The Multiple Comparisons Problem
Why It Matters
The simulation below runs 20 independent t-tests in which every null hypothesis is true and tracks how often at least one test comes out significant anyway.
```python
import numpy as np
from scipy import stats
import pandas as pd


def demonstrate_multiple_testing_problem():
    """
    Show how multiple testing inflates false positive rate.
    """
    np.random.seed(42)

    # Simulate 20 tests, all null (no real effects)
    n_tests = 20
    n_sims = 10000
    alpha = 0.05

    any_false_positive = 0
    for _ in range(n_sims):
        p_values = []
        for _ in range(n_tests):
            # Two groups with same mean (null is true)
            g1 = np.random.normal(0, 1, 30)
            g2 = np.random.normal(0, 1, 30)
            _, p = stats.ttest_ind(g1, g2)
            p_values.append(p)
        # Any significant?
        if min(p_values) < alpha:
            any_false_positive += 1

    familywise_error = any_false_positive / n_sims
    expected = 1 - (1 - alpha)**n_tests

    print("Multiple Testing Without Correction:")
    print("-" * 50)
    print(f"Number of tests: {n_tests}")
    print(f"Alpha per test: {alpha}")
    print(f"Expected FWER: {expected:.3f}")
    print(f"Observed FWER: {familywise_error:.3f}")
    print()
    print(f"You have a {familywise_error:.0%} chance of at least one")
    print("false positive, even when ALL nulls are true!")


demonstrate_multiple_testing_problem()
```
The Math
Probability of at least one false positive:
$$FWER = 1 - (1 - \alpha)^k$$
For k=20 tests at α=0.05: $$FWER = 1 - 0.95^{20} \approx 0.64$$
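A quick way to build intuition is to evaluate this formula for a few family sizes. The sketch below is plain Python using only the formula above, with no new dependencies:

```python
# FWER = 1 - (1 - alpha)^k for increasing numbers of tests
alpha = 0.05
for k in [1, 5, 10, 20, 50, 100]:
    fwer = 1 - (1 - alpha) ** k
    print(f"k = {k:>3}: FWER = {fwer:.2f}")
# Prints roughly 0.05, 0.23, 0.40, 0.64, 0.92, 0.99
```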
Correction Methods
Bonferroni
The simplest approach: divide α by the number of tests.
```python
def bonferroni_correction(p_values, alpha=0.05):
    """
    Bonferroni correction: reject if p < alpha/k.
    """
    k = len(p_values)
    adjusted_alpha = alpha / k
    results = []
    for i, p in enumerate(p_values):
        results.append({
            'test': i + 1,
            'p_value': p,
            'adjusted_p': min(p * k, 1.0),  # Adjusted p-value
            'significant': p < adjusted_alpha
        })
    return pd.DataFrame(results), adjusted_alpha


# Example
p_values = [0.001, 0.01, 0.02, 0.04, 0.06, 0.10, 0.15, 0.30, 0.50, 0.90]
df, adj_alpha = bonferroni_correction(p_values)

print("Bonferroni Correction:")
print("-" * 50)
print("Original α: 0.05")
print(f"Adjusted α: {adj_alpha:.4f}")
print()
print(df.to_string(index=False))
```
Why Bonferroni Is Too Conservative
The simulation below compares the power of uncorrected testing, Bonferroni, Holm, and BH-FDR when 3 of 10 tests have real effects.
```python
def demonstrate_bonferroni_conservatism():
    """
    Show how Bonferroni reduces power dramatically.
    """
    np.random.seed(42)
    n_sims = 5000

    # 10 tests, 3 have real effects
    n_tests = 10
    true_effects = [0.5, 0.5, 0.5, 0, 0, 0, 0, 0, 0, 0]  # First 3 are real

    # Track power for each method
    results = {'Uncorrected': [], 'Bonferroni': [], 'Holm': [], 'FDR': []}

    for _ in range(n_sims):
        p_values = []
        for effect in true_effects:
            g1 = np.random.normal(0, 1, 50)
            g2 = np.random.normal(effect, 1, 50)
            _, p = stats.ttest_ind(g1, g2)
            p_values.append(p)
        p_values = np.array(p_values)

        # Uncorrected: just count p < 0.05 for true effects
        results['Uncorrected'].append(np.sum(p_values[:3] < 0.05))
        # Bonferroni
        results['Bonferroni'].append(np.sum(p_values[:3] < 0.05 / n_tests))
        # Holm (implemented below)
        holm_sig = holm_correction(p_values, 0.05)
        results['Holm'].append(np.sum(holm_sig[:3]))
        # BH-FDR (implemented below)
        fdr_sig = benjamini_hochberg(p_values, 0.05)
        results['FDR'].append(np.sum(fdr_sig[:3]))

    # Average power per true effect
    print("Power Comparison (3 true effects out of 10 tests):")
    print("-" * 50)
    for method, detections in results.items():
        avg_power = np.mean(detections) / 3  # Divide by number of true effects
        print(f"{method:<15}: {avg_power:.1%} average power per true effect")


def holm_correction(p_values, alpha=0.05):
    """
    Holm's step-down correction.
    """
    n = len(p_values)
    sorted_idx = np.argsort(p_values)
    sorted_p = p_values[sorted_idx]
    significant = np.zeros(n, dtype=bool)
    for i, (idx, p) in enumerate(zip(sorted_idx, sorted_p)):
        if p < alpha / (n - i):
            significant[idx] = True
        else:
            break  # Stop at first non-significant
    return significant


def benjamini_hochberg(p_values, alpha=0.05):
    """
    Benjamini-Hochberg FDR correction.
    """
    n = len(p_values)
    sorted_idx = np.argsort(p_values)
    sorted_p = p_values[sorted_idx]
    # Find largest k where p(k) <= k*alpha/n
    significant = np.zeros(n, dtype=bool)
    max_k = 0
    for k in range(n):
        if sorted_p[k] <= (k + 1) * alpha / n:
            max_k = k + 1
    # All up to max_k are significant
    for k in range(max_k):
        significant[sorted_idx[k]] = True
    return significant


demonstrate_bonferroni_conservatism()
```
Better Alternatives
Holm's Step-Down Method
Holm's method is always at least as powerful as Bonferroni while providing the same FWER guarantee, so there is no reason not to use it.
```python
def holm_detailed(p_values, alpha=0.05):
    """
    Holm's method with step-by-step explanation.
    """
    n = len(p_values)
    sorted_idx = np.argsort(p_values)
    sorted_p = p_values[sorted_idx]

    print("Holm's Step-Down Procedure:")
    print("-" * 60)
    print(f"{'Step':<6} {'p-value':<12} {'Threshold':<12} {'Decision':<15}")
    print("-" * 60)

    significant = np.zeros(n, dtype=bool)
    all_rejected = True
    for i in range(n):
        threshold = alpha / (n - i)
        reject = sorted_p[i] < threshold and all_rejected
        if reject:
            significant[sorted_idx[i]] = True
            decision = "Reject"
        else:
            all_rejected = False
            decision = "Fail to reject"
        print(f"{i+1:<6} {sorted_p[i]:<12.4f} {threshold:<12.4f} {decision:<15}")

    print("-" * 60)
    print(f"Total rejected: {significant.sum()}")
    return significant


# Example
p_values = np.array([0.001, 0.008, 0.015, 0.025, 0.04, 0.06, 0.10])
holm_detailed(p_values)
```
False Discovery Rate (FDR)
```python
def fdr_explained():
    """
    Explain the difference between FWER and FDR.
    """
    print("FWER vs. FDR:")
    print("=" * 60)
    print()
    print("FWER (Family-Wise Error Rate):")
    print("  Controls: P(at least 1 false positive)")
    print("  Interpretation: 'I'm 95% confident ALL discoveries are real'")
    print("  Use when: Each false positive is very costly")
    print("  Methods: Bonferroni, Holm, Hochberg")
    print()
    print("FDR (False Discovery Rate):")
    print("  Controls: E[False Positives / All Positives]")
    print("  Interpretation: 'On average, 5% of my discoveries are false'")
    print("  Use when: Doing exploratory analysis, will follow up")
    print("  Methods: Benjamini-Hochberg, Benjamini-Yekutieli")
    print()
    print("Example with 100 tests, 10 true effects:")
    print("-" * 40)
    print("FWER=0.05: 'I'm 95% sure I have NO false positives'")
    print("           Very strict, might miss real effects")
    print()
    print("FDR=0.05: 'Of my discoveries, ~5% are false positives'")
    print("          If I find 15 significant, expect ~1 false")


fdr_explained()
```
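The arithmetic behind that last printed line: controlling FDR at q = 0.05 and making 15 discoveries means the expected number of false discoveries is roughly

$$0.05 \times 15 = 0.75 \approx 1$$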
```python
def benjamini_hochberg_detailed(p_values, alpha=0.05):
    """
    BH-FDR procedure with explanation.
    """
    n = len(p_values)
    sorted_idx = np.argsort(p_values)
    sorted_p = p_values[sorted_idx]

    print("Benjamini-Hochberg FDR Procedure:")
    print("-" * 60)
    print(f"{'Rank':<6} {'p-value':<12} {'BH threshold':<15} {'Significant':<12}")
    print("-" * 60)

    # Find largest k where p(k) <= k*alpha/n
    max_k = 0
    for k in range(n):
        threshold = (k + 1) * alpha / n
        if sorted_p[k] <= threshold:
            max_k = k + 1

    for k in range(n):
        threshold = (k + 1) * alpha / n
        sig = "Yes" if k < max_k else "No"
        print(f"{k+1:<6} {sorted_p[k]:<12.4f} {threshold:<15.4f} {sig:<12}")

    print("-" * 60)
    print(f"Reject all with rank <= {max_k}")

    significant = np.zeros(n, dtype=bool)
    for k in range(max_k):
        significant[sorted_idx[k]] = True
    return significant


# Example
p_values = np.array([0.001, 0.005, 0.01, 0.02, 0.03, 0.04, 0.08, 0.15, 0.30, 0.60])
benjamini_hochberg_detailed(p_values)
```
Comparison of Methods
```python
def compare_correction_methods(p_values, alpha=0.05):
    """
    Compare different correction methods side by side.
    """
    n = len(p_values)

    # Sort for display
    sorted_idx = np.argsort(p_values)
    sorted_p = p_values[sorted_idx]

    methods = {
        'Uncorrected': lambda p: p < alpha,
        'Bonferroni': lambda p: p < alpha / n,
        'Holm': lambda p: holm_correction(p, alpha),
        'BH-FDR': lambda p: benjamini_hochberg(p, alpha)
    }

    results = {}
    for name, method in methods.items():
        if name in ['Holm', 'BH-FDR']:
            results[name] = method(p_values)
        else:
            results[name] = np.array([method(p) for p in p_values])

    # Display
    print("Method Comparison:")
    print("=" * 70)
    print(f"{'p-value':<12}", end='')
    for name in methods:
        print(f"{name:<12}", end='')
    print()
    print("-" * 70)
    for idx, p in zip(sorted_idx, sorted_p):
        print(f"{p:<12.4f}", end='')
        for name in methods:
            sig = "Yes" if results[name][idx] else "No"
            print(f"{sig:<12}", end='')
        print()
    print("-" * 70)
    print(f"{'Total sig:':<12}", end='')
    for name in methods:
        print(f"{results[name].sum():<12}", end='')
    print()


# Example
p_values = np.array([0.001, 0.008, 0.012, 0.024, 0.035, 0.048, 0.06, 0.12, 0.25, 0.55])
compare_correction_methods(p_values)
```
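If you would rather not hand-roll these procedures, statsmodels ships the same corrections. A minimal cross-check sketch, assuming statsmodels is installed (it is not otherwise used in this article):

```python
# Cross-check the manual implementations with statsmodels (assumes statsmodels is installed)
import numpy as np
from statsmodels.stats.multitest import multipletests

p_values = np.array([0.001, 0.008, 0.012, 0.024, 0.035, 0.048, 0.06, 0.12, 0.25, 0.55])

for method in ['bonferroni', 'holm', 'fdr_bh']:
    reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method=method)
    print(f"{method:<12} rejects {reject.sum()} of {len(p_values)} tests; "
          f"adjusted p-values: {np.round(p_adjusted, 3)}")
```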
When NOT to Correct
Independent Decisions
```python
def when_not_to_correct():
    """
    Situations where correction may not be appropriate.
    """
    no_correction = {
        'Independent decisions': {
            'example': 'Testing 3 different products for different markets',
            'reason': 'Each decision is separate; false positive in one doesn\'t affect others',
        },
        'Planned comparisons': {
            'example': 'Treatment vs control (planned), other comparisons exploratory',
            'reason': 'Primary hypothesis is single test; don\'t penalize for exploratory',
        },
        'Replication settings': {
            'example': 'Confirming a previous finding',
            'reason': 'Prior evidence reduces false positive concern',
        },
        'Descriptive analyses': {
            'example': 'Exploring patterns to generate hypotheses',
            'reason': 'Not making claims; just describing data',
        }
    }

    print("When Multiple Comparison Correction May NOT Be Needed:")
    print("=" * 60)
    for scenario, info in no_correction.items():
        print(f"\n{scenario}:")
        print(f"  Example: {info['example']}")
        print(f"  Reason: {info['reason']}")


when_not_to_correct()
```
The Decision Framework
```python
def correction_decision_framework():
    """
    Framework for deciding on correction approach.
    """
    print("""
    MULTIPLE COMPARISON DECISION FRAMEWORK
    ======================================

    1. Are you testing multiple hypotheses?
       └── No  → No correction needed
       └── Yes → Continue

    2. Are the tests related (answering one question)?
       └── No (separate decisions) → Possibly no correction
       └── Yes → Continue

    3. What's the cost of false positives?
       └── Very high (each false positive is costly)
           → Use FWER control (Holm preferred)
       └── Moderate (can follow up to verify)
           → Use FDR (Benjamini-Hochberg)
       └── Low (exploratory, just generating hypotheses)
           → Consider no correction, report all p-values

    4. Which FWER method?
       └── Always use Holm over Bonferroni
           (same guarantee, more power)

    5. Report transparently:
       - Number of tests conducted
       - Correction method used (and why)
       - Both raw and adjusted p-values
    """)


correction_decision_framework()
```
Common Scenarios
Scenario 1: Post-Hoc Pairwise Comparisons
```python
def posthoc_comparison_example():
    """
    Multiple comparisons after ANOVA.
    """
    print("Post-Hoc Pairwise Comparisons After ANOVA:")
    print("-" * 50)
    print()
    print("4 groups = 6 pairwise comparisons")
    print("5 groups = 10 pairwise comparisons")
    print("6 groups = 15 pairwise comparisons")
    print()
    print("Options:")
    print("  1. Tukey's HSD: Controls FWER, designed for all-pairs comparisons")
    print("  2. Holm: General purpose, good choice")
    print("  3. Dunnett's test: If only comparing each group to a control")
    print()
    print("Note: If the ANOVA is not significant, post-hoc tests may not be needed")


posthoc_comparison_example()
```
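The R snippet later in this article handles Tukey via emmeans; in Python, statsmodels provides a Tukey HSD routine. A minimal sketch with simulated data (the group labels and sample sizes are made up for illustration):

```python
# Tukey's HSD for all pairwise comparisons (assumes statsmodels is installed; data simulated)
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

np.random.seed(42)
values = np.concatenate([
    np.random.normal(0.0, 1, 40),  # group A
    np.random.normal(0.3, 1, 40),  # group B
    np.random.normal(0.6, 1, 40),  # group C
])
groups = np.repeat(['A', 'B', 'C'], 40)

# FWER controlled at 0.05 across all three pairwise comparisons
print(pairwise_tukeyhsd(values, groups, alpha=0.05))
```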
Scenario 2: Multiple Outcomes
```python
def multiple_outcomes_example():
    """
    Testing treatment effect on multiple outcomes.
    """
    print("Multiple Outcomes in an Experiment:")
    print("-" * 50)
    print()
    print("Testing intervention on: engagement, revenue, retention")
    print()
    print("Options:")
    print()
    print("1. Pre-specify PRIMARY outcome")
    print("   - Only correct for secondary outcomes")
    print("   - Primary gets full alpha")
    print()
    print("2. Correct across all")
    print("   - Use Holm or FDR")
    print("   - More conservative for primary")
    print()
    print("3. Composite outcome")
    print("   - Combine outcomes into single metric")
    print("   - Single test, no correction needed")


multiple_outcomes_example()
```
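For option 3 in the printout above, one common construction (a sketch, not the only way to build a composite; the metric names and effect sizes here are simulated for illustration) is to z-score each metric on the pooled data, average the z-scores per user, and run a single test:

```python
# Illustrative composite outcome: z-score each metric on pooled data, then average per user
import numpy as np
from scipy import stats

np.random.seed(42)
n = 200
metrics = ['engagement', 'revenue', 'retention']
control = {m: np.random.normal(0.0, 1, n) for m in metrics}
treatment = {m: np.random.normal(0.15, 1, n) for m in metrics}

# Pooled mean/sd per metric so both groups are standardized on the same scale
pooled = {m: np.concatenate([control[m], treatment[m]]) for m in metrics}
mu = {m: pooled[m].mean() for m in metrics}
sd = {m: pooled[m].std(ddof=1) for m in metrics}

def composite(group):
    # Average of standardized outcomes for each user
    return np.mean([(group[m] - mu[m]) / sd[m] for m in metrics], axis=0)

t, p = stats.ttest_ind(composite(treatment), composite(control))
print(f"Composite outcome: t = {t:.2f}, p = {p:.4f} (single test, no correction needed)")
```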
Scenario 3: Exploratory Analysis
```python
def exploratory_analysis_example():
    """
    Guidance for exploratory analyses.
    """
    print("Exploratory Analysis (many tests):")
    print("-" * 50)
    print()
    print("Common in data mining, genomics, A/B test segmentation")
    print()
    print("Recommendations:")
    print("  1. Use FDR (Benjamini-Hochberg)")
    print("     - Accepts ~5% false discoveries")
    print("     - Much more power than FWER methods")
    print()
    print("  2. Report all p-values")
    print("     - Let readers assess")
    print("     - Show distribution of p-values")
    print()
    print("  3. Plan for validation")
    print("     - Exploratory findings need replication")
    print("     - Don't over-interpret significant results")
    print()
    print("  4. Be transparent")
    print("     - Report total number of tests")
    print("     - Explain why FDR vs FWER")


exploratory_analysis_example()
```
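The "show distribution of p-values" recommendation is easy to act on: under the null the p-values should look roughly uniform, and a spike near zero suggests real effects. A minimal sketch, assuming matplotlib is available (it is not otherwise used here) and using simulated tests:

```python
# Histogram of p-values from many tests (assumes matplotlib is installed; data simulated)
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

np.random.seed(42)
p_values = []
for effect in [0.5] * 20 + [0.0] * 180:  # 20 real effects among 200 tests
    g1 = np.random.normal(0, 1, 50)
    g2 = np.random.normal(effect, 1, 50)
    p_values.append(stats.ttest_ind(g1, g2).pvalue)

plt.hist(p_values, bins=20, edgecolor='black')
plt.axhline(len(p_values) / 20, linestyle='--', label='Expected height if all nulls were true')
plt.xlabel('p-value')
plt.ylabel('Count')
plt.legend()
plt.title('A spike near 0 suggests real effects; the rest should look uniform')
plt.show()
```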
R Implementation
```r
# Multiple comparison corrections in R
p_values <- c(0.001, 0.008, 0.015, 0.025, 0.04, 0.06, 0.10)

# Built-in corrections
p.adjust(p_values, method = "bonferroni")
p.adjust(p_values, method = "holm")
p.adjust(p_values, method = "BH")  # Benjamini-Hochberg
p.adjust(p_values, method = "BY")  # Benjamini-Yekutieli

# Compare all methods
methods <- c("bonferroni", "holm", "hochberg", "BH", "BY")
adjusted <- sapply(methods, function(m) p.adjust(p_values, method = m))
colnames(adjusted) <- methods
print(adjusted)

# For post-hoc comparisons after ANOVA (df is a data frame with columns 'value' and 'group')
library(emmeans)
model <- aov(value ~ group, data = df)
emmeans(model, pairwise ~ group, adjust = "tukey")
emmeans(model, pairwise ~ group, adjust = "holm")
```
Summary Table
| Method | Controls | Power | Best For |
|---|---|---|---|
| Bonferroni | FWER | Lowest | Simple cases with few tests |
| Holm | FWER | Higher than Bonferroni | General-purpose FWER control |
| Hochberg | FWER | Higher than Holm | Independent or positively dependent tests |
| BH (FDR) | FDR | High | Exploratory work with many tests |
| None | Per-test error only | Highest | Truly independent decisions |
Related Methods
- Assumption Checks Master Guide — The pillar article
- Post-Hoc Tests — Pairwise comparisons
- Multiple Experiments — A/B testing context
Key Takeaway
Multiple comparison correction is necessary when many related tests inflate false positive rates. However, Bonferroni is usually too conservative. Holm's method gives the same FWER guarantee with more power and should be your default for FWER control. For exploratory work with many tests, FDR (Benjamini-Hochberg) provides a better power-protection tradeoff. And sometimes, when tests represent genuinely independent decisions, no correction is appropriate. Always report how many tests you ran and how you handled multiplicity.
References
- Holm, S. (1979). A simple sequentially rejective multiple test procedure. *Scandinavian Journal of Statistics*, 6(2), 65-70. https://www.jstor.org/stable/4615733
- Hochberg, Y., & Benjamini, Y. (1990). More powerful procedures for multiple significance testing. *Statistics in Medicine*, 9(7), 811-818.
- Benjamini, Y., & Hochberg, Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. *Journal of the Royal Statistical Society: Series B*, 57(1), 289-300. https://www.jstor.org/stable/2346101
Frequently Asked Questions
When should I correct for multiple comparisons?
When several related tests jointly answer a single question and a false positive in any one of them would change your conclusion, for example post-hoc pairwise comparisons or multiple outcome metrics for one intervention. When the tests represent genuinely independent decisions, or the work is exploratory and will be validated later, correction may not be needed.
Why is Bonferroni 'too conservative'?
Dividing α by the number of tests guarantees FWER control under any dependence structure, but the per-test threshold becomes so strict that power drops sharply as the number of tests grows. Holm gives the same guarantee with more power, and FDR methods give much more power when a small fraction of false discoveries is tolerable.
What's the difference between FWER and FDR?
FWER is the probability of at least one false positive across the whole family of tests; FDR is the expected proportion of false positives among the results you declare significant. FWER control (Bonferroni, Holm) is stricter and suits confirmatory work; FDR control (Benjamini-Hochberg) trades a few false discoveries for substantially more power in exploratory work.