Paired Evaluation: McNemar's Test for Before/After Classification
When the same examples are evaluated by two models, use McNemar's test for proper inference. Learn why paired analysis is more powerful and how to implement it correctly.
Quick Hits
- Paired tests are more powerful—they control for example difficulty
- McNemar's test only uses discordant pairs (one right, one wrong)
- Focus on: examples where A got right but B got wrong, and vice versa
- Two models with 80% accuracy can have very different error patterns
- Exact test for small samples (<25 discordant pairs), chi-squared for larger
TL;DR
When evaluating two models on the same examples, use McNemar's test—not independent-sample tests. McNemar's focuses on discordant pairs: examples where one model is right and the other is wrong. This controls for example difficulty and gives more power. The test asks: are (A correct, B wrong) and (A wrong, B correct) equally likely? If not, one model is systematically better.
Why Paired Tests?
The Problem with Unpaired Comparison
Scenario: Model A has 85% accuracy, Model B has 82% accuracy.
Unpaired approach: Compare 0.85 vs 0.82 using a proportion test.
Problem: This ignores that both models are scored on the same examples, so their accuracy estimates are correlated.
Why Pairing Matters
import numpy as np
from scipy import stats

def demonstrate_pairing_power():
    """
    Show why paired tests are more powerful.
    """
    np.random.seed(42)
    n = 500

    # Example difficulty (some hard, some easy)
    difficulty = np.random.beta(2, 2, n)

    # Model A: better overall
    prob_a = 0.85 - 0.5 * difficulty  # Harder examples = lower P(correct)
    correct_a = np.random.random(n) < prob_a

    # Model B: slightly worse
    prob_b = 0.82 - 0.5 * difficulty
    correct_b = np.random.random(n) < prob_b

    # Unpaired test (wrong but commonly used)
    acc_a = correct_a.mean()
    acc_b = correct_b.mean()
    pooled_p = (acc_a + acc_b) / 2
    se_unpaired = np.sqrt(pooled_p * (1 - pooled_p) * 2 / n)
    z_unpaired = (acc_a - acc_b) / se_unpaired
    p_unpaired = 2 * (1 - stats.norm.cdf(abs(z_unpaired)))

    # Paired test (correct): count discordant pairs
    a_only = sum(correct_a & ~correct_b)
    b_only = sum(~correct_a & correct_b)

    # McNemar's chi-squared with continuity correction
    if a_only + b_only > 0:
        chi2 = (abs(a_only - b_only) - 1)**2 / (a_only + b_only)
        p_paired = 1 - stats.chi2.cdf(chi2, 1)
    else:
        p_paired = 1.0

    print("Paired vs Unpaired Test Comparison")
    print("=" * 50)
    print(f"Accuracy A: {acc_a:.1%}")
    print(f"Accuracy B: {acc_b:.1%}")
    print(f"Difference: {acc_a - acc_b:.1%}")
    print(f"\nUnpaired z-test p-value: {p_unpaired:.4f}")
    print(f"Paired McNemar p-value: {p_paired:.4f}")
    print(f"\nDiscordant pairs:")
    print(f"  A correct, B wrong: {a_only}")
    print(f"  A wrong, B correct: {b_only}")
    print(f"\nPaired test is {'more' if p_paired < p_unpaired else 'less'} significant")

demonstrate_pairing_power()
Visual: The 2×2 Table
                     Model B: Correct    Model B: Wrong
Model A: Correct     a (both)            b (A only)
Model A: Wrong       c (B only)          d (neither)
- a: Both correct (concordant)
- d: Both wrong (concordant)
- b: Only A correct (discordant)
- c: Only B correct (discordant)
McNemar's test uses only b and c.
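As a minimal sketch (assuming `correct_a` and `correct_b` are boolean arrays of per-example correctness, as in the simulation above), the four cells can be tabulated directly:

import numpy as np

def contingency_counts(correct_a, correct_b):
    """Tabulate the paired 2x2 table from per-example correctness."""
    correct_a = np.asarray(correct_a, dtype=bool)
    correct_b = np.asarray(correct_b, dtype=bool)
    a = int(np.sum(correct_a & correct_b))    # both correct
    b = int(np.sum(correct_a & ~correct_b))   # only A correct
    c = int(np.sum(~correct_a & correct_b))   # only B correct
    d = int(np.sum(~correct_a & ~correct_b))  # both wrong
    return a, b, c, d

# Example: contingency_counts([1, 1, 0, 0], [1, 0, 1, 0]) returns (1, 1, 1, 1)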
McNemar's Test
The Statistic
$$\chi^2 = \frac{(b - c)^2}{b + c}$$
Or with continuity correction:
$$\chi^2 = \frac{(|b - c| - 1)^2}{b + c}$$
Where:
- b = (A correct, B wrong)
- c = (A wrong, B correct)
Under H₀ (models equivalent): χ² ~ χ²(1)
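For instance, with the discordant counts used in the implementation example below (b = 95, c = 60), the continuity-corrected statistic is
$$\chi^2 = \frac{(|95 - 60| - 1)^2}{95 + 60} = \frac{34^2}{155} \approx 7.46, \qquad p \approx 0.006$$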
Implementation
def mcnemar_test(both_correct, a_only, b_only, both_wrong,
                 exact=None, continuity=True):
    """
    McNemar's test for paired binary outcomes.

    Parameters:
    -----------
    both_correct : int (a)
        Both models correct
    a_only : int (b)
        Model A correct, B wrong
    b_only : int (c)
        Model A wrong, B correct
    both_wrong : int (d)
        Both models wrong
    exact : bool or None
        Force exact test (True) or chi-squared (False). If None, auto-select.
    continuity : bool
        Apply continuity correction to chi-squared test

    Returns:
    --------
    Test results with p-value and effect size
    """
    n = both_correct + a_only + b_only + both_wrong
    n_discordant = a_only + b_only

    # Accuracies
    acc_a = (both_correct + a_only) / n
    acc_b = (both_correct + b_only) / n

    # Decide exact vs chi-squared
    if exact is None:
        exact = n_discordant < 25

    if n_discordant == 0:
        return {
            'p_value': 1.0,
            'test': 'No discordant pairs',
            'a_only': a_only,
            'b_only': b_only,
            'accuracy_a': acc_a,
            'accuracy_b': acc_b
        }

    if exact:
        # Exact binomial test (stats.binom_test is removed in recent SciPy; use binomtest)
        p_value = stats.binomtest(a_only, n_discordant, p=0.5,
                                  alternative='two-sided').pvalue
        test_name = 'Exact binomial'
    else:
        # Chi-squared, optionally with continuity correction
        if continuity:
            chi2 = (abs(a_only - b_only) - 1)**2 / n_discordant
        else:
            chi2 = (a_only - b_only)**2 / n_discordant
        p_value = 1 - stats.chi2.cdf(chi2, 1)
        test_name = 'Chi-squared (with cc)' if continuity else 'Chi-squared'

    # Effect size: odds ratio of discordant pairs
    odds_ratio = a_only / b_only if b_only > 0 else np.inf

    return {
        'test': test_name,
        'p_value': p_value,
        'significant': p_value < 0.05,
        'a_only': a_only,
        'b_only': b_only,
        'n_discordant': n_discordant,
        'accuracy_a': acc_a,
        'accuracy_b': acc_b,
        'accuracy_diff': acc_a - acc_b,
        'odds_ratio': odds_ratio,
        'both_correct': both_correct,
        'both_wrong': both_wrong
    }
# Example: Classification models
result = mcnemar_test(
    both_correct=680,
    a_only=95,
    b_only=60,
    both_wrong=165
)
print("McNemar's Test Results")
print("=" * 50)
print(f"Test used: {result['test']}")
print(f"\nContingency table:")
print(f" Both correct: {result['both_correct']}")
print(f" Only A correct: {result['a_only']}")
print(f" Only B correct: {result['b_only']}")
print(f" Both wrong: {result['both_wrong']}")
print(f"\nAccuracy A: {result['accuracy_a']:.1%}")
print(f"Accuracy B: {result['accuracy_b']:.1%}")
print(f"Difference: {result['accuracy_diff']:+.1%}")
print(f"\np-value: {result['p_value']:.4f}")
print(f"Significant at α=0.05: {result['significant']}")
print(f"\nOdds ratio (A only / B only): {result['odds_ratio']:.2f}")
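If you prefer a library routine, statsmodels ships McNemar's test; a minimal sketch (assuming statsmodels is installed) that should reproduce the chi-squared result above:

from statsmodels.stats.contingency_tables import mcnemar

# 2x2 table: rows = Model A (correct, wrong), columns = Model B (correct, wrong)
table = [[680, 95],
         [60, 165]]

# exact=False uses the chi-squared form; correction=True applies the continuity correction
sm_result = mcnemar(table, exact=False, correction=True)
print(f"statsmodels McNemar: statistic={sm_result.statistic:.2f}, p-value={sm_result.pvalue:.4f}")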
Confidence Interval for Difference
def mcnemar_ci(a_only, b_only, n_total, alpha=0.05):
    """
    Wald confidence interval for the paired accuracy difference.

    Note: this is the simple Wald interval; score-based intervals
    (e.g., Newcombe, 1998) behave better with small discordant counts.
    """
    # Point estimate of the accuracy difference
    diff = (a_only - b_only) / n_total

    # SE of the paired difference: only the discordant counts contribute
    se = np.sqrt((a_only + b_only - (a_only - b_only)**2 / n_total) / n_total**2)

    z = stats.norm.ppf(1 - alpha / 2)
    ci_lower = diff - z * se
    ci_upper = diff + z * se

    return {
        'difference': diff,
        'se': se,
        'ci_lower': ci_lower,
        'ci_upper': ci_upper
    }
# Example
ci_result = mcnemar_ci(a_only=95, b_only=60, n_total=1000)
print(f"\nAccuracy difference: {ci_result['difference']:+.1%}")
print(f"95% CI: ({ci_result['ci_lower']:+.1%}, {ci_result['ci_upper']:+.1%})")
Multiple Models: Cochran's Q
When comparing more than two models on the same examples, use Cochran's Q test:
def cochrans_q(correct_matrix):
    """
    Cochran's Q test for multiple models on same examples.

    Parameters:
    -----------
    correct_matrix : array
        Shape (n_examples, n_models), binary (1 = correct)

    Returns:
    --------
    Q statistic and p-value
    """
    correct_matrix = np.array(correct_matrix)
    n_examples, n_models = correct_matrix.shape

    # Row sums (total correct per example)
    L = correct_matrix.sum(axis=1)

    # Column sums (total correct per model)
    T = correct_matrix.sum(axis=0)

    # Grand total
    N = correct_matrix.sum()

    # Cochran's Q
    numerator = (n_models - 1) * (n_models * np.sum(T**2) - N**2)
    denominator = n_models * N - np.sum(L**2)
    Q = numerator / denominator if denominator > 0 else 0

    # Under H0, Q ~ chi-squared(k-1)
    p_value = 1 - stats.chi2.cdf(Q, n_models - 1)

    return {
        'Q': Q,
        'df': n_models - 1,
        'p_value': p_value,
        'model_accuracies': T / n_examples
    }

# Example: 3 models, 500 examples
np.random.seed(42)
n_examples = 500
n_models = 3

# Simulate with model A best, B medium, C worst
difficulty = np.random.beta(2, 2, n_examples)
correct = np.zeros((n_examples, n_models))
for m, base_acc in enumerate([0.85, 0.82, 0.78]):
    prob = base_acc - 0.4 * difficulty
    correct[:, m] = np.random.random(n_examples) < prob

result = cochrans_q(correct)
print("Cochran's Q Test (Multiple Models)")
print("=" * 50)
print(f"Q statistic: {result['Q']:.2f}")
print(f"Degrees of freedom: {result['df']}")
print(f"p-value: {result['p_value']:.4f}")
print(f"\nModel accuracies: {[f'{a:.1%}' for a in result['model_accuracies']]}")
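Cochran's Q is an omnibus test: it says whether any model differs, not which. A common follow-up, sketched here under the assumption that the mcnemar_test function defined above is in scope, is pairwise McNemar tests with a simple Bonferroni adjustment:

import numpy as np
from itertools import combinations

def pairwise_mcnemar(correct_matrix, alpha=0.05):
    """Pairwise McNemar tests between all model pairs, Bonferroni-corrected."""
    correct_matrix = np.asarray(correct_matrix, dtype=bool)
    n_models = correct_matrix.shape[1]
    n_tests = n_models * (n_models - 1) // 2
    results = []
    for i, j in combinations(range(n_models), 2):
        ci, cj = correct_matrix[:, i], correct_matrix[:, j]
        res = mcnemar_test(
            both_correct=int(np.sum(ci & cj)),
            a_only=int(np.sum(ci & ~cj)),
            b_only=int(np.sum(~ci & cj)),
            both_wrong=int(np.sum(~ci & ~cj)),
        )
        res['pair'] = (i, j)
        res['p_adjusted'] = min(1.0, res['p_value'] * n_tests)  # Bonferroni
        res['significant_adjusted'] = res['p_adjusted'] < alpha
        results.append(res)
    return results

for res in pairwise_mcnemar(correct):
    print(f"Models {res['pair']}: p={res['p_value']:.4f}, adjusted p={res['p_adjusted']:.4f}")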
Sample Size and Power
How Many Discordant Pairs?
def sample_size_mcnemar(prop_discordant, effect_size, power=0.8, alpha=0.05):
    """
    Approximate sample size for McNemar's test.

    Parameters:
    -----------
    prop_discordant : float
        Expected proportion of discordant pairs
    effect_size : float
        Expected difference in discordant proportions (e.g., 0.1 = 55% vs 45%)
    power : float
        Desired power
    alpha : float
        Significance level

    Returns:
    --------
    Required total sample size
    """
    from scipy.stats import norm

    # Under H0 the discordant pairs split 50/50;
    # under the alternative, P(A only | discordant) = (1 + effect_size) / 2
    p0 = 0.5
    p1 = (1 + effect_size) / 2

    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)

    # One-sample binomial sample size for the discordant pairs (testing p0 vs p1)
    n_disc = ((z_alpha * np.sqrt(p0 * (1 - p0)) +
               z_beta * np.sqrt(p1 * (1 - p1))) / (p1 - p0)) ** 2

    # Scale up from discordant pairs to total sample size
    n_total = n_disc / prop_discordant

    return int(np.ceil(n_total))

print("Sample Size for McNemar's Test")
print("=" * 50)
print("(Assuming 20% discordant pairs)")
print(f"{'Effect':>15} {'Required n':>15}")
print("-" * 30)
for effect in [0.05, 0.10, 0.15, 0.20]:
    n = sample_size_mcnemar(0.20, effect)
    print(f"{f'{50 + effect * 50:.1f}% vs {50 - effect * 50:.1f}%':>15} {n:>15,}")
R Implementation
# McNemar's test on the 2x2 contingency table
mcnemar.test(matrix(c(both_correct, a_only, b_only, both_wrong),
                    nrow = 2))

# Exact version for small samples: base R's mcnemar.test has no exact option,
# so run a binomial test on the discordant pairs directly
binom.test(a_only, a_only + b_only, p = 0.5)
# Cochran's Q (multiple models): cochran.qtest expects long-format data
# (one row per example x model) and a formula such as correct ~ model | example
library(RVAideMemoire)
cochran.qtest(correct ~ model | example, data = results_long)
Common Mistakes
Mistake 1: Using Unpaired Tests
Wrong: an independent-sample z-test comparing the two accuracy proportions.
Right: McNemar's test on the paired 2×2 contingency table.
Mistake 2: Ignoring Concordant Pairs Info
Concordant pairs still tell you about overall task difficulty (see the sketch after this list):
- Many "both correct": Easy task
- Many "both wrong": Hard task
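A minimal sketch of that diagnostic, reusing the counts from the worked example above:

# Share of concordant pairs as a rough read on task difficulty
both_correct, both_wrong, n = 680, 165, 1000  # counts from the example above
print(f"Both correct: {both_correct / n:.0%} of examples (easy for both models)")
print(f"Both wrong:   {both_wrong / n:.0%} of examples (hard for both models)")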
Mistake 3: Chi-Squared with Too Few Discordant Pairs
The chi-squared approximation breaks down when discordant pairs are scarce. Use the exact binomial test when n_discordant < 25:
p_value = stats.binomtest(a_only, a_only + b_only, p=0.5).pvalue
Related Methods
- Model Evaluation (Pillar) - Complete framework
- Comparing Models: Win Rate - Unpaired comparison
- Bootstrap for Metric Deltas - Continuous metrics
- Paired vs. Independent Data - When to use paired tests
Key Takeaway
When comparing models on the same examples, use McNemar's test—it's the paired equivalent of comparing proportions. The test focuses on discordant pairs: examples where one model succeeds and the other fails. This is more powerful than unpaired tests because it controls for example difficulty. Count (A correct, B wrong) vs (A wrong, B correct) and test whether they differ. The same models with identical accuracy can have very different discordant patterns—McNemar's reveals whether one truly outperforms the other.
References
- McNemar, Q. (1947). Note on the sampling error of the difference between correlated proportions or percentages. *Psychometrika*, 12(2), 153-157. https://doi.org/10.1007/BF02295996
- Dietterich, T. G. (1998). Approximate statistical tests for comparing supervised classification learning algorithms. *Neural Computation*, 10(7), 1895-1923.
- Demšar, J. (2006). Statistical comparisons of classifiers over multiple data sets. *Journal of Machine Learning Research*, 7, 1-30. https://www.jmlr.org/papers/v7/demsar06a.html
- https://doi.org/10.1093/biomet/34.1-2.123
Frequently Asked Questions
Why not just compare accuracy rates with a z-test?
Because both models are scored on the same examples, their accuracy estimates are correlated. An independent-sample z-test ignores this shared structure and loses power; McNemar's test conditions on the shared examples and uses only the informative discordant pairs.
What are discordant pairs?
Examples where exactly one model is correct: (A correct, B wrong) or (A wrong, B correct). Concordant pairs, where both models succeed or both fail, carry no information about which model is better.
Can I use McNemar's test for continuous metrics?
No. McNemar's test applies to paired binary outcomes such as correct/incorrect. For continuous per-example metrics, use a paired approach such as a paired t-test or a bootstrap on the per-example metric deltas (see Related Methods).