
Paired Evaluation: McNemar's Test for Before/After Classification

When the same examples are evaluated by two models, use McNemar's test for proper inference. Learn why paired analysis is more powerful and how to implement it correctly.


Quick Hits

  • Paired tests are more powerful—they control for example difficulty
  • McNemar's test only uses discordant pairs (one right, one wrong)
  • Focus on the examples where A is right and B is wrong, and vice versa
  • Two models with 80% accuracy can have very different error patterns
  • Exact test for small samples (<25 discordant pairs), chi-squared for larger

TL;DR

When evaluating two models on the same examples, use McNemar's test—not independent-sample tests. McNemar's focuses on discordant pairs: examples where one model is right and the other is wrong. This controls for example difficulty and gives more power. The test asks: are (A correct, B wrong) and (A wrong, B correct) equally likely? If not, one model is systematically better.


Why Paired Tests?

The Problem with Unpaired Comparison

Scenario: Model A has 85% accuracy, Model B has 82% accuracy.

Unpaired approach: Compare 0.85 vs 0.82 using a proportion test.

Problem: Ignores that they're evaluated on the same examples.

Why Pairing Matters

import numpy as np
from scipy import stats


def demonstrate_pairing_power():
    """
    Show why paired tests are more powerful.
    """
    np.random.seed(42)
    n = 500

    # Example difficulty (some hard, some easy)
    difficulty = np.random.beta(2, 2, n)

    # Model A: better overall
    prob_a = 0.85 - 0.5 * difficulty  # Harder examples = lower P(correct)
    correct_a = np.random.random(n) < prob_a

    # Model B: slightly worse
    prob_b = 0.82 - 0.5 * difficulty
    correct_b = np.random.random(n) < prob_b

    # Unpaired test (wrong but commonly used)
    acc_a = correct_a.mean()
    acc_b = correct_b.mean()
    pooled_p = (acc_a + acc_b) / 2
    se_unpaired = np.sqrt(pooled_p * (1 - pooled_p) * 2 / n)
    z_unpaired = (acc_a - acc_b) / se_unpaired
    p_unpaired = 2 * (1 - stats.norm.cdf(abs(z_unpaired)))

    # Paired test (correct)
    # Count discordant pairs
    a_only = sum(correct_a & ~correct_b)
    b_only = sum(~correct_a & correct_b)

    # McNemar's chi-squared
    if a_only + b_only > 0:
        chi2 = (abs(a_only - b_only) - 1)**2 / (a_only + b_only)
        p_paired = 1 - stats.chi2.cdf(chi2, 1)
    else:
        p_paired = 1.0

    print("Paired vs Unpaired Test Comparison")
    print("=" * 50)
    print(f"Accuracy A: {acc_a:.1%}")
    print(f"Accuracy B: {acc_b:.1%}")
    print(f"Difference: {acc_a - acc_b:.1%}")
    print(f"\nUnpaired z-test p-value: {p_unpaired:.4f}")
    print(f"Paired McNemar p-value: {p_paired:.4f}")
    print(f"\nDiscordant pairs:")
    print(f"  A correct, B wrong: {a_only}")
    print(f"  A wrong, B correct: {b_only}")
    print(f"\nPaired test is {'more' if p_paired < p_unpaired else 'less'} significant")


demonstrate_pairing_power()

Visual: The 2×2 Table

                  Model B
                  Correct    Wrong
Model A  Correct    a (both)   b (A only)
         Wrong      c (B only) d (neither)
  • a: Both correct (concordant)
  • d: Both wrong (concordant)
  • b: Only A correct (discordant)
  • c: Only B correct (discordant)

McNemar's test uses only b and c.
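
To make the table concrete, here is a minimal sketch that tallies the four cells from raw predictions (the arrays y_true, pred_a, and pred_b are hypothetical placeholders):

# Hypothetical labels and predictions from two models on the same examples
y_true = np.array([0, 1, 1, 0, 1, 0, 1, 1])
pred_a = np.array([0, 1, 1, 0, 0, 0, 1, 1])
pred_b = np.array([0, 1, 0, 1, 0, 0, 1, 1])

correct_a = pred_a == y_true
correct_b = pred_b == y_true

a = np.sum(correct_a & correct_b)    # both correct (concordant)
b = np.sum(correct_a & ~correct_b)   # only A correct (discordant)
c = np.sum(~correct_a & correct_b)   # only B correct (discordant)
d = np.sum(~correct_a & ~correct_b)  # both wrong (concordant)

print(f"a={a}, b={b}, c={c}, d={d}")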


McNemar's Test

The Statistic

$$\chi^2 = \frac{(b - c)^2}{b + c}$$

Or with continuity correction:

$$\chi^2 = \frac{(|b - c| - 1)^2}{b + c}$$

Where:

  • b = (A correct, B wrong)
  • c = (A wrong, B correct)

Under H₀ (models equivalent): χ² ~ χ²(1)
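
As a quick worked example, take the contingency table used in the implementation example below (b = 95, c = 60). With the continuity correction:

$$\chi^2 = \frac{(|95 - 60| - 1)^2}{95 + 60} = \frac{34^2}{155} \approx 7.46$$

Compared against a χ²(1) distribution, this gives p ≈ 0.006, so a 50/50 split of the discordant pairs is unlikely.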

Implementation

def mcnemar_test(both_correct, a_only, b_only, both_wrong,
                 exact=None, continuity=True):
    """
    McNemar's test for paired binary outcomes.

    Parameters:
    -----------
    both_correct : int (a)
        Both models correct
    a_only : int (b)
        Model A correct, B wrong
    b_only : int (c)
        Model A wrong, B correct
    both_wrong : int (d)
        Both models wrong
    exact : bool or None
        Force exact test (True) or chi-squared (False). If None, auto-select.
    continuity : bool
        Apply continuity correction to chi-squared test

    Returns:
    --------
    Test results with p-value and effect size
    """
    n = both_correct + a_only + b_only + both_wrong
    n_discordant = a_only + b_only

    # Accuracies
    acc_a = (both_correct + a_only) / n
    acc_b = (both_correct + b_only) / n

    # Decide exact vs chi-squared
    if exact is None:
        exact = n_discordant < 25

    if n_discordant == 0:
        return {
            'p_value': 1.0,
            'test': 'No discordant pairs',
            'a_only': a_only,
            'b_only': b_only,
            'accuracy_a': acc_a,
            'accuracy_b': acc_b
        }

    if exact:
        # Exact binomial test on the discordant pairs
        # (binomtest replaces the deprecated scipy.stats.binom_test)
        p_value = stats.binomtest(a_only, n_discordant, p=0.5,
                                  alternative='two-sided').pvalue
        test_name = 'Exact binomial'
    else:
        # Chi-squared with continuity correction
        if continuity:
            chi2 = (abs(a_only - b_only) - 1)**2 / n_discordant
        else:
            chi2 = (a_only - b_only)**2 / n_discordant

        p_value = 1 - stats.chi2.cdf(chi2, 1)
        test_name = f"Chi-squared {'(with cc)' if continuity else ''}"

    # Effect size: odds ratio of discordant pairs
    odds_ratio = a_only / b_only if b_only > 0 else np.inf

    return {
        'test': test_name,
        'p_value': p_value,
        'significant': p_value < 0.05,
        'a_only': a_only,
        'b_only': b_only,
        'n_discordant': n_discordant,
        'accuracy_a': acc_a,
        'accuracy_b': acc_b,
        'accuracy_diff': acc_a - acc_b,
        'odds_ratio': odds_ratio,
        'both_correct': both_correct,
        'both_wrong': both_wrong
    }


# Example: Classification models
result = mcnemar_test(
    both_correct=680,
    a_only=95,
    b_only=60,
    both_wrong=165
)

print("McNemar's Test Results")
print("=" * 50)
print(f"Test used: {result['test']}")
print(f"\nContingency table:")
print(f"  Both correct: {result['both_correct']}")
print(f"  Only A correct: {result['a_only']}")
print(f"  Only B correct: {result['b_only']}")
print(f"  Both wrong: {result['both_wrong']}")
print(f"\nAccuracy A: {result['accuracy_a']:.1%}")
print(f"Accuracy B: {result['accuracy_b']:.1%}")
print(f"Difference: {result['accuracy_diff']:+.1%}")
print(f"\np-value: {result['p_value']:.4f}")
print(f"Significant at α=0.05: {result['significant']}")
print(f"\nOdds ratio (A only / B only): {result['odds_ratio']:.2f}")

Confidence Interval for Difference

def mcnemar_ci(a_only, b_only, n_total, alpha=0.05):
    """
    Wald confidence interval for the paired accuracy difference.

    Note: Newcombe (1998) describes better-behaved score-based intervals
    for paired proportions; the simple Wald interval below is adequate
    for large samples.
    """
    # Point estimate of the accuracy difference
    diff = (a_only - b_only) / n_total

    # Wald SE for a paired difference in proportions:
    # only the discordant counts contribute to the variance
    se = np.sqrt((a_only + b_only - (a_only - b_only)**2 / n_total) / n_total**2)
    z = stats.norm.ppf(1 - alpha / 2)

    ci_lower = diff - z * se
    ci_upper = diff + z * se

    return {
        'difference': diff,
        'se': se,
        'ci_lower': ci_lower,
        'ci_upper': ci_upper
    }


# Example
ci_result = mcnemar_ci(a_only=95, b_only=60, n_total=1000)
print(f"\nAccuracy difference: {ci_result['difference']:+.1%}")
print(f"95% CI: ({ci_result['ci_lower']:+.1%}, {ci_result['ci_upper']:+.1%})")

Multiple Models: Cochran's Q

When comparing more than 2 models on same examples:

def cochrans_q(correct_matrix):
    """
    Cochran's Q test for multiple models on same examples.

    Parameters:
    -----------
    correct_matrix : array
        Shape (n_examples, n_models), binary (1 = correct)

    Returns:
    --------
    Q statistic and p-value
    """
    correct_matrix = np.array(correct_matrix)
    n_examples, n_models = correct_matrix.shape

    # Row sums (total correct per example)
    L = correct_matrix.sum(axis=1)

    # Column sums (total correct per model)
    T = correct_matrix.sum(axis=0)

    # Grand total
    N = correct_matrix.sum()

    # Cochran's Q
    numerator = (n_models - 1) * (n_models * np.sum(T**2) - N**2)
    denominator = n_models * N - np.sum(L**2)

    Q = numerator / denominator if denominator > 0 else 0

    # Under H0, Q ~ chi-squared(k-1)
    p_value = 1 - stats.chi2.cdf(Q, n_models - 1)

    return {
        'Q': Q,
        'df': n_models - 1,
        'p_value': p_value,
        'model_accuracies': T / n_examples
    }


# Example: 3 models, 500 examples
np.random.seed(42)
n_examples = 500
n_models = 3

# Simulate with model A best, B medium, C worst
difficulty = np.random.beta(2, 2, n_examples)
correct = np.zeros((n_examples, n_models))

for m, base_acc in enumerate([0.85, 0.82, 0.78]):
    prob = base_acc - 0.4 * difficulty
    correct[:, m] = np.random.random(n_examples) < prob

result = cochrans_q(correct)

print("Cochran's Q Test (Multiple Models)")
print("=" * 50)
print(f"Q statistic: {result['Q']:.2f}")
print(f"Degrees of freedom: {result['df']}")
print(f"p-value: {result['p_value']:.4f}")
print(f"\nModel accuracies: {[f'{a:.1%}' for a in result['model_accuracies']]}")

Sample Size and Power

How Many Discordant Pairs?

def sample_size_mcnemar(prop_discordant, effect_size, power=0.8, alpha=0.05):
    """
    Sample size for McNemar's test.

    Parameters:
    -----------
    prop_discordant : float
        Expected proportion of discordant pairs
    effect_size : float
        Expected difference in discordant proportions (e.g., 0.1 = 55% vs 45%)
    power : float
        Desired power
    alpha : float
        Significance level

    Returns:
    --------
    Required total sample size
    """
    from scipy.stats import norm

    # Conditional on a pair being discordant:
    # under H0, P(A only) = 0.5; under the alternative, P(A only) = (1 + effect_size) / 2
    p1 = (1 + effect_size) / 2
    p0 = 0.5  # Under null

    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)

    # Required number of discordant pairs
    # (one-sample binomial test of p1 against p0 = 0.5)
    n_disc = ((z_alpha * np.sqrt(p0 * (1 - p0)) +
               z_beta * np.sqrt(p1 * (1 - p1))) / (p1 - p0)) ** 2

    # Total sample size
    n_total = n_disc / prop_discordant

    return int(np.ceil(n_total))


print("Sample Size for McNemar's Test")
print("=" * 50)
print("(Assuming 20% discordant pairs)")
print(f"{'Effect':>15} {'Required n':>15}")
print("-" * 30)

for effect in [0.05, 0.10, 0.15, 0.20]:
    n = sample_size_mcnemar(0.20, effect)
    print(f"{f'{50+effect*50:.0f}% vs {50-effect*50:.0f}%':>15} {n:>15,}")

R Implementation

# McNemar's test (continuity correction is applied by default)
mcnemar.test(matrix(c(both_correct, a_only, b_only, both_wrong),
                    nrow = 2, byrow = TRUE))

# Exact McNemar for small samples: binomial test on the discordant pairs
# (base R's mcnemar.test has no exact option)
binom.test(a_only, a_only + b_only, p = 0.5)

# Cochran's Q (multiple models); cochran.qtest uses a formula interface
# on long-format data (long_results is a placeholder data frame with one
# row per example x model and columns correct, model, example)
library(RVAideMemoire)
cochran.qtest(correct ~ model | example, data = long_results)

Common Mistakes

Mistake 1: Using Unpaired Tests

Wrong: a two-sample z-test comparing the two accuracy proportions
Right: McNemar's test on the paired 2×2 contingency table

Mistake 2: Ignoring What Concordant Pairs Tell You

Concordant pairs don't enter the test, but they still describe overall task difficulty (a quick breakdown is sketched after this list):

  • Many "both correct": Easy task
  • Many "both wrong": Hard task

Mistake 3: Relying on the Chi-Squared Approximation with Few Discordant Pairs

The chi-squared approximation is unreliable when discordant pairs are scarce. Use the exact binomial test when n_discordant < 25:

p_value = stats.binomtest(a_only, a_only + b_only, p=0.5).pvalue


Key Takeaway

When comparing models on the same examples, use McNemar's test: it is the paired analogue of comparing two proportions. The test focuses on discordant pairs, the examples where one model succeeds and the other fails, which controls for example difficulty and gives more power than an unpaired test. Count (A correct, B wrong) versus (A wrong, B correct) and test whether they differ. Two models with identical accuracy can have very different discordant patterns; McNemar's test reveals whether one truly outperforms the other.


References

  1. McNemar, Q. (1947). Note on the sampling error of the difference between correlated proportions or percentages. *Psychometrika*, 12(2), 153-157. https://doi.org/10.1007/BF02295996
  2. Dietterich, T. G. (1998). Approximate statistical tests for comparing supervised classification learning algorithms. *Neural Computation*, 10(7), 1895-1923.
  3. Demšar, J. (2006). Statistical comparisons of classifiers over multiple data sets. *Journal of Machine Learning Research*, 7, 1-30. https://www.jmlr.org/papers/v7/demsar06a.html
  4. https://doi.org/10.1093/biomet/34.1-2.123

Frequently Asked Questions

Why not just compare accuracy rates with a z-test?
Because the samples aren't independent: they're the same examples. Paired analysis is more powerful because it controls for example difficulty. Two models might both struggle on hard examples but differ on medium ones; paired analysis isolates those differences.

What are discordant pairs?
Examples where the two models disagree: one is correct and the other is wrong. McNemar's test compares (A correct, B wrong) vs (A wrong, B correct). Concordant pairs, where both are right or both are wrong, don't help distinguish the models.

Can I use McNemar's test for continuous metrics?
McNemar's is for binary outcomes (correct/wrong). For continuous metrics, use a paired t-test or Wilcoxon signed-rank test (a short sketch follows below). For ordinal ratings, use the sign test or signed-rank test.
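
A minimal sketch of that continuous-metric case (the per-example score arrays here are hypothetical, simulated only to make the snippet runnable):

import numpy as np
from scipy import stats

# Hypothetical per-example scores (e.g., ROUGE or rubric scores) for two models
rng = np.random.default_rng(0)
scores_a = rng.normal(0.72, 0.10, 200)
scores_b = scores_a - rng.normal(0.02, 0.05, 200)  # B slightly worse on the same examples

# Paired t-test on the per-example differences
t_stat, p_t = stats.ttest_rel(scores_a, scores_b)

# Wilcoxon signed-rank test: a non-parametric alternative
w_stat, p_w = stats.wilcoxon(scores_a, scores_b)

print(f"Paired t-test:        t = {t_stat:.2f}, p = {p_t:.4f}")
print(f"Wilcoxon signed-rank: W = {w_stat:.1f}, p = {p_w:.4f}")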

