Model Evaluation

Statistically Significant but Meaningless: Practical Thresholds for Evals

A 0.5% accuracy improvement with p < 0.001 is real but worthless. Learn how to distinguish statistically significant results from practically meaningful ones in model evaluation.


Quick Hits

  • Statistical significance = unlikely due to chance; practical significance = worth caring about
  • With large eval sets, tiny differences become significant but remain meaningless
  • Define your minimum important difference (MID) before evaluation, not after
  • Report effect sizes and confidence intervals, not just p-values
  • Ask: "Would users notice?" and "Does the cost-benefit make sense?"

TL;DR

Statistical significance means an effect is unlikely due to chance. Practical significance means it matters. With large evaluation sets, even tiny model improvements become statistically significant. The solution: pre-define your minimum important difference (MID), report effect sizes and confidence intervals (not just p-values), and ask whether the improvement justifies the costs. A 0.5% accuracy gain with p < 0.001 is real but probably not worth deploying.


The Problem

Scenario

You evaluate your new model on 50,000 examples:

  • Accuracy improves from 87.2% to 87.5%
  • This 0.3% improvement has p < 0.001

Questions:

  1. Is this improvement real? Yes (highly significant)
  2. Is this improvement meaningful? Probably not

Why This Happens

Sample size drives p-values:

$$\text{SE} = \frac{\sigma}{\sqrt{n}}$$

As n increases:

  • SE decreases
  • Smaller effects become detectable
  • Eventually, any non-zero effect becomes significant

The simulation below estimates how often the same 0.3% effect clears p < 0.05 at different sample sizes:

import numpy as np
from scipy import stats


def demonstrate_significance_vs_effect():
    """
    Show how significance depends on sample size, not effect size.
    """
    np.random.seed(42)
    effect_size = 0.003  # 0.3% accuracy difference

    results = []
    for n in [1000, 5000, 10000, 50000, 100000]:
        # Simulate many trials
        significant_count = 0
        for _ in range(500):
            # Model A accuracy ~ 87.2%
            acc_a = np.random.binomial(n, 0.872) / n

            # Model B accuracy ~ 87.5%
            acc_b = np.random.binomial(n, 0.872 + effect_size) / n

            # Z-test for difference
            pooled_p = (acc_a + acc_b) / 2
            se = np.sqrt(pooled_p * (1 - pooled_p) * 2 / n)
            z = (acc_b - acc_a) / se
            p_value = 2 * (1 - stats.norm.cdf(abs(z)))

            if p_value < 0.05:
                significant_count += 1

        results.append({
            'n': n,
            'power': significant_count / 500,
            'effect': effect_size
        })

    print("Statistical Significance vs Sample Size")
    print("(Effect size = 0.3% accuracy difference)")
    print("=" * 50)
    print(f"{'Sample Size':>15} {'Power (% significant)':>25}")
    print("-" * 50)
    for r in results:
        print(f"{r['n']:>15,} {r['power']:>25.1%}")

    print("\nSame tiny effect becomes 'significant' with enough data")


demonstrate_significance_vs_effect()
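
The same pattern can also be computed analytically rather than by simulation. The sketch below uses the normal approximation for a two-sided two-proportion z-test; analytic_power is a helper name introduced here, and the equal-sample-size assumption matches the simulation above.

import numpy as np
from scipy import stats


def analytic_power(p_base, delta, n, alpha=0.05):
    """Approximate power of a two-sided z-test comparing two proportions."""
    # Standard error of the difference when each model is scored on n examples
    p_avg = p_base + delta / 2
    se = np.sqrt(2 * p_avg * (1 - p_avg) / n)
    z_crit = stats.norm.ppf(1 - alpha / 2)
    # Under the alternative, the z statistic is approximately Normal(delta / se, 1)
    shift = delta / se
    return (1 - stats.norm.cdf(z_crit - shift)) + stats.norm.cdf(-z_crit - shift)


for n in [1000, 5000, 10000, 50000, 100000]:
    print(f"n = {n:>7,}: power ≈ {analytic_power(0.872, 0.003, n):.1%}")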

Statistical vs. Practical Significance

Definitions

| Concept | Meaning | Measured By |
|---|---|---|
| Statistical significance | Effect unlikely due to chance | p-value < α |
| Practical significance | Effect is meaningful/useful | Effect size, business impact |

The 2×2 Matrix

|  | Practically Significant | Practically Insignificant |
|---|---|---|
| Statistically Significant | Ship it ✓ | Real but useless |
| Not Significant | Might be useful, underpowered | Nothing to see |

What Each Quadrant Means

  1. Sig + Practical: Clear win—effect is real and matters
  2. Sig + Not Practical: Large sample detected tiny effect
  3. Not Sig + Practical: Underpowered—need more data
  4. Not Sig + Not Practical: No evidence of meaningful effect
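
A minimal sketch of the matrix in code, assuming you already have a p-value, an observed effect, and a pre-specified MID (classify_result is a hypothetical helper introduced here):

def classify_result(p_value, effect, mid, alpha=0.05):
    """Place a result in the statistical-vs-practical 2x2 matrix."""
    stat_sig = p_value < alpha
    practical = effect >= mid
    if stat_sig and practical:
        return "Ship it: real and meaningful"
    if stat_sig and not practical:
        return "Real but useless: a large sample detected a tiny effect"
    if practical:
        return "Potentially useful but underpowered: collect more data"
    return "Nothing to see: no evidence of a meaningful effect"


# The large-n 0.3% scenario from above lands in "real but useless"
print(classify_result(p_value=0.0004, effect=0.003, mid=0.02))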

Minimum Important Difference (MID)

Defining Your Threshold

Before evaluation, answer: "What's the smallest improvement worth caring about?"

def define_mid(domain, costs, current_performance):
    """
    Framework for defining minimum important difference.
    """
    considerations = {
        'user_perceptible': "Would users notice this improvement?",
        'business_impact': "How much revenue/cost does 1% improvement represent?",
        'deployment_cost': "What's the cost of deploying the new model?",
        'risk': "What are the risks if the new model has hidden problems?",
        'opportunity_cost': "What else could we work on instead?"
    }

    # Example thresholds by domain
    example_mids = {
        'content_moderation': 0.005,  # 0.5% - safety critical
        'recommendation': 0.02,        # 2% - competitive advantage
        'search_ranking': 0.01,        # 1% - user experience
        'spam_detection': 0.01,        # 1% - user experience
        'llm_quality': 0.03,           # 3% win rate improvement
    }

    return {
        'considerations': considerations,
        'example_thresholds': example_mids
    }


mid_framework = define_mid('search', costs={'deployment': 'high'}, current_performance=0.85)
print("MID Definition Framework")
print("=" * 50)
print("\nConsiderations:")
for key, question in mid_framework['considerations'].items():
    print(f"  • {question}")

print("\nExample MIDs by domain:")
for domain, mid in mid_framework['example_thresholds'].items():
    print(f"  {domain}: {mid:.1%}")

MID in Practice

def evaluate_with_mid(acc_new, acc_baseline, n_examples, mid=0.02, alpha=0.05):
    """
    Evaluate improvement considering both statistical and practical significance.
    """
    diff = acc_new - acc_baseline

    # Statistical test
    pooled_p = (acc_new + acc_baseline) / 2
    se = np.sqrt(pooled_p * (1 - pooled_p) * 2 / n_examples)
    z = diff / se
    p_value = 2 * (1 - stats.norm.cdf(abs(z)))

    # Confidence interval
    ci_lower = diff - 1.96 * se
    ci_upper = diff + 1.96 * se

    # Classification
    statistically_significant = p_value < alpha and diff > 0
    practically_significant = diff >= mid
    ci_exceeds_mid = ci_lower >= mid

    if ci_exceeds_mid:
        conclusion = "Clear improvement (CI entirely above MID)"
        action = "Strong evidence to deploy"
    elif practically_significant and statistically_significant:
        conclusion = "Likely improvement (point estimate above MID, significant)"
        action = "Consider deploying, but CI includes below-MID values"
    elif statistically_significant and not practically_significant:
        conclusion = "Detectable but small improvement"
        action = "Real effect but probably not worth deployment cost"
    elif practically_significant and not statistically_significant:
        conclusion = "Potentially meaningful but uncertain"
        action = "Collect more data"
    else:
        conclusion = "No meaningful improvement detected"
        action = "Do not deploy"

    return {
        'difference': diff,
        'ci': (ci_lower, ci_upper),
        'p_value': p_value,
        'mid': mid,
        'statistically_significant': statistically_significant,
        'practically_significant': practically_significant,
        'conclusion': conclusion,
        'action': action
    }


# Example evaluations
print("Evaluation with MID = 2%")
print("=" * 60)

scenarios = [
    ("Tiny effect, huge n", 0.875, 0.872, 100000),
    ("Moderate effect, moderate n", 0.90, 0.87, 5000),
    ("Large effect, small n", 0.92, 0.87, 500),
    ("At MID, moderate n", 0.89, 0.87, 5000),
]

for name, new, base, n in scenarios:
    result = evaluate_with_mid(new, base, n)
    print(f"\n{name}:")
    print(f"  Accuracy: {base:.1%} → {new:.1%} (Δ = {result['difference']:+.1%})")
    print(f"  95% CI: ({result['ci'][0]:+.1%}, {result['ci'][1]:+.1%})")
    print(f"  p-value: {result['p_value']:.4f}")
    print(f"  Statistical sig: {result['statistically_significant']}")
    print(f"  Practical sig (≥MID): {result['practically_significant']}")
    print(f"  → {result['conclusion']}")
    print(f"  → Action: {result['action']}")

Reporting Framework

What to Report

  1. Effect size (the actual difference)
  2. Confidence interval (range of plausible effects)
  3. p-value (evidence against null)
  4. Pre-specified MID (threshold for caring)
  5. Interpretation (combining all of the above)

Template

## Results

### Primary Metric: Accuracy
- Baseline: 87.2%
- New Model: 87.8%
- **Difference: +0.6%** (95% CI: +0.3% to +0.9%)
- p-value: 0.001
- Pre-specified MID: 1.0%

### Interpretation
The improvement is statistically significant (p = 0.001) but the
confidence interval (0.3% to 0.9%) falls entirely below our
pre-specified minimum important difference of 1.0%.

**Conclusion**: While we can be confident the new model is slightly
better, the improvement is smaller than what we defined as
practically meaningful.

**Recommendation**: Do not deploy based on accuracy alone.
Consider other factors (latency, cost, secondary metrics) or
wait for evidence of larger improvements.
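
To avoid writing this template by hand, a small helper can render it from the evaluate_with_mid result defined earlier. This is a sketch: render_report is a name introduced here, and the sample size of 100,000 is an assumption chosen so the numbers land close to the template above.

def render_report(baseline, new, result):
    """Format an evaluate_with_mid() result as a short markdown report."""
    lo, hi = result['ci']
    lines = [
        "## Results",
        f"- Baseline: {baseline:.1%}",
        f"- New Model: {new:.1%}",
        f"- **Difference: {result['difference']:+.1%}** (95% CI: {lo:+.1%} to {hi:+.1%})",
        f"- p-value: {result['p_value']:.3g}",
        f"- Pre-specified MID: {result['mid']:.1%}",
        "",
        f"**Conclusion**: {result['conclusion']}",
        f"**Recommendation**: {result['action']}",
    ]
    return "\n".join(lines)


print(render_report(0.872, 0.878, evaluate_with_mid(0.878, 0.872, 100000, mid=0.01)))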

Calibrating Expectations

Cohen's Benchmarks for Effect Sizes

| Effect Size | Cohen's d | Interpretation |
|---|---|---|
| Small | 0.2 | Barely noticeable |
| Medium | 0.5 | Noticeable |
| Large | 0.8 | Obvious |

But these are general—calibrate to your domain.
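
Cohen's d is defined for differences in means; for accuracy, which is a proportion, a common analog is Cohen's h, computed on arcsine-transformed proportions and read against the same 0.2/0.5/0.8 benchmarks. A quick sketch applied to the opening scenario:

import numpy as np


def cohens_h(p1, p2):
    """Cohen's h effect size for the difference between two proportions."""
    return 2 * np.arcsin(np.sqrt(p1)) - 2 * np.arcsin(np.sqrt(p2))


h = cohens_h(0.875, 0.872)
print(f"Cohen's h = {h:.3f}")  # ~0.009, far below the 0.2 'small' benchmark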

Domain-Specific Calibration

def calibrate_expectations(historical_improvements, new_improvement):
    """
    Compare new improvement to historical distribution.
    """
    historical = np.array(historical_improvements)

    percentile = np.mean(historical <= new_improvement) * 100

    return {
        'new_improvement': new_improvement,
        'historical_mean': np.mean(historical),
        'historical_median': np.median(historical),
        'historical_std': np.std(historical),
        'percentile': percentile,
        'interpretation': f"This improvement is at least as large as {percentile:.0f}% of historical improvements"
    }


# Example: Your team's historical model improvements
historical = [0.005, 0.012, 0.008, 0.025, 0.003, 0.015, 0.007, 0.02, 0.001, 0.018]
new = 0.015

calibration = calibrate_expectations(historical, new)
print("Calibration Against Historical Improvements")
print("=" * 50)
print(f"Historical: mean={calibration['historical_mean']:.1%}, "
      f"median={calibration['historical_median']:.1%}")
print(f"New improvement: {calibration['new_improvement']:.1%}")
print(f"Percentile: {calibration['percentile']:.0f}th")
print(f"→ {calibration['interpretation']}")

Decision Framework

EVALUATION COMPLETE
       ↓
QUESTION: Is the effect statistically significant?
├── No → Insufficient evidence (need more data or no effect)
└── Yes → Continue
       ↓
QUESTION: Does CI exceed your MID?
├── Yes → Clear practical significance, deploy
└── No → Continue
       ↓
QUESTION: Does point estimate exceed MID?
├── Yes → Probable practical significance, consider deployment
└── No → Continue
       ↓
QUESTION: Is the effect size close to MID?
├── Yes → Borderline, consider costs/benefits carefully
└── No → Statistically real but practically small
       ↓
DECISION:
- Factor in deployment costs
- Consider secondary metrics
- Evaluate risks of new model
- Make explicit cost-benefit judgment
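
The same flow can be expressed as a thin wrapper over the evaluate_with_mid result from earlier; decide is a hypothetical helper, and the borderline case is folded into the final cost-benefit step.

def decide(result):
    """Walk the decision flow for an evaluate_with_mid() result."""
    if not result['statistically_significant']:
        return "Insufficient evidence: collect more data or accept no effect"
    ci_lower, _ = result['ci']
    if ci_lower >= result['mid']:
        return "Clear practical significance: deploy"
    if result['difference'] >= result['mid']:
        return "Probable practical significance: consider deployment"
    # Statistically real but below the MID: make the cost-benefit call explicit
    return "Weigh deployment cost, secondary metrics, and risk before acting"


# The "tiny effect, huge n" scenario from above ends at the cost-benefit step
print(decide(evaluate_with_mid(0.875, 0.872, 100000)))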

Common Mistakes

Mistake 1: Confusing the Two Significances

Wrong: "p < 0.05 means we should ship" Right: "p < 0.05 means the effect is real; we still need to judge if it's useful"

Mistake 2: Post-Hoc MID Definition

Wrong: "The improvement was 1.5%, which is definitely meaningful" Right: Pre-specify MID before evaluation

Mistake 3: Ignoring Confidence Intervals

Wrong: "Model improved by 2%" Right: "Model improved by 2% (95% CI: 0.5% to 3.5%)"

Mistake 4: Binary Thinking

Wrong: "p = 0.049 means significant, p = 0.051 means not" Right: Report exact p-values and let readers interpret


R Implementation

# Evaluate with MID
evaluate_with_mid <- function(acc_new, acc_base, n, mid = 0.02, alpha = 0.05) {
    diff <- acc_new - acc_base
    pooled_p <- (acc_new + acc_base) / 2
    se <- sqrt(pooled_p * (1 - pooled_p) * 2 / n)

    z <- diff / se
    p_value <- 2 * pnorm(-abs(z))

    ci <- c(diff - 1.96 * se, diff + 1.96 * se)

    list(
        difference = diff,
        ci = ci,
        p_value = p_value,
        stat_sig = p_value < alpha & diff > 0,
        practical_sig = diff >= mid,
        ci_exceeds_mid = ci[1] >= mid
    )
}


Key Takeaway

Statistical significance tells you an effect is real; practical significance tells you it matters. With large evaluation sets, even tiny improvements become statistically significant—a 0.3% accuracy gain with p < 0.001 is undeniably real but probably not worth deploying. Pre-specify your minimum important difference before evaluation, report confidence intervals so readers can judge both types of significance, and make explicit cost-benefit judgments. The question isn't just "is this improvement real?" but "is this improvement worth acting on?"


References

  1. https://doi.org/10.1177/2515245918770963
  2. https://www.jstor.org/stable/3001666
  3. https://arxiv.org/abs/1903.06372
  4. Wasserstein, R. L., & Lazar, N. A. (2016). The ASA statement on p-values: Context, process, and purpose. *The American Statistician*, 70(2), 129-133.
  5. Cohen, J. (1988). *Statistical Power Analysis for the Behavioral Sciences* (2nd ed.). Lawrence Erlbaum.
  6. Kelley, K., & Preacher, K. J. (2012). On effect size. *Psychological Methods*, 17(2), 137-152.

Frequently Asked Questions

What's a good threshold for 'meaningful' improvement?
It depends on your domain, costs, and users. For accuracy: 1-2% absolute might matter for high-stakes applications, 5%+ for lower stakes. For win rate: 55%+ (model wins significantly more) is often the bar. Define your threshold based on business impact, not statistical convenience.
My p-value is tiny but the effect is small—what do I do?
Report it honestly: 'Statistically significant improvement of 0.3% (95% CI: 0.1% to 0.5%, p<0.001). This difference is unlikely due to chance but may not be practically meaningful given [costs/user perception/etc.].' Let stakeholders make the decision with full information.
How do I determine what effect size is meaningful?
Work backwards from impact: How much improvement would change a user's experience noticeably? How much improvement would justify the cost of deployment? What improvement would move a business metric? Use pilot studies, expert judgment, or historical data to calibrate expectations.

