Model Evaluation

Multiple Prompts and Metrics: Controlling False Discoveries in Evals

When evaluating models across many prompts or metrics, false positives multiply. Learn how to control false discovery rate and make defensible claims about model improvements.


Quick Hits

  • Test 20 metrics at α=0.05: expect 1 false positive even with no real effect
  • Bonferroni: multiply p-values by number of tests (conservative)
  • FDR control: accept that some discoveries are false, but limit the proportion
  • Pre-specify primary metrics to avoid correction entirely
  • Report raw AND adjusted p-values for transparency

TL;DR

Evaluating models on many prompts or metrics inflates false positive rates. Testing 20 metrics at α=0.05 means you expect one false positive even with no real improvements. Multiple comparison corrections—Bonferroni (conservative) or FDR methods (balanced)—control this inflation. Better yet: pre-specify primary metrics to limit tests. This guide covers when and how to correct for multiple comparisons in model evaluation.


The Multiple Testing Problem

Why It Matters

You're evaluating a new LLM across 10 task categories and 5 metrics (50 tests total).

At α = 0.05 with no real differences:

  • Each test has 5% chance of false positive
  • Expected false positives = 50 × 0.05 = 2.5
  • Probability of at least one false positive ≈ 92%

The Math

Family-wise error rate (FWER): P(at least one false positive)

For m independent tests at level α: $$\text{FWER} = 1 - (1-\alpha)^m$$

| Tests (m) | FWER at α=0.05 |
|-----------|----------------|
| 1 | 5% |
| 5 | 23% |
| 10 | 40% |
| 20 | 64% |
| 50 | 92% |
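
These numbers follow directly from the FWER formula above. As a quick sanity check, the sketch below reproduces them analytically and confirms them with a small Monte Carlo simulation under the global null, where every p-value is Uniform(0, 1):

import numpy as np

rng = np.random.default_rng(0)
alpha = 0.05

for m in [1, 5, 10, 20, 50]:
    # Analytical FWER for m independent tests
    fwer = 1 - (1 - alpha) ** m

    # Monte Carlo: under the global null, p-values are Uniform(0, 1)
    p = rng.uniform(size=(20_000, m))
    simulated = (p < alpha).any(axis=1).mean()

    print(f"m={m:>3}  analytical FWER={fwer:.1%}  simulated={simulated:.1%}")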

Correction Methods

Bonferroni Correction

Method: Divide α by number of tests (or multiply p-values by m)

$$\alpha_{adj} = \frac{\alpha}{m}$$

import numpy as np
from scipy import stats


def bonferroni_correction(p_values, alpha=0.05):
    """
    Bonferroni correction for multiple comparisons.

    Controls FWER: P(any false positive) ≤ α
    """
    m = len(p_values)
    adjusted = np.minimum(np.array(p_values) * m, 1.0)

    return {
        'original': p_values,
        'adjusted': adjusted.tolist(),
        'significant_original': [p < alpha for p in p_values],
        'significant_adjusted': [p < alpha for p in adjusted],
        'threshold_per_test': alpha / m,
        'method': 'Bonferroni'
    }


# Example: p-values from 10 metrics evaluated on the same comparison
p_values = [0.001, 0.012, 0.023, 0.034, 0.045, 0.078, 0.089, 0.12, 0.45, 0.78]

result = bonferroni_correction(p_values)

print("Bonferroni Correction")
print("=" * 60)
print(f"Number of tests: {len(p_values)}")
print(f"Original α: 0.05")
print(f"Adjusted α per test: {result['threshold_per_test']:.4f}")
print(f"\n{'Metric':<10} {'Raw p':>10} {'Adjusted p':>12} {'Sig (raw)':>12} {'Sig (adj)':>12}")
print("-" * 60)

for i, (raw, adj, sig_r, sig_a) in enumerate(zip(
    result['original'], result['adjusted'],
    result['significant_original'], result['significant_adjusted']
)):
    print(f"Metric {i+1:<3} {raw:>10.4f} {adj:>12.4f} {str(sig_r):>12} {str(sig_a):>12}")

print(f"\nSignificant (raw): {sum(result['significant_original'])}")
print(f"Significant (adjusted): {sum(result['significant_adjusted'])}")

Holm-Bonferroni (Step-Down)

Method: Less conservative than Bonferroni, still controls FWER

def holm_correction(p_values, alpha=0.05):
    """
    Holm-Bonferroni step-down procedure.

    More powerful than Bonferroni while controlling FWER.
    """
    m = len(p_values)
    p_array = np.array(p_values)

    # Sort p-values (keep track of original indices)
    sorted_idx = np.argsort(p_array)
    sorted_p = p_array[sorted_idx]

    # Adjusted p-values
    adjusted = np.zeros(m)

    for i, idx in enumerate(sorted_idx):
        multiplier = m - i
        adjusted[idx] = sorted_p[i] * multiplier

    # Enforce monotonicity
    adjusted_sorted = adjusted[sorted_idx]
    for i in range(1, m):
        if adjusted_sorted[i] < adjusted_sorted[i-1]:
            adjusted_sorted[i] = adjusted_sorted[i-1]
    adjusted[sorted_idx] = adjusted_sorted

    adjusted = np.minimum(adjusted, 1.0)

    return {
        'original': p_values,
        'adjusted': adjusted.tolist(),
        'significant': [p < alpha for p in adjusted],
        'method': 'Holm-Bonferroni'
    }


# Compare Bonferroni vs Holm
bonf = bonferroni_correction(p_values)
holm = holm_correction(p_values)

print("\nBonferroni vs Holm Comparison")
print("=" * 50)
print(f"Significant (Bonferroni): {sum(bonf['significant_adjusted'])}")
print(f"Significant (Holm): {sum(holm['significant'])}")
print("\nHolm is less conservative but still controls FWER")

Benjamini-Hochberg (FDR Control)

Method: Controls false discovery rate, not FWER

def benjamini_hochberg(p_values, alpha=0.05):
    """
    Benjamini-Hochberg procedure for FDR control.

    FDR = E[false discoveries / total discoveries]
    """
    m = len(p_values)
    p_array = np.array(p_values)

    # Sort p-values
    sorted_idx = np.argsort(p_array)
    sorted_p = p_array[sorted_idx]

    # BH threshold: p_(i) ≤ i/m * α
    thresholds = np.arange(1, m + 1) / m * alpha

    # Find largest k where p_(k) ≤ threshold_k
    significant = np.zeros(m, dtype=bool)
    k = 0
    for i in range(m):
        if sorted_p[i] <= thresholds[i]:
            k = i + 1

    # All p-values up to k are significant
    significant[sorted_idx[:k]] = True

    # Adjusted p-values (q-values)
    adjusted = np.zeros(m)
    for i, idx in enumerate(sorted_idx):
        adjusted[idx] = sorted_p[i] * m / (i + 1)

    # Enforce monotonicity (descending order)
    for i in range(m - 2, -1, -1):
        idx = sorted_idx[i]
        next_idx = sorted_idx[i + 1]
        adjusted[idx] = min(adjusted[idx], adjusted[next_idx])

    adjusted = np.minimum(adjusted, 1.0)

    return {
        'original': p_values,
        'adjusted': adjusted.tolist(),  # q-values
        'significant': significant.tolist(),
        'n_significant': int(significant.sum()),
        # Rough estimate: number of discoveries × FDR level
        'expected_false_discoveries': float(significant.sum() * alpha),
        'method': 'Benjamini-Hochberg (FDR)'
    }


# Example
bh = benjamini_hochberg(p_values)

print("\nBenjamini-Hochberg FDR Control")
print("=" * 50)
print(f"Significant discoveries: {bh['n_significant']}")
print(f"Expected false discoveries (at FDR=0.05): {bh['expected_false_discoveries']:.1f}")

print(f"\n{'Metric':<10} {'Raw p':>10} {'q-value':>10} {'Significant':>12}")
print("-" * 45)
for i, (raw, q, sig) in enumerate(zip(bh['original'], bh['adjusted'], bh['significant'])):
    print(f"Metric {i+1:<3} {raw:>10.4f} {q:>10.4f} {str(sig):>12}")

Comparison of Methods

def compare_all_methods(p_values, alpha=0.05):
    """
    Compare all correction methods.
    """
    bonf = bonferroni_correction(p_values, alpha)
    holm = holm_correction(p_values, alpha)
    bh = benjamini_hochberg(p_values, alpha)

    return {
        'n_tests': len(p_values),
        'n_raw_significant': sum(p < alpha for p in p_values),
        'n_bonferroni': sum(bonf['significant_adjusted']),
        'n_holm': sum(holm['significant']),
        'n_bh': bh['n_significant'],
        'methods': {
            'bonferroni': bonf,
            'holm': holm,
            'bh': bh
        }
    }


comparison = compare_all_methods(p_values)

print("Method Comparison Summary")
print("=" * 50)
print(f"Total tests: {comparison['n_tests']}")
print(f"Significant (uncorrected): {comparison['n_raw_significant']}")
print(f"Significant (Bonferroni): {comparison['n_bonferroni']}")
print(f"Significant (Holm): {comparison['n_holm']}")
print(f"Significant (BH/FDR): {comparison['n_bh']}")

| Method | Controls | Conservativeness | Use When |
|--------|----------|------------------|----------|
| None | Nothing | Liberal | Pre-specified single test |
| Bonferroni | FWER | Very conservative | High stakes, few tests |
| Holm | FWER | Conservative | Anywhere Bonferroni applies (uniformly more powerful) |
| BH (FDR) | FDR | Moderate | Exploratory analyses, many tests |
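
If you would rather not maintain hand-rolled corrections, the same adjustments are available in statsmodels (an extra dependency, not used elsewhere in this post). A minimal cross-check against the functions above, reusing the p_values list from the earlier example, might look like this:

import numpy as np
from statsmodels.stats.multitest import multipletests

for method in ["bonferroni", "holm", "fdr_bh"]:
    reject, p_adj, _, _ = multipletests(p_values, alpha=0.05, method=method)
    print(f"{method:<12} significant: {int(reject.sum()):>2}  adjusted p: {np.round(p_adj, 3)}")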

Pre-Specification Strategy

The Best Correction: Fewer Tests

def pre_specified_analysis(metrics_dict, primary_metrics, secondary_metrics=None, alpha=0.05):
    """
    Analysis with pre-specified primary and secondary metrics.

    Primary metrics: no correction (or minimal)
    Secondary metrics: FDR correction
    """
    results = {
        'primary': {},
        'secondary': {}
    }

    # Primary metrics: test at full alpha (pre-specified)
    for name in primary_metrics:
        p = metrics_dict[name]
        results['primary'][name] = {
            'p_value': p,
            'significant': p < alpha,
            'note': 'Primary (no correction needed)'
        }

    # Secondary metrics: FDR correction
    if secondary_metrics:
        secondary_p = [metrics_dict[name] for name in secondary_metrics]
        bh = benjamini_hochberg(secondary_p, alpha)

        for i, name in enumerate(secondary_metrics):
            results['secondary'][name] = {
                'p_value': metrics_dict[name],
                'q_value': bh['adjusted'][i],
                'significant': bh['significant'][i],
                'note': 'Secondary (FDR-controlled)'
            }

    return results


# Example: Pre-specified evaluation plan
metrics = {
    'accuracy': 0.023,
    'f1': 0.034,
    'auc': 0.012,
    'precision': 0.089,
    'recall': 0.045,
    'specificity': 0.156,
    'mcc': 0.067,
    'brier': 0.234
}

# Pre-registration: AUC is primary, others are exploratory
results = pre_specified_analysis(
    metrics,
    primary_metrics=['auc'],
    secondary_metrics=['accuracy', 'f1', 'precision', 'recall', 'mcc']
)

print("Pre-Specified Analysis")
print("=" * 50)
print("\nPrimary Metric (no correction):")
for name, r in results['primary'].items():
    print(f"  {name}: p={r['p_value']:.3f}, significant={r['significant']}")

print("\nSecondary Metrics (FDR-controlled):")
for name, r in results['secondary'].items():
    print(f"  {name}: p={r['p_value']:.3f}, q={r['q_value']:.3f}, significant={r['significant']}")

Practical Guidelines

When to Correct

| Scenario | Correction |
|----------|------------|
| Pre-specified single primary metric | None |
| 2-3 pre-specified primary metrics | Holm (or none with clear pre-registration) |
| Many metrics, exploratory | BH (FDR) |
| High-stakes (safety, regulatory) | Bonferroni |
| Internal iteration | None (but track patterns) |
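
One way to make these defaults concrete is a small helper that encodes the table. The sketch below is purely illustrative (the function and argument names are ours, not from any library) and should be read as a starting point, not a rule:

def suggest_correction(n_tests, pre_specified=False, exploratory=False,
                       high_stakes=False, internal=False):
    """Illustrative default policy mirroring the table above."""
    if internal:
        return "none"        # internal iteration: skip correction, but track patterns
    if high_stakes:
        return "bonferroni"  # strict FWER control for safety/regulatory claims
    if pre_specified and n_tests == 1:
        return "none"        # single pre-registered primary metric
    if pre_specified and n_tests <= 3:
        return "holm"        # small pre-specified family, FWER still controlled
    if exploratory:
        return "fdr_bh"      # many metrics: control the proportion of false discoveries
    return "fdr_bh"          # reasonable default when in doubt


print(suggest_correction(n_tests=1, pre_specified=True))   # none
print(suggest_correction(n_tests=50, exploratory=True))    # fdr_bh
print(suggest_correction(n_tests=5, high_stakes=True))     # bonferroni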

Reporting Template

## Statistical Methods

### Primary Analysis
We pre-specified AUC as the primary metric. Results are reported
at α=0.05 without multiple comparison correction.

### Secondary Analysis
Eight secondary metrics were evaluated with Benjamini-Hochberg
FDR control at q=0.05. Expected false discoveries among
reported significant results: <1.

### Results
| Metric | p-value | Adjusted (q) | Significant |
|--------|---------|--------------|-------------|
| AUC* | 0.012 | N/A | Yes |
| Accuracy | 0.023 | 0.068 | No |
| F1 | 0.034 | 0.092 | No |

*Primary metric

R Implementation

# Same p-values as in the Python examples above
p_values <- c(0.001, 0.012, 0.023, 0.034, 0.045, 0.078, 0.089, 0.12, 0.45, 0.78)

# Bonferroni
p.adjust(p_values, method = "bonferroni")

# Holm
p.adjust(p_values, method = "holm")

# Benjamini-Hochberg
p.adjust(p_values, method = "BH")

# All methods
sapply(c("none", "bonferroni", "holm", "BH"),
       function(m) sum(p.adjust(p_values, method = m) < 0.05))


Key Takeaway

Multiple testing is unavoidable in model evaluation—you want comprehensive information. But each test adds false positive risk. Manage this through: (1) pre-specification to limit the number of comparisons requiring correction, (2) FDR control (Benjamini-Hochberg) for exploratory analyses, and (3) transparent reporting of both raw and adjusted p-values. The goal isn't to eliminate false positives—it's to know and control their rate. "Of our 6 significant findings, we expect fewer than 1 to be false (FDR < 17%)" is a defensible claim.


References

  1. Benjamini, Y., & Hochberg, Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. *Journal of the Royal Statistical Society: Series B*, 57(1), 289-300. https://doi.org/10.1111/j.2517-6161.1995.tb02031.x
  2. Holm, S. (1979). A simple sequentially rejective multiple test procedure. *Scandinavian Journal of Statistics*, 6(2), 65-70.
  3. Wasserstein, R. L., Schirm, A. L., & Lazar, N. A. (2019). Moving to a world beyond "p < 0.05". *The American Statistician*, 73(sup1), 1-19.
  4. https://www.jstor.org/stable/2346101
  5. https://arxiv.org/abs/2009.03196

Frequently Asked Questions

When should I use Bonferroni vs. FDR correction?
Use Bonferroni when you need high confidence in every significant result (regulatory, high-stakes). Use FDR when you're okay with some false positives as long as the proportion is controlled—typical for exploratory evaluation where you'll validate findings later.
Can I avoid multiple comparison correction?
Yes, by pre-specifying a single primary metric or a small set of primary metrics. Correction is for tests you run, not tests you could have run. If you pre-commit to evaluating only AUC and F1, you only adjust for 2 tests. But you must pre-commit before seeing data.
How many tests is 'too many'?
There's no hard cutoff. The issue is the expected number of false positives: tests × α. At α=0.05 with 100 tests, you expect 5 false positives. If that's acceptable for your use case, proceed with caution. If not, correct or reduce the number of tests.

