Model Evaluation

Bootstrap for Metric Deltas: AUC, F1, and Other ML Metrics

How to compute confidence intervals and p-values for differences in ML metrics like AUC, F1, and precision. Learn paired bootstrap for defensible model comparisons.

Quick Hits

  • Most ML metrics don't have closed-form variance—bootstrap is the answer
  • Paired bootstrap: resample examples, compute metric for both models on same resample
  • 5000+ resamples for CI bounds, 10000+ for p-values
  • Check if CI excludes zero, not just the point estimate
  • Report the uncertainty: 'AUC improved by 0.02 (95% CI: 0.005 to 0.035)'

TL;DR

ML metrics like AUC, F1, and precision don't have simple variance formulas. Bootstrap provides the solution: resample your test data, compute metrics on each resample, and use the distribution for inference. For comparing models, use paired bootstrap—resample examples while keeping both models' predictions together. This gives you confidence intervals and p-values for metric differences that are defensible and accurate.


Why Bootstrap for ML Metrics?

The Problem

You computed AUC = 0.85 for Model A and AUC = 0.83 for Model B.

Questions:

  • Is that 0.02 difference statistically significant?
  • What's the confidence interval?
  • Could it be sampling noise?

Why Not Analytical Formulas?

Metric            Analytical Variance    Complexity
Accuracy          Yes (binomial)         Simple
AUC               Yes (DeLong)           Moderate
F1                No                     Complex
Precision@k       No                     Complex
Custom metrics    No                     Varies

Bootstrap handles all of these uniformly.
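
To make that concrete, here is a minimal sketch of the same uniform recipe applied to a custom metric with no textbook variance formula. Both precision_at_k and bootstrap_ci are hypothetical helpers written for this illustration, not library functions:

import numpy as np


def precision_at_k(y_true, scores, k=100):
    """Precision among the k highest-scored examples (illustrative custom metric)."""
    top_k = np.argsort(scores)[::-1][:k]
    return np.mean(y_true[top_k])


def bootstrap_ci(y_true, scores, metric_fn, n_bootstrap=5000, seed=0):
    """Generic 95% percentile-bootstrap CI for any metric_fn(y_true, scores)."""
    rng = np.random.default_rng(seed)
    n = len(y_true)
    boot = [metric_fn(y_true[idx], scores[idx])
            for idx in (rng.choice(n, n, replace=True) for _ in range(n_bootstrap))]
    return np.percentile(boot, [2.5, 97.5])


# Toy usage: the same loop would accept roc_auc_score or f1_score unchanged.
rng = np.random.default_rng(0)
y = rng.binomial(1, 0.3, 1000)
scores = np.clip(0.25 * y + rng.random(1000), 0, 1)
print(bootstrap_ci(y, scores, precision_at_k))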


Paired Bootstrap for Model Comparison

The Algorithm

import numpy as np
from sklearn.metrics import roc_auc_score, f1_score, precision_score, recall_score


def paired_bootstrap_comparison(y_true, pred_a, pred_b, metric_fn,
                                  n_bootstrap=5000, seed=None):
    """
    Paired bootstrap comparison of two models.

    Parameters:
    -----------
    y_true : array
        True labels
    pred_a : array
        Model A predictions (probabilities for AUC, class labels for F1)
    pred_b : array
        Model B predictions
    metric_fn : callable
        Function(y_true, y_pred) -> metric value
    n_bootstrap : int
        Number of bootstrap resamples
    seed : int, optional
        Random seed for reproducibility (sets the global NumPy seed)

    Returns:
    --------
    dict with metrics, difference CI, and p-value
    """
    if seed is not None:
        np.random.seed(seed)

    n = len(y_true)

    # Original metrics
    metric_a = metric_fn(y_true, pred_a)
    metric_b = metric_fn(y_true, pred_b)
    diff_observed = metric_a - metric_b

    # Bootstrap
    boot_diffs = []

    for _ in range(n_bootstrap):
        # Sample indices WITH replacement
        idx = np.random.choice(n, n, replace=True)

        # Compute metrics on same resample for both models
        boot_a = metric_fn(y_true[idx], pred_a[idx])
        boot_b = metric_fn(y_true[idx], pred_b[idx])

        boot_diffs.append(boot_a - boot_b)

    boot_diffs = np.array(boot_diffs)

    # Confidence interval (percentile method)
    ci_lower = np.percentile(boot_diffs, 2.5)
    ci_upper = np.percentile(boot_diffs, 97.5)

    # P-value (two-sided test against H0: diff = 0)
    if diff_observed > 0:
        p_value = 2 * np.mean(boot_diffs <= 0)
    else:
        p_value = 2 * np.mean(boot_diffs >= 0)
    p_value = min(p_value, 1.0)

    # Alternative: check if CI excludes 0
    significant_by_ci = ci_lower > 0 or ci_upper < 0

    return {
        'metric_a': metric_a,
        'metric_b': metric_b,
        'difference': diff_observed,
        'se': np.std(boot_diffs),
        'ci_lower': ci_lower,
        'ci_upper': ci_upper,
        'p_value': p_value,
        'significant': significant_by_ci,
        'n_bootstrap': n_bootstrap
    }


# Example: AUC comparison
np.random.seed(42)
n = 1000

# Simulate data
y_true = np.random.binomial(1, 0.3, n)  # 30% positive

# Model A: Good model
prob_a = y_true * np.random.beta(8, 2, n) + (1 - y_true) * np.random.beta(2, 8, n)

# Model B: Slightly worse
prob_b = y_true * np.random.beta(7, 2, n) + (1 - y_true) * np.random.beta(2, 7, n)

result = paired_bootstrap_comparison(
    y_true, prob_a, prob_b,
    metric_fn=roc_auc_score,
    n_bootstrap=5000,
    seed=42
)

print("Bootstrap AUC Comparison")
print("=" * 50)
print(f"Model A AUC: {result['metric_a']:.4f}")
print(f"Model B AUC: {result['metric_b']:.4f}")
print(f"Difference (A - B): {result['difference']:.4f}")
print(f"Bootstrap SE: {result['se']:.4f}")
print(f"95% CI: ({result['ci_lower']:.4f}, {result['ci_upper']:.4f})")
print(f"p-value: {result['p_value']:.4f}")
print(f"Significant (CI excludes 0): {result['significant']}")

Multiple Metrics

Comparing Across Metrics

def multi_metric_comparison(y_true, pred_a, pred_b, prob_a=None, prob_b=None,
                              n_bootstrap=5000, seed=None):
    """
    Compare models across multiple metrics.

    Uses class predictions for F1/precision/recall, probabilities for AUC.
    """
    if seed is not None:
        np.random.seed(seed)

    results = {}

    # Metrics with class predictions
    class_metrics = {
        'accuracy': lambda y, p: np.mean(y == p),
        'f1': lambda y, p: f1_score(y, p, zero_division=0),
        'precision': lambda y, p: precision_score(y, p, zero_division=0),
        'recall': lambda y, p: recall_score(y, p, zero_division=0)
    }

    for name, metric_fn in class_metrics.items():
        result = paired_bootstrap_comparison(
            y_true, pred_a, pred_b,
            metric_fn=metric_fn,
            n_bootstrap=n_bootstrap
        )
        results[name] = result

    # AUC (needs probabilities)
    if prob_a is not None and prob_b is not None:
        result = paired_bootstrap_comparison(
            y_true, prob_a, prob_b,
            metric_fn=roc_auc_score,
            n_bootstrap=n_bootstrap
        )
        results['auc'] = result

    return results


# Example
np.random.seed(42)
n = 1000

y_true = np.random.binomial(1, 0.3, n)

# Probabilities
prob_a = y_true * np.random.beta(8, 2, n) + (1 - y_true) * np.random.beta(2, 8, n)
prob_b = y_true * np.random.beta(7, 3, n) + (1 - y_true) * np.random.beta(3, 7, n)

# Class predictions (threshold 0.5)
pred_a = (prob_a > 0.5).astype(int)
pred_b = (prob_b > 0.5).astype(int)

results = multi_metric_comparison(y_true, pred_a, pred_b, prob_a, prob_b)

print("Multi-Metric Comparison")
print("=" * 70)
print(f"{'Metric':<12} {'A':>10} {'B':>10} {'Diff':>10} {'95% CI':>20} {'p-value':>10}")
print("-" * 70)

for name, r in results.items():
    ci_str = f"({r['ci_lower']:.3f}, {r['ci_upper']:.3f})"
    print(f"{name:<12} {r['metric_a']:>10.3f} {r['metric_b']:>10.3f} "
          f"{r['difference']:>+10.3f} {ci_str:>20} {r['p_value']:>10.4f}")

Handling Threshold-Dependent Metrics

For F1, precision, recall—the threshold matters:

def bootstrap_at_threshold(y_true, prob_a, prob_b, threshold=0.5, n_bootstrap=5000):
    """
    Bootstrap comparison with explicit threshold.
    """
    pred_a = (prob_a >= threshold).astype(int)
    pred_b = (prob_b >= threshold).astype(int)

    return paired_bootstrap_comparison(
        y_true, pred_a, pred_b,
        metric_fn=f1_score,
        n_bootstrap=n_bootstrap
    )


def bootstrap_across_thresholds(y_true, prob_a, prob_b,
                                 thresholds=[0.3, 0.4, 0.5, 0.6, 0.7]):
    """
    Compare F1 at multiple thresholds.
    """
    results = []
    for t in thresholds:
        result = bootstrap_at_threshold(y_true, prob_a, prob_b, threshold=t)
        result['threshold'] = t
        results.append(result)
    return results


# Example
threshold_results = bootstrap_across_thresholds(y_true, prob_a, prob_b)

print("\nF1 Comparison Across Thresholds")
print("=" * 60)
print(f"{'Threshold':>10} {'F1 A':>10} {'F1 B':>10} {'Diff':>10} {'Significant':>12}")
print("-" * 60)

for r in threshold_results:
    print(f"{r['threshold']:>10.1f} {r['metric_a']:>10.3f} {r['metric_b']:>10.3f} "
          f"{r['difference']:>+10.3f} {str(r['significant']):>12}")

BCa Bootstrap (More Accurate CIs)

from scipy import stats


def bca_bootstrap_comparison(y_true, pred_a, pred_b, metric_fn, n_bootstrap=5000):
    """
    BCa (bias-corrected and accelerated) bootstrap for metric comparison.

    More accurate CIs than percentile method for skewed distributions.
    """
    n = len(y_true)

    # Original difference
    diff_obs = metric_fn(y_true, pred_a) - metric_fn(y_true, pred_b)

    # Standard bootstrap
    boot_diffs = []
    for _ in range(n_bootstrap):
        idx = np.random.choice(n, n, replace=True)
        boot_diff = metric_fn(y_true[idx], pred_a[idx]) - metric_fn(y_true[idx], pred_b[idx])
        boot_diffs.append(boot_diff)
    boot_diffs = np.array(boot_diffs)

    # Bias correction (z0)
    prop_less = np.mean(boot_diffs < diff_obs)
    z0 = stats.norm.ppf(prop_less) if 0 < prop_less < 1 else 0

    # Acceleration (jackknife)
    jack_diffs = []
    for i in range(n):
        idx = np.concatenate([np.arange(i), np.arange(i+1, n)])
        jack_diff = metric_fn(y_true[idx], pred_a[idx]) - metric_fn(y_true[idx], pred_b[idx])
        jack_diffs.append(jack_diff)

    jack_diffs = np.array(jack_diffs)
    jack_mean = jack_diffs.mean()
    num = np.sum((jack_mean - jack_diffs)**3)
    denom = 6 * (np.sum((jack_mean - jack_diffs)**2)**1.5)
    a = num / denom if denom != 0 else 0

    # BCa adjusted percentiles
    alpha = 0.05
    z_low = stats.norm.ppf(alpha / 2)
    z_high = stats.norm.ppf(1 - alpha / 2)

    def bca_quantile(z):
        return stats.norm.cdf(z0 + (z0 + z) / (1 - a * (z0 + z)))

    p_low = bca_quantile(z_low)
    p_high = bca_quantile(z_high)

    ci_lower = np.percentile(boot_diffs, 100 * p_low)
    ci_upper = np.percentile(boot_diffs, 100 * p_high)

    return {
        'difference': diff_obs,
        'ci_lower_percentile': np.percentile(boot_diffs, 2.5),
        'ci_upper_percentile': np.percentile(boot_diffs, 97.5),
        'ci_lower_bca': ci_lower,
        'ci_upper_bca': ci_upper,
        'bias_correction': z0,
        'acceleration': a
    }


# Example
result = bca_bootstrap_comparison(y_true, prob_a, prob_b, roc_auc_score)

print("BCa vs Percentile Bootstrap")
print("=" * 50)
print(f"Difference: {result['difference']:.4f}")
print(f"\nPercentile CI: ({result['ci_lower_percentile']:.4f}, {result['ci_upper_percentile']:.4f})")
print(f"BCa CI:        ({result['ci_lower_bca']:.4f}, {result['ci_upper_bca']:.4f})")

Single Model Confidence Interval

Sometimes you just want CI for one metric:

def bootstrap_single_metric(y_true, predictions, metric_fn, n_bootstrap=5000):
    """
    Bootstrap CI for a single model's metric.
    """
    n = len(y_true)
    metric_obs = metric_fn(y_true, predictions)

    boot_metrics = []
    for _ in range(n_bootstrap):
        idx = np.random.choice(n, n, replace=True)
        boot_metric = metric_fn(y_true[idx], predictions[idx])
        boot_metrics.append(boot_metric)

    boot_metrics = np.array(boot_metrics)

    return {
        'metric': metric_obs,
        'se': np.std(boot_metrics),
        'ci_lower': np.percentile(boot_metrics, 2.5),
        'ci_upper': np.percentile(boot_metrics, 97.5)
    }


# Example
result = bootstrap_single_metric(y_true, prob_a, roc_auc_score)
print(f"AUC: {result['metric']:.4f} (95% CI: {result['ci_lower']:.4f} to {result['ci_upper']:.4f})")

R Implementation

library(boot)
library(pROC)

# Bootstrap for AUC difference
boot_auc_diff <- function(data, indices) {
    d <- data[indices, ]
    auc_a <- auc(d$y_true, d$prob_a)
    auc_b <- auc(d$y_true, d$prob_b)
    as.numeric(auc_a - auc_b)
}

data <- data.frame(y_true = y_true, prob_a = prob_a, prob_b = prob_b)
boot_result <- boot(data, boot_auc_diff, R = 5000)

# BCa confidence interval
boot.ci(boot_result, type = "bca")

# DeLong test for AUC (analytical)
roc.test(roc(y_true, prob_a), roc(y_true, prob_b), method = "delong")

Reporting Template

## Model Comparison Results

### AUC
- Model A: 0.856 (95% CI: 0.831-0.879)
- Model B: 0.842 (95% CI: 0.815-0.867)
- Difference: +0.014 (95% CI: -0.002 to +0.030)
- p-value: 0.089
- **Interpretation**: Model A shows a trend toward better discrimination,
  but the difference is not statistically significant at α=0.05.

### F1 Score (threshold = 0.5)
- Model A: 0.721 (95% CI: 0.685-0.755)
- Model B: 0.698 (95% CI: 0.661-0.734)
- Difference: +0.023 (95% CI: -0.008 to +0.054)
- p-value: 0.142

### Methodology
Bootstrap confidence intervals computed using 5,000 paired resamples
with BCa correction. P-values computed as twice the proportion of bootstrap
resamples whose difference falls on the opposite side of zero from the
observed difference (two-sided).
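
A small formatting helper along these lines can turn the dict returned by paired_bootstrap_comparison into one bullet of this template (format_comparison is a hypothetical name used only for illustration):

def format_comparison(name, r):
    """Render one paired_bootstrap_comparison result as a report bullet (illustrative)."""
    return (f"- {name} difference (A - B): {r['difference']:+.3f} "
            f"(95% CI: {r['ci_lower']:+.3f} to {r['ci_upper']:+.3f}), "
            f"p-value: {r['p_value']:.3f}")


# Example, reusing the multi-metric results computed earlier
print(format_comparison("AUC", results['auc']))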


Frequently Asked Questions

Why bootstrap instead of DeLong's test for AUC?
DeLong's test is valid and efficient for AUC specifically. But bootstrap is general—works for F1, precision, recall, and any custom metric. Use DeLong for AUC if available; use bootstrap when you need flexibility or when comparing multiple metrics.
Should I use paired or unpaired bootstrap?
Paired when comparing models on the same examples (almost always in model evaluation). Resample examples, not predictions separately. This preserves the pairing and gives correct inference for 'is A better than B on these examples?'
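A minimal sketch of the distinction on toy arrays (all names here are illustrative):

import numpy as np

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)          # toy labels
scores_a = rng.random(200)                # toy predictions, model A
scores_b = rng.random(200)                # toy predictions, model B
n = len(y)

# Paired: one index vector applied to the labels AND both prediction vectors,
# so every resampled example keeps its label and both models' predictions.
idx = rng.choice(n, n, replace=True)
paired = (y[idx], scores_a[idx], scores_b[idx])

# Unpaired (avoid for same-test-set comparisons): independent index vectors
# break the example-level pairing and inflate the variance of the difference.
idx_a = rng.choice(n, n, replace=True)
idx_b = rng.choice(n, n, replace=True)
unpaired = (y[idx_a], scores_a[idx_a], y[idx_b], scores_b[idx_b])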
How do I get a p-value from bootstrap?
For H0: diff = 0, compute the proportion of bootstrap differences on the opposite side of zero from your observed difference. Double it for two-sided. Alternatively, check if the bootstrap CI excludes zero—this is equivalent to p < alpha.
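As a sketch, with a toy normal distribution standing in for the bootstrap differences:

import numpy as np

rng = np.random.default_rng(0)
boot_diffs = rng.normal(loc=0.02, scale=0.01, size=10_000)  # toy stand-in for bootstrap diffs
diff_observed = 0.02

# Two-sided p-value: twice the proportion of resamples on the opposite
# side of zero from the observed difference, capped at 1.
if diff_observed > 0:
    p_value = 2 * np.mean(boot_diffs <= 0)
else:
    p_value = 2 * np.mean(boot_diffs >= 0)
p_value = min(p_value, 1.0)

# Equivalent check at alpha = 0.05: does the 95% percentile CI exclude zero?
ci_lower, ci_upper = np.percentile(boot_diffs, [2.5, 97.5])
print(p_value, ci_lower > 0 or ci_upper < 0)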

Key Takeaway

Bootstrap is the standard tool for uncertainty quantification in ML metrics. It handles any metric, works for paired comparisons, and makes minimal assumptions. For model comparison: resample examples (preserving pairs), compute the metric difference on each resample, and use the distribution for CIs and hypothesis tests. Always report uncertainty—a 2% AUC improvement means nothing without knowing if it could be 0.5% or 3.5%.
