Bootstrap for Metric Deltas: AUC, F1, and Other ML Metrics
How to compute confidence intervals and p-values for differences in ML metrics like AUC, F1, and precision. Learn paired bootstrap for defensible model comparisons.
Quick Hits
- Most ML metrics don't have a closed-form variance; the bootstrap is the answer
- Paired bootstrap: resample examples, compute the metric for both models on the same resample
- 5,000+ resamples for CI bounds, 10,000+ for p-values
- Check whether the CI excludes zero, not just the point estimate
- Report the uncertainty: 'AUC improved by 0.02 (95% CI: 0.005 to 0.035)'
TL;DR
ML metrics like AUC, F1, and precision don't have simple variance formulas. Bootstrap provides the solution: resample your test data, compute metrics on each resample, and use the distribution for inference. For comparing models, use paired bootstrap—resample examples while keeping both models' predictions together. This gives you confidence intervals and p-values for metric differences that are defensible and accurate.
Why Bootstrap for ML Metrics?
The Problem
You computed AUC = 0.85 for Model A and AUC = 0.83 for Model B.
Questions:
- Is that 0.02 difference statistically significant?
- What's the confidence interval?
- Could it be sampling noise?
Why Not Analytical Formulas?
| Metric | Analytical Variance | Complexity |
|---|---|---|
| Accuracy | Yes (binomial) | Simple |
| AUC | Yes (DeLong) | Moderate |
| F1 | No | Complex |
| Precision@k | No | Complex |
| Custom metrics | No | Varies |
Bootstrap handles all of these uniformly.
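For the one metric in the table with a simple closed form, accuracy, you can sanity-check the bootstrap against the binomial (normal-approximation) interval. A minimal sketch with simulated correctness indicators; the data here is illustrative, not from any real model:

import numpy as np

# Hypothetical example: 1,000 test examples, ~87% classified correctly
rng = np.random.default_rng(0)
correct = rng.binomial(1, 0.87, size=1000)   # 1 = correct prediction, 0 = wrong

# Closed-form CI (normal approximation to the binomial)
p_hat = correct.mean()
se = np.sqrt(p_hat * (1 - p_hat) / len(correct))
print(f"Analytical 95% CI: ({p_hat - 1.96*se:.3f}, {p_hat + 1.96*se:.3f})")

# Bootstrap CI: the same recipe that also works for F1, precision@k, custom metrics
boot = [rng.choice(correct, len(correct), replace=True).mean() for _ in range(5000)]
print(f"Bootstrap  95% CI: ({np.percentile(boot, 2.5):.3f}, {np.percentile(boot, 97.5):.3f})")

The two intervals should be nearly identical here; the point is that the bootstrap recipe doesn't change when the metric does.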
Paired Bootstrap for Model Comparison
The Algorithm
import numpy as np
from sklearn.metrics import roc_auc_score, f1_score, precision_score, recall_score

def paired_bootstrap_comparison(y_true, pred_a, pred_b, metric_fn,
                                n_bootstrap=5000, seed=None):
    """
    Paired bootstrap comparison of two models.

    Parameters:
    -----------
    y_true : array
        True labels
    pred_a : array
        Model A predictions (probabilities for AUC, class labels for F1)
    pred_b : array
        Model B predictions
    metric_fn : callable
        Function(y_true, y_pred) -> metric value
    n_bootstrap : int
        Number of bootstrap resamples

    Returns:
    --------
    dict with metrics, difference CI, and p-value
    """
    if seed is not None:
        np.random.seed(seed)

    n = len(y_true)

    # Original metrics
    metric_a = metric_fn(y_true, pred_a)
    metric_b = metric_fn(y_true, pred_b)
    diff_observed = metric_a - metric_b

    # Bootstrap
    boot_diffs = []
    for _ in range(n_bootstrap):
        # Sample indices WITH replacement
        idx = np.random.choice(n, n, replace=True)
        # Compute metrics on same resample for both models
        boot_a = metric_fn(y_true[idx], pred_a[idx])
        boot_b = metric_fn(y_true[idx], pred_b[idx])
        boot_diffs.append(boot_a - boot_b)

    boot_diffs = np.array(boot_diffs)

    # Confidence interval (percentile method)
    ci_lower = np.percentile(boot_diffs, 2.5)
    ci_upper = np.percentile(boot_diffs, 97.5)

    # P-value (two-sided test against H0: diff = 0)
    if diff_observed > 0:
        p_value = 2 * np.mean(boot_diffs <= 0)
    else:
        p_value = 2 * np.mean(boot_diffs >= 0)
    p_value = min(p_value, 1.0)

    # Alternative: check if CI excludes 0
    significant_by_ci = ci_lower > 0 or ci_upper < 0

    return {
        'metric_a': metric_a,
        'metric_b': metric_b,
        'difference': diff_observed,
        'se': np.std(boot_diffs),
        'ci_lower': ci_lower,
        'ci_upper': ci_upper,
        'p_value': p_value,
        'significant': significant_by_ci,
        'n_bootstrap': n_bootstrap
    }
# Example: AUC comparison
np.random.seed(42)
n = 1000
# Simulate data
y_true = np.random.binomial(1, 0.3, n) # 30% positive
# Model A: Good model
prob_a = y_true * np.random.beta(8, 2, n) + (1 - y_true) * np.random.beta(2, 8, n)
# Model B: Slightly worse
prob_b = y_true * np.random.beta(7, 2, n) + (1 - y_true) * np.random.beta(2, 7, n)
result = paired_bootstrap_comparison(
    y_true, prob_a, prob_b,
    metric_fn=roc_auc_score,
    n_bootstrap=5000,
    seed=42
)
print("Bootstrap AUC Comparison")
print("=" * 50)
print(f"Model A AUC: {result['metric_a']:.4f}")
print(f"Model B AUC: {result['metric_b']:.4f}")
print(f"Difference (A - B): {result['difference']:.4f}")
print(f"Bootstrap SE: {result['se']:.4f}")
print(f"95% CI: ({result['ci_lower']:.4f}, {result['ci_upper']:.4f})")
print(f"p-value: {result['p_value']:.4f}")
print(f"Significant (CI excludes 0): {result['significant']}")
Multiple Metrics
Comparing Across Metrics
def multi_metric_comparison(y_true, pred_a, pred_b, prob_a=None, prob_b=None,
                            n_bootstrap=5000, seed=None):
    """
    Compare models across multiple metrics.
    Uses class predictions for F1/precision/recall, probabilities for AUC.
    """
    if seed is not None:
        np.random.seed(seed)

    results = {}

    # Metrics with class predictions
    class_metrics = {
        'accuracy': lambda y, p: np.mean(y == p),
        'f1': lambda y, p: f1_score(y, p, zero_division=0),
        'precision': lambda y, p: precision_score(y, p, zero_division=0),
        'recall': lambda y, p: recall_score(y, p, zero_division=0)
    }

    for name, metric_fn in class_metrics.items():
        result = paired_bootstrap_comparison(
            y_true, pred_a, pred_b,
            metric_fn=metric_fn,
            n_bootstrap=n_bootstrap
        )
        results[name] = result

    # AUC (needs probabilities)
    if prob_a is not None and prob_b is not None:
        result = paired_bootstrap_comparison(
            y_true, prob_a, prob_b,
            metric_fn=roc_auc_score,
            n_bootstrap=n_bootstrap
        )
        results['auc'] = result

    return results
# Example
np.random.seed(42)
n = 1000
y_true = np.random.binomial(1, 0.3, n)
# Probabilities
prob_a = y_true * np.random.beta(8, 2, n) + (1 - y_true) * np.random.beta(2, 8, n)
prob_b = y_true * np.random.beta(7, 3, n) + (1 - y_true) * np.random.beta(3, 7, n)
# Class predictions (threshold 0.5)
pred_a = (prob_a > 0.5).astype(int)
pred_b = (prob_b > 0.5).astype(int)
results = multi_metric_comparison(y_true, pred_a, pred_b, prob_a, prob_b)
print("Multi-Metric Comparison")
print("=" * 70)
print(f"{'Metric':<12} {'A':>10} {'B':>10} {'Diff':>10} {'95% CI':>20} {'p-value':>10}")
print("-" * 70)
for name, r in results.items():
    ci_str = f"({r['ci_lower']:.3f}, {r['ci_upper']:.3f})"
    print(f"{name:<12} {r['metric_a']:>10.3f} {r['metric_b']:>10.3f} "
          f"{r['difference']:>+10.3f} {ci_str:>20} {r['p_value']:>10.4f}")
Handling Threshold-Dependent Metrics
For F1, precision, and recall, the classification threshold matters, so compare models at an explicit threshold (or sweep several):
def bootstrap_at_threshold(y_true, prob_a, prob_b, threshold=0.5, n_bootstrap=5000):
    """
    Bootstrap comparison with explicit threshold.
    """
    pred_a = (prob_a >= threshold).astype(int)
    pred_b = (prob_b >= threshold).astype(int)
    return paired_bootstrap_comparison(
        y_true, pred_a, pred_b,
        metric_fn=f1_score,
        n_bootstrap=n_bootstrap
    )

def bootstrap_across_thresholds(y_true, prob_a, prob_b,
                                thresholds=(0.3, 0.4, 0.5, 0.6, 0.7)):
    """
    Compare F1 at multiple thresholds.
    """
    results = []
    for t in thresholds:
        result = bootstrap_at_threshold(y_true, prob_a, prob_b, threshold=t)
        result['threshold'] = t
        results.append(result)
    return results
# Example
threshold_results = bootstrap_across_thresholds(y_true, prob_a, prob_b)
print("\nF1 Comparison Across Thresholds")
print("=" * 60)
print(f"{'Threshold':>10} {'F1 A':>10} {'F1 B':>10} {'Diff':>10} {'Significant':>12}")
print("-" * 60)
for r in threshold_results:
    print(f"{r['threshold']:>10.1f} {r['metric_a']:>10.3f} {r['metric_b']:>10.3f} "
          f"{r['difference']:>+10.3f} {str(r['significant']):>12}")
BCa Bootstrap (More Accurate CIs)
from scipy import stats

def bca_bootstrap_comparison(y_true, pred_a, pred_b, metric_fn, n_bootstrap=5000):
    """
    BCa (bias-corrected and accelerated) bootstrap for metric comparison.
    More accurate CIs than percentile method for skewed distributions.
    """
    n = len(y_true)

    # Original difference
    diff_obs = metric_fn(y_true, pred_a) - metric_fn(y_true, pred_b)

    # Standard bootstrap
    boot_diffs = []
    for _ in range(n_bootstrap):
        idx = np.random.choice(n, n, replace=True)
        boot_diff = metric_fn(y_true[idx], pred_a[idx]) - metric_fn(y_true[idx], pred_b[idx])
        boot_diffs.append(boot_diff)
    boot_diffs = np.array(boot_diffs)

    # Bias correction (z0)
    prop_less = np.mean(boot_diffs < diff_obs)
    z0 = stats.norm.ppf(prop_less) if 0 < prop_less < 1 else 0

    # Acceleration (jackknife)
    jack_diffs = []
    for i in range(n):
        idx = np.concatenate([np.arange(i), np.arange(i + 1, n)])
        jack_diff = metric_fn(y_true[idx], pred_a[idx]) - metric_fn(y_true[idx], pred_b[idx])
        jack_diffs.append(jack_diff)
    jack_diffs = np.array(jack_diffs)

    jack_mean = jack_diffs.mean()
    num = np.sum((jack_mean - jack_diffs)**3)
    denom = 6 * (np.sum((jack_mean - jack_diffs)**2)**1.5)
    a = num / denom if denom != 0 else 0

    # BCa adjusted percentiles
    alpha = 0.05
    z_low = stats.norm.ppf(alpha / 2)
    z_high = stats.norm.ppf(1 - alpha / 2)

    def bca_quantile(z):
        return stats.norm.cdf(z0 + (z0 + z) / (1 - a * (z0 + z)))

    p_low = bca_quantile(z_low)
    p_high = bca_quantile(z_high)

    ci_lower = np.percentile(boot_diffs, 100 * p_low)
    ci_upper = np.percentile(boot_diffs, 100 * p_high)

    return {
        'difference': diff_obs,
        'ci_lower_percentile': np.percentile(boot_diffs, 2.5),
        'ci_upper_percentile': np.percentile(boot_diffs, 97.5),
        'ci_lower_bca': ci_lower,
        'ci_upper_bca': ci_upper,
        'bias_correction': z0,
        'acceleration': a
    }
# Example
result = bca_bootstrap_comparison(y_true, prob_a, prob_b, roc_auc_score)
print("BCa vs Percentile Bootstrap")
print("=" * 50)
print(f"Difference: {result['difference']:.4f}")
print(f"\nPercentile CI: ({result['ci_lower_percentile']:.4f}, {result['ci_upper_percentile']:.4f})")
print(f"BCa CI: ({result['ci_lower_bca']:.4f}, {result['ci_upper_bca']:.4f})")
Single Model Confidence Interval
Sometimes you just want a CI for a single model's metric:
def bootstrap_single_metric(y_true, predictions, metric_fn, n_bootstrap=5000):
    """
    Bootstrap CI for a single model's metric.
    """
    n = len(y_true)
    metric_obs = metric_fn(y_true, predictions)

    boot_metrics = []
    for _ in range(n_bootstrap):
        idx = np.random.choice(n, n, replace=True)
        boot_metric = metric_fn(y_true[idx], predictions[idx])
        boot_metrics.append(boot_metric)
    boot_metrics = np.array(boot_metrics)

    return {
        'metric': metric_obs,
        'se': np.std(boot_metrics),
        'ci_lower': np.percentile(boot_metrics, 2.5),
        'ci_upper': np.percentile(boot_metrics, 97.5)
    }
# Example
result = bootstrap_single_metric(y_true, prob_a, roc_auc_score)
print(f"AUC: {result['metric']:.4f} (95% CI: {result['ci_lower']:.4f} to {result['ci_upper']:.4f})")
R Implementation
library(boot)
library(pROC)
# Bootstrap for AUC difference
boot_auc_diff <- function(data, indices) {
  d <- data[indices, ]
  auc_a <- auc(d$y_true, d$prob_a)
  auc_b <- auc(d$y_true, d$prob_b)
  as.numeric(auc_a - auc_b)
}
data <- data.frame(y_true = y_true, prob_a = prob_a, prob_b = prob_b)
boot_result <- boot(data, boot_auc_diff, R = 5000)
# BCa confidence interval
boot.ci(boot_result, type = "bca")
# DeLong test for AUC (analytical)
roc.test(roc(y_true, prob_a), roc(y_true, prob_b), method = "delong")
Reporting Template
## Model Comparison Results
### AUC
- Model A: 0.856 (95% CI: 0.831-0.879)
- Model B: 0.842 (95% CI: 0.815-0.867)
- Difference: +0.014 (95% CI: -0.002 to +0.030)
- p-value: 0.089
- **Interpretation**: Model A shows a trend toward better discrimination,
but the difference is not statistically significant at α=0.05.
### F1 Score (threshold = 0.5)
- Model A: 0.721 (95% CI: 0.685-0.755)
- Model B: 0.698 (95% CI: 0.661-0.734)
- Difference: +0.023 (95% CI: -0.008 to +0.054)
- p-value: 0.142
### Methodology
Bootstrap confidence intervals computed using 5,000 paired resamples
with BCa correction. P-values computed as proportion of bootstrap
samples with difference on opposite side of zero from observed.
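To keep reports like this consistent, it can help to generate the metric sections directly from the dicts returned by the functions above. A minimal, hypothetical formatting helper (format_comparison and its layout are illustrative, not part of any library):

def format_comparison(name, result_a, result_b, diff_result):
    """Render one metric section of the report from bootstrap result dicts."""
    return (
        f"### {name}\n"
        f"- Model A: {result_a['metric']:.3f} "
        f"(95% CI: {result_a['ci_lower']:.3f}-{result_a['ci_upper']:.3f})\n"
        f"- Model B: {result_b['metric']:.3f} "
        f"(95% CI: {result_b['ci_lower']:.3f}-{result_b['ci_upper']:.3f})\n"
        f"- Difference: {diff_result['difference']:+.3f} "
        f"(95% CI: {diff_result['ci_lower']:+.3f} to {diff_result['ci_upper']:+.3f})\n"
        f"- p-value: {diff_result['p_value']:.3f}"
    )

# Example: combine single-model CIs with the paired comparison for AUC
auc_a = bootstrap_single_metric(y_true, prob_a, roc_auc_score)
auc_b = bootstrap_single_metric(y_true, prob_b, roc_auc_score)
auc_diff = paired_bootstrap_comparison(y_true, prob_a, prob_b, roc_auc_score)
print(format_comparison("AUC", auc_a, auc_b, auc_diff))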
Related Methods
- Model Evaluation (Pillar) - Complete framework
- Bootstrap for Heavy-Tailed Metrics - Bootstrap deep dive
- Comparing Models: Win Rate - Win rate analysis
- Delta Method vs. Bootstrap - Method comparison
Key Takeaway
Bootstrap is the standard tool for ML metric uncertainty. When comparing models: (1) use paired bootstrap—resample examples while keeping both models' predictions together; (2) compute the metric difference on each resample; (3) use the bootstrap distribution for CIs and p-values. Report the uncertainty: "AUC improved by 0.02 (95% CI: 0.005 to 0.035)" is far more informative than "AUC improved from 0.83 to 0.85." Without uncertainty quantification, you can't distinguish real improvements from noise.
References
- https://doi.org/10.1002/sim.5777
- https://www.jstor.org/stable/2965714
- https://arxiv.org/abs/1811.00062
- Efron, B., & Tibshirani, R. J. (1993). *An Introduction to the Bootstrap*. Chapman & Hall.
- Hanley, J. A., & McNeil, B. J. (1982). The meaning and use of the area under a receiver operating characteristic (ROC) curve. *Radiology*, 143(1), 29-36.
- Carpenter, J., & Bithell, J. (2000). Bootstrap confidence intervals: when, which, what? *Statistics in Medicine*, 19(9), 1141-1164.
Frequently Asked Questions
Why bootstrap instead of DeLong's test for AUC?
DeLong's test is a perfectly good analytical choice for AUC specifically, and the two approaches usually agree closely. The advantage of the bootstrap is uniformity: the same paired-resampling workflow also covers F1, precision@k, and custom metrics that have no closed-form variance.
Should I use paired or unpaired bootstrap?
Paired, whenever both models are evaluated on the same test set. Resampling examples while keeping both models' predictions together preserves the correlation between their errors, which gives a correct and much tighter interval for the difference.
How do I get a p-value from bootstrap?
Take the proportion of bootstrap differences that fall on the opposite side of zero from the observed difference and double it for a two-sided test, as in paired_bootstrap_comparison above. Use 10,000+ resamples if you need stable p-values.
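For reference, an essentially equivalent, symmetric restatement of the p-value logic used in paired_bootstrap_comparison (the helper name is illustrative; boot_diffs is the array of bootstrap differences):

import numpy as np

def bootstrap_p_value(boot_diffs):
    """Two-sided bootstrap p-value: twice the smaller tail proportion around zero."""
    boot_diffs = np.asarray(boot_diffs)
    return min(1.0, 2 * min(np.mean(boot_diffs <= 0), np.mean(boot_diffs >= 0)))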