Model Evaluation & Human Ratings Significance for AI Products
Statistical rigor for ML/AI evaluation: comparing model performance, analyzing human ratings, detecting drift, and making defensible decisions. A comprehensive guide for AI practitioners and product teams.
Quick Hits
- Model A beats Model B on 53% of examples—is that significant? (Often not, without proper analysis)
- Human rater agreement matters: low agreement means noisy labels, inflated variance
- Paired evaluation (same examples, both models) is more powerful than independent testing
- Multiple metrics require multiple comparison corrections—or you'll false-positive yourself
- Calibration matters: a model can have good accuracy but terrible probability estimates
TL;DR
Model evaluation in AI requires more than comparing metrics. You need statistical tests for significance, inter-rater reliability for human judgments, proper handling of multiple metrics, calibration assessment, and drift detection. This guide covers the complete framework: from comparing two models on win rate to evaluating complex systems with human raters, from single metrics to multi-dimensional quality assessment.
Why Statistical Rigor Matters
The Problem
"Model B wins on 54% of examples. Ship it!"
But:
- 54% might not be statistically different from 50% (see the quick check after this list)
- The examples might not represent production traffic
- One metric improved, three others degraded
- Human raters disagreed on 40% of examples
- Next week's evaluation gives 48%
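As a quick check of that first point, here is a minimal sketch, assuming a hypothetical evaluation of 200 non-tied comparisons where the "winning" model takes 54%:
from scipy import stats

# Hypothetical evaluation: 200 non-tied comparisons, the "winning" model takes 108 (54%)
n_compared = 200
wins = 108

# Exact two-sided binomial test against the coin-flip null of 50%
test = stats.binomtest(wins, n_compared, p=0.5, alternative='two-sided')
print(f"Win rate: {wins / n_compared:.0%}, p-value: {test.pvalue:.3f}")
# The p-value lands well above 0.05, so 54% on 200 examples is not
# distinguishable from a coin flip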
What Goes Wrong
| Mistake | Consequence |
|---|---|
| No significance test | Ship random noise as "improvement" |
| Ignore rater disagreement | Treat unreliable labels as ground truth |
| Multiple metrics, no correction | False positive on at least one |
| No calibration check | Model confident but wrong |
| No variance estimate | Can't distinguish real from noise |
Part 1: Comparing Two Models
Win Rate and Binomial Tests
The simplest comparison: Model A vs. Model B, which is better more often?
import numpy as np
from scipy import stats
def compare_models_winrate(wins_a, wins_b, ties=0):
"""
Compare two models based on win rate (excluding ties).
Parameters:
-----------
wins_a : int
Number of examples where Model A wins
wins_b : int
Number of examples where Model B wins
ties : int
Number of ties (excluded from analysis)
Returns:
--------
dict with win rates, CI, and significance test
"""
total = wins_a + wins_b # Excluding ties
if total == 0:
return {'error': 'No non-tied examples'}
# Win rate for Model A
p_a = wins_a / total
# Binomial test: H0: p = 0.5 (models equally good)
# Two-sided: either model could be better
    p_value = stats.binomtest(wins_a, total, p=0.5, alternative='two-sided').pvalue
# Wilson score CI for win rate
z = 1.96
denominator = 1 + z**2 / total
center = (p_a + z**2 / (2 * total)) / denominator
margin = z * np.sqrt((p_a * (1 - p_a) + z**2 / (4 * total)) / total) / denominator
ci_lower = center - margin
ci_upper = center + margin
return {
'wins_a': wins_a,
'wins_b': wins_b,
'ties': ties,
'total_compared': total,
'win_rate_a': p_a,
'win_rate_b': 1 - p_a,
'ci_lower': ci_lower,
'ci_upper': ci_upper,
'p_value': p_value,
'significant': p_value < 0.05,
'recommendation': 'A' if p_a > 0.5 and p_value < 0.05 else
'B' if p_a < 0.5 and p_value < 0.05 else 'No clear winner'
}
# Example: LLM comparison
result = compare_models_winrate(wins_a=285, wins_b=250, ties=65)
print("Model Comparison: Win Rate Analysis")
print("=" * 50)
print(f"Model A wins: {result['wins_a']} ({result['win_rate_a']:.1%})")
print(f"Model B wins: {result['wins_b']} ({result['win_rate_b']:.1%})")
print(f"Ties: {result['ties']}")
print(f"\n95% CI for A's win rate: ({result['ci_lower']:.1%}, {result['ci_upper']:.1%})")
print(f"p-value (vs. 50%): {result['p_value']:.4f}")
print(f"Significant at α=0.05: {result['significant']}")
print(f"Recommendation: {result['recommendation']}")
Paired Evaluation with McNemar's Test
When the same examples are evaluated by both models, use paired analysis:
def mcnemar_test(both_correct, a_only, b_only, both_wrong):
"""
McNemar's test for paired binary outcomes.
Compares: (A correct, B wrong) vs. (A wrong, B correct)
"""
# Discordant pairs
n_discordant = a_only + b_only
if n_discordant < 25:
# Exact binomial for small samples
        p_value = stats.binomtest(a_only, n_discordant, p=0.5, alternative='two-sided').pvalue
else:
# Chi-squared approximation with continuity correction
chi2 = (abs(a_only - b_only) - 1)**2 / (a_only + b_only)
p_value = 1 - stats.chi2.cdf(chi2, df=1)
return {
'both_correct': both_correct,
'a_only_correct': a_only,
'b_only_correct': b_only,
'both_wrong': both_wrong,
'total': both_correct + a_only + b_only + both_wrong,
'accuracy_a': (both_correct + a_only) / (both_correct + a_only + b_only + both_wrong),
'accuracy_b': (both_correct + b_only) / (both_correct + a_only + b_only + both_wrong),
'p_value': p_value,
'significant': p_value < 0.05
}
# Example: Classification models
result = mcnemar_test(both_correct=720, a_only=85, b_only=55, both_wrong=140)
print("McNemar's Test: Paired Classification Comparison")
print("=" * 50)
print(f"Both correct: {result['both_correct']}")
print(f"Only A correct: {result['a_only_correct']}")
print(f"Only B correct: {result['b_only_correct']}")
print(f"Both wrong: {result['both_wrong']}")
print(f"\nAccuracy A: {result['accuracy_a']:.1%}")
print(f"Accuracy B: {result['accuracy_b']:.1%}")
print(f"p-value: {result['p_value']:.4f}")
print(f"Significant difference: {result['significant']}")
Bootstrap for Metric Differences
For continuous metrics (AUC, F1, BLEU), use bootstrap:
def bootstrap_metric_comparison(metric_a, metric_b, n_bootstrap=5000):
"""
Bootstrap comparison of paired metrics.
Parameters:
-----------
metric_a : array
Per-example metric values for Model A
metric_b : array
Per-example metric values for Model B
Returns comparison statistics.
"""
n = len(metric_a)
diff_observed = np.mean(metric_a) - np.mean(metric_b)
# Bootstrap the difference
boot_diffs = []
for _ in range(n_bootstrap):
idx = np.random.choice(n, n, replace=True)
boot_diff = np.mean(metric_a[idx]) - np.mean(metric_b[idx])
boot_diffs.append(boot_diff)
boot_diffs = np.array(boot_diffs)
# CI and p-value
ci = np.percentile(boot_diffs, [2.5, 97.5])
    # Two-sided p-value: twice the smaller tail proportion of bootstrap
    # differences on either side of zero (a common bootstrap approximation)
    p_value = 2 * min(np.mean(boot_diffs <= 0), np.mean(boot_diffs >= 0))
return {
'mean_a': np.mean(metric_a),
'mean_b': np.mean(metric_b),
'difference': diff_observed,
'se': np.std(boot_diffs),
'ci_lower': ci[0],
'ci_upper': ci[1],
'p_value': min(p_value, 1.0),
'significant': ci[0] > 0 or ci[1] < 0 # CI excludes 0
}
# Example: BLEU scores
np.random.seed(42)
bleu_a = np.random.beta(8, 2, 500) * 100 # Model A BLEU scores
bleu_b = np.random.beta(7.5, 2, 500) * 100 # Model B slightly worse
result = bootstrap_metric_comparison(bleu_a, bleu_b)
print("Bootstrap Metric Comparison: BLEU Scores")
print("=" * 50)
print(f"Model A mean BLEU: {result['mean_a']:.2f}")
print(f"Model B mean BLEU: {result['mean_b']:.2f}")
print(f"Difference (A - B): {result['difference']:.2f}")
print(f"95% CI: ({result['ci_lower']:.2f}, {result['ci_upper']:.2f})")
print(f"p-value: {result['p_value']:.4f}")
print(f"Significant: {result['significant']}")
Part 2: Human Ratings and Agreement
Inter-Rater Reliability
Before trusting human ratings, measure how much raters agree.
def cohens_kappa(rater1, rater2):
"""
Cohen's Kappa for two raters on categorical ratings.
"""
# Confusion matrix
categories = sorted(set(rater1) | set(rater2))
n = len(rater1)
# Observed agreement
agree = sum(r1 == r2 for r1, r2 in zip(rater1, rater2))
p_o = agree / n
# Expected agreement by chance
p_e = 0
for cat in categories:
p1 = sum(r == cat for r in rater1) / n
p2 = sum(r == cat for r in rater2) / n
p_e += p1 * p2
# Kappa
kappa = (p_o - p_e) / (1 - p_e) if p_e < 1 else 0
return {
'kappa': kappa,
'observed_agreement': p_o,
'expected_agreement': p_e,
'interpretation': interpret_kappa(kappa)
}
def interpret_kappa(kappa):
"""Standard kappa interpretation."""
if kappa < 0:
return "Poor (worse than chance)"
elif kappa < 0.20:
return "Slight"
elif kappa < 0.40:
return "Fair"
elif kappa < 0.60:
return "Moderate"
elif kappa < 0.80:
return "Substantial"
else:
return "Almost perfect"
# Example: Two raters evaluating response quality
np.random.seed(42)
n_examples = 200
categories = ['bad', 'okay', 'good']
# Simulate raters with moderate agreement
true_quality = np.random.choice(categories, n_examples, p=[0.2, 0.5, 0.3])
rater1 = [q if np.random.random() < 0.7 else np.random.choice(categories) for q in true_quality]
rater2 = [q if np.random.random() < 0.7 else np.random.choice(categories) for q in true_quality]
result = cohens_kappa(rater1, rater2)
print("Inter-Rater Reliability: Cohen's Kappa")
print("=" * 50)
print(f"Observed agreement: {result['observed_agreement']:.1%}")
print(f"Expected by chance: {result['expected_agreement']:.1%}")
print(f"Cohen's Kappa: {result['kappa']:.3f}")
print(f"Interpretation: {result['interpretation']}")
Krippendorff's Alpha (Multiple Raters)
For more than two raters:
def krippendorff_alpha(ratings_matrix, level='nominal'):
"""
Krippendorff's Alpha for multiple raters.
Parameters:
-----------
ratings_matrix : array
Shape (n_raters, n_items), with NaN for missing
level : str
'nominal', 'ordinal', or 'interval'
"""
# Flatten to pairs
n_raters, n_items = ratings_matrix.shape
# Observed disagreement
observed_disagreement = 0
n_pairs = 0
for item in range(n_items):
ratings = [r for r in ratings_matrix[:, item] if not np.isnan(r)]
if len(ratings) < 2:
continue
for i in range(len(ratings)):
for j in range(i + 1, len(ratings)):
if level == 'nominal':
d = 0 if ratings[i] == ratings[j] else 1
elif level == 'interval':
d = (ratings[i] - ratings[j]) ** 2
else:
d = abs(ratings[i] - ratings[j])
observed_disagreement += d
n_pairs += 1
if n_pairs == 0:
return {'alpha': np.nan}
D_o = observed_disagreement / n_pairs
# Expected disagreement (across all ratings)
all_ratings = ratings_matrix[~np.isnan(ratings_matrix)]
n_total = len(all_ratings)
expected_disagreement = 0
n_expected_pairs = 0
for i in range(n_total):
for j in range(i + 1, n_total):
if level == 'nominal':
d = 0 if all_ratings[i] == all_ratings[j] else 1
elif level == 'interval':
d = (all_ratings[i] - all_ratings[j]) ** 2
else:
d = abs(all_ratings[i] - all_ratings[j])
expected_disagreement += d
n_expected_pairs += 1
D_e = expected_disagreement / n_expected_pairs if n_expected_pairs > 0 else 0
alpha = 1 - D_o / D_e if D_e > 0 else 0
return {
'alpha': alpha,
'observed_disagreement': D_o,
'expected_disagreement': D_e,
        'interpretation': interpret_kappa(alpha)  # Kappa benchmarks as a rough guide; Krippendorff suggests alpha >= 0.8 for reliable data
}
# Example: Three raters on 1-5 scale
np.random.seed(42)
n_items = 100
n_raters = 3
# True scores
true_scores = np.random.randint(1, 6, n_items)
# Each rater adds noise
ratings = np.zeros((n_raters, n_items))
for r in range(n_raters):
noise = np.random.randint(-1, 2, n_items)
ratings[r, :] = np.clip(true_scores + noise, 1, 5)
result = krippendorff_alpha(ratings, level='interval')
print("Inter-Rater Reliability: Krippendorff's Alpha")
print("=" * 50)
print(f"Number of raters: {n_raters}")
print(f"Number of items: {n_items}")
print(f"Alpha (interval): {result['alpha']:.3f}")
print(f"Interpretation: {result['interpretation']}")
Part 3: Multiple Metrics and Comparisons
The Multiple Testing Problem
Testing 10 metrics at α=0.05 gives ~40% chance of at least one false positive.
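That figure follows directly from the family-wise error rate under the simplifying assumption that the tests are independent:
# Chance of at least one false positive across k independent tests at level alpha
alpha, k = 0.05, 10
fwer = 1 - (1 - alpha) ** k
print(f"P(at least one false positive) = {fwer:.1%}")  # roughly 40%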
def multiple_comparison_correction(p_values, method='holm'):
"""
Adjust p-values for multiple comparisons.
Methods:
- bonferroni: Simple, conservative
- holm: Less conservative, controls FWER
- fdr: Controls false discovery rate (Benjamini-Hochberg)
"""
n = len(p_values)
p_values = np.array(p_values)
if method == 'bonferroni':
adjusted = np.minimum(p_values * n, 1.0)
elif method == 'holm':
# Sort p-values
sorted_idx = np.argsort(p_values)
adjusted = np.zeros(n)
for i, idx in enumerate(sorted_idx):
multiplier = n - i
adjusted[idx] = min(p_values[idx] * multiplier, 1.0)
# Enforce monotonicity
for i in range(1, n):
idx = sorted_idx[i]
prev_idx = sorted_idx[i-1]
adjusted[idx] = max(adjusted[idx], adjusted[prev_idx])
elif method == 'fdr':
# Benjamini-Hochberg
sorted_idx = np.argsort(p_values)
adjusted = np.zeros(n)
for i, idx in enumerate(sorted_idx):
adjusted[idx] = min(p_values[idx] * n / (i + 1), 1.0)
# Enforce monotonicity (backwards)
for i in range(n - 2, -1, -1):
idx = sorted_idx[i]
next_idx = sorted_idx[i+1]
adjusted[idx] = min(adjusted[idx], adjusted[next_idx])
return adjusted
# Example: Evaluating model across 8 metrics
np.random.seed(42)
metrics = ['Accuracy', 'F1', 'Precision', 'Recall', 'AUC', 'BLEU', 'ROUGE', 'Perplexity']
p_values = [0.03, 0.01, 0.08, 0.15, 0.02, 0.04, 0.25, 0.45]
print("Multiple Comparison Correction")
print("=" * 60)
print(f"{'Metric':<12} {'Raw p':>10} {'Bonferroni':>12} {'Holm':>10} {'FDR':>10}")
print("-" * 60)
bonf = multiple_comparison_correction(p_values, 'bonferroni')
holm = multiple_comparison_correction(p_values, 'holm')
fdr = multiple_comparison_correction(p_values, 'fdr')
for i, metric in enumerate(metrics):
print(f"{metric:<12} {p_values[i]:>10.3f} {bonf[i]:>12.3f} {holm[i]:>10.3f} {fdr[i]:>10.3f}")
print("\nSignificant at α=0.05:")
print(f" Raw: {sum(p < 0.05 for p in p_values)} metrics")
print(f" Bonferroni: {sum(p < 0.05 for p in bonf)} metrics")
print(f" Holm: {sum(p < 0.05 for p in holm)} metrics")
print(f" FDR: {sum(p < 0.05 for p in fdr)} metrics")
Part 4: Calibration and Reliability
Calibration Assessment
A model's confidence should match its accuracy:
def calibration_analysis(predicted_probs, true_labels, n_bins=10):
"""
Analyze model calibration.
Returns ECE, reliability diagram data, and Brier score.
"""
bins = np.linspace(0, 1, n_bins + 1)
bin_indices = np.digitize(predicted_probs, bins) - 1
bin_indices = np.clip(bin_indices, 0, n_bins - 1)
bin_accuracies = []
bin_confidences = []
bin_counts = []
for i in range(n_bins):
mask = bin_indices == i
if mask.sum() > 0:
bin_acc = true_labels[mask].mean()
bin_conf = predicted_probs[mask].mean()
bin_count = mask.sum()
else:
bin_acc = np.nan
bin_conf = (bins[i] + bins[i+1]) / 2
bin_count = 0
bin_accuracies.append(bin_acc)
bin_confidences.append(bin_conf)
bin_counts.append(bin_count)
# Expected Calibration Error
ece = 0
total = sum(bin_counts)
for acc, conf, count in zip(bin_accuracies, bin_confidences, bin_counts):
if not np.isnan(acc):
ece += (count / total) * abs(acc - conf)
# Brier score
brier = np.mean((predicted_probs - true_labels) ** 2)
return {
'ece': ece,
'brier_score': brier,
'bin_edges': bins,
'bin_accuracies': bin_accuracies,
'bin_confidences': bin_confidences,
'bin_counts': bin_counts
}
# Example: Comparing well-calibrated vs. overconfident model
np.random.seed(42)
n = 1000
# True labels
true_labels = np.random.binomial(1, 0.4, n)
# Well-calibrated model
calibrated_probs = true_labels * np.random.beta(8, 2, n) + (1 - true_labels) * np.random.beta(2, 8, n)
calibrated_probs = np.clip(calibrated_probs, 0.01, 0.99)
# Overconfident model
overconfident_probs = np.where(calibrated_probs > 0.5,
0.5 + (calibrated_probs - 0.5) * 1.5,
0.5 - (0.5 - calibrated_probs) * 1.5)
overconfident_probs = np.clip(overconfident_probs, 0.01, 0.99)
cal_result = calibration_analysis(calibrated_probs, true_labels)
over_result = calibration_analysis(overconfident_probs, true_labels)
print("Calibration Analysis")
print("=" * 50)
print("\nWell-Calibrated Model:")
print(f" ECE: {cal_result['ece']:.4f}")
print(f" Brier Score: {cal_result['brier_score']:.4f}")
print("\nOverconfident Model:")
print(f" ECE: {over_result['ece']:.4f}")
print(f" Brier Score: {over_result['brier_score']:.4f}")
Part 5: Drift Detection
Detecting Distribution Shift
def ks_drift_test(reference_scores, current_scores, threshold=0.05):
"""
Kolmogorov-Smirnov test for distribution drift.
"""
statistic, p_value = stats.ks_2samp(reference_scores, current_scores)
return {
'ks_statistic': statistic,
'p_value': p_value,
'drift_detected': p_value < threshold,
'interpretation': f"{'Significant' if p_value < threshold else 'No significant'} drift detected"
}
def psi_drift(reference, current, n_bins=10):
"""
Population Stability Index for drift detection.
PSI < 0.1: No significant change
PSI 0.1-0.25: Moderate change, investigate
PSI > 0.25: Significant change, action needed
"""
# Bin edges from reference
bins = np.percentile(reference, np.linspace(0, 100, n_bins + 1))
bins[0] = -np.inf
bins[-1] = np.inf
# Count proportions
ref_counts = np.histogram(reference, bins)[0] / len(reference)
cur_counts = np.histogram(current, bins)[0] / len(current)
# Avoid zeros
ref_counts = np.maximum(ref_counts, 0.001)
cur_counts = np.maximum(cur_counts, 0.001)
# PSI
psi = np.sum((cur_counts - ref_counts) * np.log(cur_counts / ref_counts))
if psi < 0.1:
interpretation = "No significant change"
elif psi < 0.25:
interpretation = "Moderate change - investigate"
else:
interpretation = "Significant change - action needed"
return {
'psi': psi,
'interpretation': interpretation
}
# Example: Monitoring model scores over time
np.random.seed(42)
# Reference period (training data)
reference = np.random.normal(0.7, 0.15, 1000)
reference = np.clip(reference, 0, 1)
# Current period (slight drift)
current = np.random.normal(0.65, 0.18, 500) # Lower mean, higher variance
current = np.clip(current, 0, 1)
ks_result = ks_drift_test(reference, current)
psi_result = psi_drift(reference, current)
print("Drift Detection Analysis")
print("=" * 50)
print(f"Reference: n={len(reference)}, mean={np.mean(reference):.3f}, std={np.std(reference):.3f}")
print(f"Current: n={len(current)}, mean={np.mean(current):.3f}, std={np.std(current):.3f}")
print(f"\nKS Test:")
print(f" Statistic: {ks_result['ks_statistic']:.4f}")
print(f" p-value: {ks_result['p_value']:.4f}")
print(f" {ks_result['interpretation']}")
print(f"\nPSI:")
print(f" Value: {psi_result['psi']:.4f}")
print(f" {psi_result['interpretation']}")
Part 6: Practical Sample Size
Power Analysis for Model Comparison
def sample_size_winrate(baseline_winrate=0.5, effect_size=0.05, power=0.8, alpha=0.05):
"""
Sample size needed to detect a win rate difference.
Parameters:
-----------
baseline_winrate : float
Expected win rate under null (usually 0.5)
effect_size : float
        Minimum detectable difference in win rate (e.g., 0.05 = 52.5% vs 47.5%)
power : float
Desired power (e.g., 0.8)
alpha : float
Significance level
"""
from scipy.stats import norm
p1 = baseline_winrate + effect_size / 2
p2 = baseline_winrate - effect_size / 2
# Pooled proportion (under null)
p_pool = baseline_winrate
# Z values
z_alpha = norm.ppf(1 - alpha / 2)
z_beta = norm.ppf(power)
    # Two-proportion sample size formula, used as an approximation for the number of non-tied comparisons
numerator = (z_alpha * np.sqrt(2 * p_pool * (1 - p_pool)) +
z_beta * np.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
denominator = (p1 - p2) ** 2
n = numerator / denominator
return int(np.ceil(n))
# Example: How many examples to detect 5% win rate difference?
for effect in [0.02, 0.05, 0.10, 0.15]:
n = sample_size_winrate(effect_size=effect)
print(f"Detect {effect/2:.0%} vs {1-effect/2:.0%} win rate: n={n:,} examples")
Summary: The Evaluation Checklist
Before Evaluation
- Define success criteria (what improvement is meaningful?)
- Choose appropriate test (paired vs unpaired, win rate vs metric)
- Determine sample size via power analysis
- Plan for multiple comparisons if testing many metrics
- Train raters and measure inter-rater agreement
During Evaluation
- Use paired evaluation when possible (same examples, both models)
- Randomize presentation order to avoid bias (see the sketch after this list)
- Track rater agreement throughout
- Monitor for evaluation set drift
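A minimal sketch of the presentation-order step, assuming a side-by-side pairwise rating setup (the example IDs and layout labels are illustrative):
import numpy as np

rng = np.random.default_rng(0)

def randomize_sides(example_ids):
    """Randomly decide, per example, whether Model A's output appears on the
    left or the right, so raters cannot learn a positional pattern."""
    sides = rng.choice(['A_left', 'A_right'], size=len(example_ids))
    return [{'example_id': ex, 'layout': side} for ex, side in zip(example_ids, sides)]

for assignment in randomize_sides(['ex-001', 'ex-002', 'ex-003']):
    print(assignment)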
After Evaluation
- Compute confidence intervals, not just point estimates
- Apply multiple comparison correction if needed
- Check calibration for probability outputs
- Report uncertainty: "A beats B on 54% ± 3% of examples" (see the reporting sketch after this list)
- Document methodology for reproducibility
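One way to produce that kind of statement is to reuse the compare_models_winrate function from Part 1; this sketch summarizes the Wilson interval as a symmetric ± margin (the counts are illustrative):
# Turn the win-rate analysis into a one-line uncertainty statement
result = compare_models_winrate(wins_a=285, wins_b=250, ties=65)
half_width = (result['ci_upper'] - result['ci_lower']) / 2
print(f"A beats B on {result['win_rate_a']:.0%} ± {half_width:.0%} "
      f"of non-tied examples (p = {result['p_value']:.3f})")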
Related Articles
Specific Methods
- Comparing Models: Win Rate and Binomial CI - Win rate analysis
- Inter-Rater Reliability - Agreement metrics
- Paired Evaluation: McNemar's Test - Paired comparisons
- Bootstrap for Metric Deltas - Metric uncertainty
Quality Assurance
- Multiple Metrics and False Discoveries - Multiple testing
- Calibration Checks - Probability calibration
- Drift Detection - Distribution monitoring
- Meaningful vs. Significant - Practical significance
Key Takeaway
Model evaluation requires statistical rigor. A 54% win rate doesn't by itself mean Model A is better; you need confidence intervals and significance tests. Human ratings are only useful if raters agree, so measure reliability before trusting labels. Multiple metrics require multiple comparison corrections, or you'll false-positive yourself. Calibration matters separately from accuracy: overconfident models fail silently. Build evaluation as a discipline: plan sample sizes, measure agreement, quantify uncertainty, and report what you don't know alongside what you do. A 2% improvement with p=0.001 is likely real; a 5% improvement with p=0.2 is probably noise. Without this rigor, you'll either ship bad models or fail to ship good ones.
References
- https://aclanthology.org/2020.acl-main.442/
- https://arxiv.org/abs/2303.16634
- Demšar, J. (2006). Statistical comparisons of classifiers over multiple data sets. *Journal of Machine Learning Research*, 7, 1-30. https://www.jmlr.org/papers/v7/demsar06a.html
- Card, D., Henderson, P., Khandelwal, U., & Levy, R. (2020). With little power comes great responsibility. *ACL*, 3182-3193.
- Krippendorff, K. (2004). *Content Analysis: An Introduction to Its Methodology* (2nd ed.). Sage Publications.