
Model Evaluation & Human Ratings Significance for AI Products

Statistical rigor for ML/AI evaluation: comparing model performance, analyzing human ratings, detecting drift, and making defensible decisions. A comprehensive guide for AI practitioners and product teams.

Quick Hits

  • Model A beats Model B on 53% of examples—is that significant? (Often not, without proper analysis)
  • Human rater agreement matters: low agreement means noisy labels, inflated variance
  • Paired evaluation (same examples, both models) is more powerful than independent testing
  • Multiple metrics require multiple comparison corrections—or you'll false-positive yourself
  • Calibration matters: a model can have good accuracy but terrible probability estimates

TL;DR

Model evaluation in AI requires more than comparing metrics. You need statistical tests for significance, inter-rater reliability for human judgments, proper handling of multiple metrics, calibration assessment, and drift detection. This guide covers the complete framework: from comparing two models on win rate to evaluating complex systems with human raters, from single metrics to multi-dimensional quality assessment.


Why Statistical Rigor Matters

The Problem

"Model B wins on 54% of examples. Ship it!"

But:

  • 54% might not be statistically different from 50%
  • The examples might not represent production traffic
  • One metric improved, three others degraded
  • Human raters disagreed on 40% of examples
  • Next week's evaluation gives 48%

What Goes Wrong

Mistake                            Consequence
No significance test               Ship random noise as "improvement"
Ignore rater disagreement          Treat unreliable labels as ground truth
Multiple metrics, no correction    False positive on at least one
No calibration check               Model confident but wrong
No variance estimate               Can't distinguish real from noise

Part 1: Comparing Two Models

Win Rate and Binomial Tests

The simplest comparison: Model A vs. Model B, which is better more often?

import numpy as np
from scipy import stats


def compare_models_winrate(wins_a, wins_b, ties=0):
    """
    Compare two models based on win rate (excluding ties).

    Parameters:
    -----------
    wins_a : int
        Number of examples where Model A wins
    wins_b : int
        Number of examples where Model B wins
    ties : int
        Number of ties (excluded from analysis)

    Returns:
    --------
    dict with win rates, CI, and significance test
    """
    total = wins_a + wins_b  # Excluding ties

    if total == 0:
        return {'error': 'No non-tied examples'}

    # Win rate for Model A
    p_a = wins_a / total

    # Binomial test: H0: p = 0.5 (models equally good)
    # Two-sided: either model could be better
    p_value = stats.binomtest(wins_a, total, p=0.5, alternative='two-sided').pvalue  # scipy >= 1.7

    # Wilson score CI for win rate
    z = 1.96
    denominator = 1 + z**2 / total
    center = (p_a + z**2 / (2 * total)) / denominator
    margin = z * np.sqrt((p_a * (1 - p_a) + z**2 / (4 * total)) / total) / denominator

    ci_lower = center - margin
    ci_upper = center + margin

    return {
        'wins_a': wins_a,
        'wins_b': wins_b,
        'ties': ties,
        'total_compared': total,
        'win_rate_a': p_a,
        'win_rate_b': 1 - p_a,
        'ci_lower': ci_lower,
        'ci_upper': ci_upper,
        'p_value': p_value,
        'significant': p_value < 0.05,
        'recommendation': 'A' if p_a > 0.5 and p_value < 0.05 else
                          'B' if p_a < 0.5 and p_value < 0.05 else 'No clear winner'
    }


# Example: LLM comparison
result = compare_models_winrate(wins_a=285, wins_b=250, ties=65)

print("Model Comparison: Win Rate Analysis")
print("=" * 50)
print(f"Model A wins: {result['wins_a']} ({result['win_rate_a']:.1%})")
print(f"Model B wins: {result['wins_b']} ({result['win_rate_b']:.1%})")
print(f"Ties: {result['ties']}")
print(f"\n95% CI for A's win rate: ({result['ci_lower']:.1%}, {result['ci_upper']:.1%})")
print(f"p-value (vs. 50%): {result['p_value']:.4f}")
print(f"Significant at α=0.05: {result['significant']}")
print(f"Recommendation: {result['recommendation']}")

Paired Evaluation with McNemar's Test

When the same examples are evaluated by both models, use paired analysis:

def mcnemar_test(both_correct, a_only, b_only, both_wrong):
    """
    McNemar's test for paired binary outcomes.

    Compares: (A correct, B wrong) vs. (A wrong, B correct)
    """
    # Discordant pairs
    n_discordant = a_only + b_only

    if n_discordant < 25:
        # Exact binomial for small samples
        p_value = stats.binomtest(a_only, n_discordant, p=0.5, alternative='two-sided').pvalue
    else:
        # Chi-squared approximation with continuity correction
        chi2 = (abs(a_only - b_only) - 1)**2 / (a_only + b_only)
        p_value = 1 - stats.chi2.cdf(chi2, df=1)

    return {
        'both_correct': both_correct,
        'a_only_correct': a_only,
        'b_only_correct': b_only,
        'both_wrong': both_wrong,
        'total': both_correct + a_only + b_only + both_wrong,
        'accuracy_a': (both_correct + a_only) / (both_correct + a_only + b_only + both_wrong),
        'accuracy_b': (both_correct + b_only) / (both_correct + a_only + b_only + both_wrong),
        'p_value': p_value,
        'significant': p_value < 0.05
    }


# Example: Classification models
result = mcnemar_test(both_correct=720, a_only=85, b_only=55, both_wrong=140)

print("McNemar's Test: Paired Classification Comparison")
print("=" * 50)
print(f"Both correct: {result['both_correct']}")
print(f"Only A correct: {result['a_only_correct']}")
print(f"Only B correct: {result['b_only_correct']}")
print(f"Both wrong: {result['both_wrong']}")
print(f"\nAccuracy A: {result['accuracy_a']:.1%}")
print(f"Accuracy B: {result['accuracy_b']:.1%}")
print(f"p-value: {result['p_value']:.4f}")
print(f"Significant difference: {result['significant']}")

Bootstrap for Metric Differences

For continuous metrics computed per example (BLEU, ROUGE, per-example F1), use a paired bootstrap:

def bootstrap_metric_comparison(metric_a, metric_b, n_bootstrap=5000):
    """
    Bootstrap comparison of paired metrics.

    Parameters:
    -----------
    metric_a : array
        Per-example metric values for Model A
    metric_b : array
        Per-example metric values for Model B

    Returns comparison statistics.
    """
    n = len(metric_a)
    diff_observed = np.mean(metric_a) - np.mean(metric_b)

    # Bootstrap the difference
    boot_diffs = []
    for _ in range(n_bootstrap):
        idx = np.random.choice(n, n, replace=True)
        boot_diff = np.mean(metric_a[idx]) - np.mean(metric_b[idx])
        boot_diffs.append(boot_diff)

    boot_diffs = np.array(boot_diffs)

    # CI and p-value
    ci = np.percentile(boot_diffs, [2.5, 97.5])

    # Two-sided p-value: twice the smaller tail of the bootstrap distribution around zero
    p_value = 2 * min(np.mean(boot_diffs <= 0), np.mean(boot_diffs >= 0))

    return {
        'mean_a': np.mean(metric_a),
        'mean_b': np.mean(metric_b),
        'difference': diff_observed,
        'se': np.std(boot_diffs),
        'ci_lower': ci[0],
        'ci_upper': ci[1],
        'p_value': min(p_value, 1.0),
        'significant': ci[0] > 0 or ci[1] < 0  # CI excludes 0
    }


# Example: BLEU scores
np.random.seed(42)
bleu_a = np.random.beta(8, 2, 500) * 100  # Model A BLEU scores
bleu_b = np.random.beta(7.5, 2, 500) * 100  # Model B slightly worse

result = bootstrap_metric_comparison(bleu_a, bleu_b)

print("Bootstrap Metric Comparison: BLEU Scores")
print("=" * 50)
print(f"Model A mean BLEU: {result['mean_a']:.2f}")
print(f"Model B mean BLEU: {result['mean_b']:.2f}")
print(f"Difference (A - B): {result['difference']:.2f}")
print(f"95% CI: ({result['ci_lower']:.2f}, {result['ci_upper']:.2f})")
print(f"p-value: {result['p_value']:.4f}")
print(f"Significant: {result['significant']}")

Part 2: Human Ratings and Agreement

Inter-Rater Reliability

Before trusting human ratings, measure how much raters agree.

def cohens_kappa(rater1, rater2):
    """
    Cohen's Kappa for two raters on categorical ratings.
    """
    # Confusion matrix
    categories = sorted(set(rater1) | set(rater2))
    n = len(rater1)

    # Observed agreement
    agree = sum(r1 == r2 for r1, r2 in zip(rater1, rater2))
    p_o = agree / n

    # Expected agreement by chance
    p_e = 0
    for cat in categories:
        p1 = sum(r == cat for r in rater1) / n
        p2 = sum(r == cat for r in rater2) / n
        p_e += p1 * p2

    # Kappa
    kappa = (p_o - p_e) / (1 - p_e) if p_e < 1 else 0

    return {
        'kappa': kappa,
        'observed_agreement': p_o,
        'expected_agreement': p_e,
        'interpretation': interpret_kappa(kappa)
    }


def interpret_kappa(kappa):
    """Standard kappa interpretation."""
    if kappa < 0:
        return "Poor (worse than chance)"
    elif kappa < 0.20:
        return "Slight"
    elif kappa < 0.40:
        return "Fair"
    elif kappa < 0.60:
        return "Moderate"
    elif kappa < 0.80:
        return "Substantial"
    else:
        return "Almost perfect"


# Example: Two raters evaluating response quality
np.random.seed(42)
n_examples = 200
categories = ['bad', 'okay', 'good']

# Simulate raters with moderate agreement
true_quality = np.random.choice(categories, n_examples, p=[0.2, 0.5, 0.3])
rater1 = [q if np.random.random() < 0.7 else np.random.choice(categories) for q in true_quality]
rater2 = [q if np.random.random() < 0.7 else np.random.choice(categories) for q in true_quality]

result = cohens_kappa(rater1, rater2)

print("Inter-Rater Reliability: Cohen's Kappa")
print("=" * 50)
print(f"Observed agreement: {result['observed_agreement']:.1%}")
print(f"Expected by chance: {result['expected_agreement']:.1%}")
print(f"Cohen's Kappa: {result['kappa']:.3f}")
print(f"Interpretation: {result['interpretation']}")

Krippendorff's Alpha (Multiple Raters)

For more than two raters:

def krippendorff_alpha(ratings_matrix, level='nominal'):
    """
    Krippendorff's Alpha for multiple raters.

    Parameters:
    -----------
    ratings_matrix : array
        Shape (n_raters, n_items), with NaN for missing
    level : str
        'nominal', 'ordinal', or 'interval'
    """
    # Flatten to pairs
    n_raters, n_items = ratings_matrix.shape

    # Observed disagreement
    observed_disagreement = 0
    n_pairs = 0

    for item in range(n_items):
        ratings = [r for r in ratings_matrix[:, item] if not np.isnan(r)]
        if len(ratings) < 2:
            continue

        for i in range(len(ratings)):
            for j in range(i + 1, len(ratings)):
                if level == 'nominal':
                    d = 0 if ratings[i] == ratings[j] else 1
                elif level == 'interval':
                    d = (ratings[i] - ratings[j]) ** 2
                else:
                    d = abs(ratings[i] - ratings[j])

                observed_disagreement += d
                n_pairs += 1

    if n_pairs == 0:
        return {'alpha': np.nan}

    D_o = observed_disagreement / n_pairs

    # Expected disagreement (across all ratings)
    all_ratings = ratings_matrix[~np.isnan(ratings_matrix)]
    n_total = len(all_ratings)

    expected_disagreement = 0
    n_expected_pairs = 0

    for i in range(n_total):
        for j in range(i + 1, n_total):
            if level == 'nominal':
                d = 0 if all_ratings[i] == all_ratings[j] else 1
            elif level == 'interval':
                d = (all_ratings[i] - all_ratings[j]) ** 2
            else:
                d = abs(all_ratings[i] - all_ratings[j])

            expected_disagreement += d
            n_expected_pairs += 1

    D_e = expected_disagreement / n_expected_pairs if n_expected_pairs > 0 else 0

    alpha = 1 - D_o / D_e if D_e > 0 else 0

    return {
        'alpha': alpha,
        'observed_disagreement': D_o,
        'expected_disagreement': D_e,
        'interpretation': interpret_kappa(alpha)  # Rough guide; Krippendorff recommends alpha >= 0.8 for reliable data
    }


# Example: Three raters on 1-5 scale
np.random.seed(42)
n_items = 100
n_raters = 3

# True scores
true_scores = np.random.randint(1, 6, n_items)

# Each rater adds noise
ratings = np.zeros((n_raters, n_items))
for r in range(n_raters):
    noise = np.random.randint(-1, 2, n_items)
    ratings[r, :] = np.clip(true_scores + noise, 1, 5)

result = krippendorff_alpha(ratings, level='interval')

print("Inter-Rater Reliability: Krippendorff's Alpha")
print("=" * 50)
print(f"Number of raters: {n_raters}")
print(f"Number of items: {n_items}")
print(f"Alpha (interval): {result['alpha']:.3f}")
print(f"Interpretation: {result['interpretation']}")

Part 3: Multiple Metrics and Comparisons

The Multiple Testing Problem

Testing 10 metrics at α=0.05 gives roughly a 40% chance of at least one false positive (1 - 0.95^10 ≈ 0.40, assuming independent tests and all null hypotheses true).
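
A quick back-of-the-envelope check of that figure (a minimal sketch assuming k independent tests with every null hypothesis true):

alpha = 0.05
for k in [1, 3, 5, 10, 20]:
    fwer = 1 - (1 - alpha) ** k  # P(at least one false positive) across k independent tests
    print(f"{k:>2} tests at alpha={alpha}: family-wise error rate = {fwer:.1%}")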

def multiple_comparison_correction(p_values, method='holm'):
    """
    Adjust p-values for multiple comparisons.

    Methods:
    - bonferroni: Simple, conservative
    - holm: Less conservative, controls FWER
    - fdr: Controls false discovery rate (Benjamini-Hochberg)
    """
    n = len(p_values)
    p_values = np.array(p_values)

    if method == 'bonferroni':
        adjusted = np.minimum(p_values * n, 1.0)

    elif method == 'holm':
        # Sort p-values
        sorted_idx = np.argsort(p_values)
        adjusted = np.zeros(n)

        for i, idx in enumerate(sorted_idx):
            multiplier = n - i
            adjusted[idx] = min(p_values[idx] * multiplier, 1.0)

        # Enforce monotonicity
        for i in range(1, n):
            idx = sorted_idx[i]
            prev_idx = sorted_idx[i-1]
            adjusted[idx] = max(adjusted[idx], adjusted[prev_idx])

    elif method == 'fdr':
        # Benjamini-Hochberg
        sorted_idx = np.argsort(p_values)
        adjusted = np.zeros(n)

        for i, idx in enumerate(sorted_idx):
            adjusted[idx] = min(p_values[idx] * n / (i + 1), 1.0)

        # Enforce monotonicity (backwards)
        for i in range(n - 2, -1, -1):
            idx = sorted_idx[i]
            next_idx = sorted_idx[i+1]
            adjusted[idx] = min(adjusted[idx], adjusted[next_idx])

    return adjusted


# Example: Evaluating model across 8 metrics
np.random.seed(42)
metrics = ['Accuracy', 'F1', 'Precision', 'Recall', 'AUC', 'BLEU', 'ROUGE', 'Perplexity']
p_values = [0.03, 0.01, 0.08, 0.15, 0.02, 0.04, 0.25, 0.45]

print("Multiple Comparison Correction")
print("=" * 60)
print(f"{'Metric':<12} {'Raw p':>10} {'Bonferroni':>12} {'Holm':>10} {'FDR':>10}")
print("-" * 60)

bonf = multiple_comparison_correction(p_values, 'bonferroni')
holm = multiple_comparison_correction(p_values, 'holm')
fdr = multiple_comparison_correction(p_values, 'fdr')

for i, metric in enumerate(metrics):
    print(f"{metric:<12} {p_values[i]:>10.3f} {bonf[i]:>12.3f} {holm[i]:>10.3f} {fdr[i]:>10.3f}")

print("\nSignificant at α=0.05:")
print(f"  Raw: {sum(p < 0.05 for p in p_values)} metrics")
print(f"  Bonferroni: {sum(p < 0.05 for p in bonf)} metrics")
print(f"  Holm: {sum(p < 0.05 for p in holm)} metrics")
print(f"  FDR: {sum(p < 0.05 for p in fdr)} metrics")

Part 4: Calibration and Reliability

Calibration Assessment

A model's confidence should match its accuracy:

def calibration_analysis(predicted_probs, true_labels, n_bins=10):
    """
    Analyze model calibration.

    Returns ECE, reliability diagram data, and Brier score.
    """
    bins = np.linspace(0, 1, n_bins + 1)
    bin_indices = np.digitize(predicted_probs, bins) - 1
    bin_indices = np.clip(bin_indices, 0, n_bins - 1)

    bin_accuracies = []
    bin_confidences = []
    bin_counts = []

    for i in range(n_bins):
        mask = bin_indices == i
        if mask.sum() > 0:
            bin_acc = true_labels[mask].mean()
            bin_conf = predicted_probs[mask].mean()
            bin_count = mask.sum()
        else:
            bin_acc = np.nan
            bin_conf = (bins[i] + bins[i+1]) / 2
            bin_count = 0

        bin_accuracies.append(bin_acc)
        bin_confidences.append(bin_conf)
        bin_counts.append(bin_count)

    # Expected Calibration Error
    ece = 0
    total = sum(bin_counts)
    for acc, conf, count in zip(bin_accuracies, bin_confidences, bin_counts):
        if not np.isnan(acc):
            ece += (count / total) * abs(acc - conf)

    # Brier score
    brier = np.mean((predicted_probs - true_labels) ** 2)

    return {
        'ece': ece,
        'brier_score': brier,
        'bin_edges': bins,
        'bin_accuracies': bin_accuracies,
        'bin_confidences': bin_confidences,
        'bin_counts': bin_counts
    }


# Example: Comparing well-calibrated vs. overconfident model
np.random.seed(42)
n = 1000

# True labels
true_labels = np.random.binomial(1, 0.4, n)

# Well-calibrated model
calibrated_probs = true_labels * np.random.beta(8, 2, n) + (1 - true_labels) * np.random.beta(2, 8, n)
calibrated_probs = np.clip(calibrated_probs, 0.01, 0.99)

# Overconfident model
overconfident_probs = np.where(calibrated_probs > 0.5,
                               0.5 + (calibrated_probs - 0.5) * 1.5,
                               0.5 - (0.5 - calibrated_probs) * 1.5)
overconfident_probs = np.clip(overconfident_probs, 0.01, 0.99)

cal_result = calibration_analysis(calibrated_probs, true_labels)
over_result = calibration_analysis(overconfident_probs, true_labels)

print("Calibration Analysis")
print("=" * 50)
print("\nWell-Calibrated Model:")
print(f"  ECE: {cal_result['ece']:.4f}")
print(f"  Brier Score: {cal_result['brier_score']:.4f}")

print("\nOverconfident Model:")
print(f"  ECE: {over_result['ece']:.4f}")
print(f"  Brier Score: {over_result['brier_score']:.4f}")

Part 5: Drift Detection

Detecting Distribution Shift

def ks_drift_test(reference_scores, current_scores, threshold=0.05):
    """
    Kolmogorov-Smirnov test for distribution drift.
    """
    statistic, p_value = stats.ks_2samp(reference_scores, current_scores)

    return {
        'ks_statistic': statistic,
        'p_value': p_value,
        'drift_detected': p_value < threshold,
        'interpretation': f"{'Significant' if p_value < threshold else 'No significant'} drift detected"
    }


def psi_drift(reference, current, n_bins=10):
    """
    Population Stability Index for drift detection.

    PSI < 0.1: No significant change
    PSI 0.1-0.25: Moderate change, investigate
    PSI > 0.25: Significant change, action needed
    """
    # Bin edges from reference
    bins = np.percentile(reference, np.linspace(0, 100, n_bins + 1))
    bins[0] = -np.inf
    bins[-1] = np.inf

    # Count proportions
    ref_counts = np.histogram(reference, bins)[0] / len(reference)
    cur_counts = np.histogram(current, bins)[0] / len(current)

    # Avoid zeros
    ref_counts = np.maximum(ref_counts, 0.001)
    cur_counts = np.maximum(cur_counts, 0.001)

    # PSI
    psi = np.sum((cur_counts - ref_counts) * np.log(cur_counts / ref_counts))

    if psi < 0.1:
        interpretation = "No significant change"
    elif psi < 0.25:
        interpretation = "Moderate change - investigate"
    else:
        interpretation = "Significant change - action needed"

    return {
        'psi': psi,
        'interpretation': interpretation
    }


# Example: Monitoring model scores over time
np.random.seed(42)

# Reference period (training data)
reference = np.random.normal(0.7, 0.15, 1000)
reference = np.clip(reference, 0, 1)

# Current period (slight drift)
current = np.random.normal(0.65, 0.18, 500)  # Lower mean, higher variance
current = np.clip(current, 0, 1)

ks_result = ks_drift_test(reference, current)
psi_result = psi_drift(reference, current)

print("Drift Detection Analysis")
print("=" * 50)
print(f"Reference: n={len(reference)}, mean={np.mean(reference):.3f}, std={np.std(reference):.3f}")
print(f"Current: n={len(current)}, mean={np.mean(current):.3f}, std={np.std(current):.3f}")
print(f"\nKS Test:")
print(f"  Statistic: {ks_result['ks_statistic']:.4f}")
print(f"  p-value: {ks_result['p_value']:.4f}")
print(f"  {ks_result['interpretation']}")
print(f"\nPSI:")
print(f"  Value: {psi_result['psi']:.4f}")
print(f"  {psi_result['interpretation']}")

Part 6: Practical Sample Size

Power Analysis for Model Comparison

def sample_size_winrate(baseline_winrate=0.5, effect_size=0.05, power=0.8, alpha=0.05):
    """
    Sample size needed to detect a win rate difference.

    Parameters:
    -----------
    baseline_winrate : float
        Expected win rate under null (usually 0.5)
    effect_size : float
        Minimum detectable difference in win rate (e.g., 0.05 = 52.5% vs 47.5%)
    power : float
        Desired power (e.g., 0.8)
    alpha : float
        Significance level
    """
    from scipy.stats import norm

    p1 = baseline_winrate + effect_size / 2
    p2 = baseline_winrate - effect_size / 2

    # Pooled proportion (under null)
    p_pool = baseline_winrate

    # Z values
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)

    # Standard two-proportion sample size approximation; treat the result as a rough planning number
    numerator = (z_alpha * np.sqrt(2 * p_pool * (1 - p_pool)) +
                 z_beta * np.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    denominator = (p1 - p2) ** 2

    n = numerator / denominator

    return int(np.ceil(n))


# Example: How many examples to detect 5% win rate difference?
for effect in [0.02, 0.05, 0.10, 0.15]:
    n = sample_size_winrate(effect_size=effect)
    print(f"Detect {effect/2:.0%} vs {1-effect/2:.0%} win rate: n={n:,} examples")

Summary: The Evaluation Checklist

Before Evaluation

  • Define success criteria (what improvement is meaningful?)
  • Choose appropriate test (paired vs unpaired, win rate vs metric)
  • Determine sample size via power analysis
  • Plan for multiple comparisons if testing many metrics
  • Train raters and measure inter-rater agreement

During Evaluation

  • Use paired evaluation when possible (same examples, both models)
  • Randomize presentation order to avoid position bias (see the sketch after this list)
  • Track rater agreement throughout
  • Monitor for evaluation set drift
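
A minimal sketch of presentation-order randomization for pairwise human comparisons (the helper present_pair and the left/right slot names are illustrative assumptions; the key point is recording the flip so votes can be mapped back to models):

import numpy as np

rng = np.random.default_rng(0)


def present_pair(output_a, output_b):
    """Randomly assign the two model outputs to left/right display slots and record the flip."""
    flipped = rng.random() < 0.5
    left, right = (output_b, output_a) if flipped else (output_a, output_b)
    return {'left': left, 'right': right, 'a_is_left': not flipped}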

After Evaluation

  • Compute confidence intervals, not just point estimates
  • Apply multiple comparison correction if needed
  • Check calibration for probability outputs
  • Report uncertainty: "A beats B on 54% ± 3% of examples" (see the margin calculation after this list)
  • Document methodology for reproducibility
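
Where does the ±3% come from? A rough normal-approximation margin, assuming about 1,000 non-tied paired examples (both numbers here are illustrative):

import numpy as np

p_hat, n = 0.54, 1000  # hypothetical win rate and count of non-tied examples
margin = 1.96 * np.sqrt(p_hat * (1 - p_hat) / n)  # 95% margin of error
print(f"A beats B on {p_hat:.0%} ± {margin:.1%} of examples")  # about 54% ± 3.1%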


Key Takeaway

Model evaluation requires statistical rigor. A 54% win rate doesn't mean Model A is better—you need confidence intervals and significance tests. Human ratings are only useful if raters agree—measure reliability before trusting labels. Multiple metrics require multiple comparison corrections—or you'll false-positive yourself. Calibration matters separately from accuracy—overconfident models fail silently. Build evaluation as a discipline: plan sample sizes, measure agreement, quantify uncertainty, and report what you don't know alongside what you do.



Frequently Asked Questions

How many examples do I need to compare two models?
Depends on the effect size and variance. For win rates around 50%, detecting a 5% difference (52.5% vs 47.5%) needs ~1500 paired examples. For clearer differences (55% vs 45%), ~400 examples suffice. Use power analysis with your pilot data's variance.
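
For example, plugging these numbers into the sample_size_winrate helper from Part 6 (same two-proportion approximation, so treat the output as rough planning figures):

for effect, label in [(0.05, "52.5% vs 47.5%"), (0.10, "55% vs 45%")]:
    n = sample_size_winrate(effect_size=effect)
    print(f"{label}: about {n:,} paired examples")
# With the default alpha and power this prints roughly 1,600 and 400 examples
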
Should I use human raters or automated metrics?
Both serve different purposes. Automated metrics (BLEU, ROUGE, accuracy) are cheap and reproducible but may not capture what users care about. Human ratings capture nuanced quality but are expensive and noisy. Use automated metrics for development iteration, human evaluation for launch decisions.
How do I handle disagreement between raters?
First, measure agreement (Cohen's kappa, Krippendorff's alpha). Low agreement (<0.4) suggests unclear criteria or genuinely ambiguous examples. Improve guidelines, add calibration sessions, or accept that some examples have no 'right' answer. For analysis, use majority vote, weighted average, or model the disagreement explicitly.
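
For the majority-vote option, a minimal aggregation sketch (with an even rater count or a tie this falls back to the first label seen, an arbitrary choice worth handling explicitly):

from collections import Counter


def majority_vote(labels):
    """Return the most common label among raters for a single item."""
    return Counter(labels).most_common(1)[0][0]


print(majority_vote(['good', 'okay', 'good']))  # -> 'good'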

Key Takeaway

Evaluating AI models requires statistical rigor beyond simple metric comparisons. Use paired evaluation when possible, account for rater disagreement, correct for multiple comparisons, check calibration, and always quantify uncertainty. A 2% improvement with p=0.001 is real; a 5% improvement with p=0.2 is noise. Without proper statistical analysis, you'll either ship bad models or fail to ship good ones.
