Inter-Rater Reliability: Cohen's Kappa and Krippendorff's Alpha
How to measure agreement between human raters for AI evaluation. Learn when to use Cohen's Kappa vs. Krippendorff's Alpha, how to interpret values, and what to do when agreement is low.
Quick Hits
- Raw agreement is misleading: it does not account for agreement that would happen by chance
- Cohen's Kappa: 2 raters, same items. Krippendorff's Alpha: any number of raters, handles missing data
- Rules of thumb: < 0.4 = weak agreement, 0.4-0.6 = moderate, 0.6-0.8 = substantial, > 0.8 = almost perfect
- Low agreement means noisy labels, so your model evaluation will have high variance
- Fix low agreement with better guidelines, rater training, or by accepting inherent subjectivity
TL;DR
Human ratings are only useful if raters agree. Inter-rater reliability measures this agreement, correcting for chance. Cohen's Kappa works for two raters; Krippendorff's Alpha handles multiple raters and missing data. Low reliability means noisy labels and unreliable conclusions. This guide covers how to compute, interpret, and improve reliability for AI evaluation tasks.
Why Reliability Matters
The Problem with Raw Agreement
Two raters agree 85% of the time. Sounds good?
But if 90% of examples are "good" and 10% are "bad":
- Two raters guessing independently from those base rates would agree 82% of the time by chance (0.9² + 0.1² = 0.82)
- So 85% observed agreement is barely better than chance
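Working that toy scenario through the chance-corrected formula (defined in the next section) shows how little the 85% is really worth. The numbers here are the illustrative ones above, not real data:
p_o = 0.85                       # observed raw agreement
p_e = 0.9**2 + 0.1**2            # chance agreement from the 90/10 base rates
kappa = (p_o - p_e) / (1 - p_e)  # chance-corrected agreement
print(f"p_e = {p_e:.2f}, kappa = {kappa:.2f}")  # p_e = 0.82, kappa = 0.17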
What Reliability Tells You
| Reliability | Meaning | Implication |
|---|---|---|
| High (>0.8) | Raters consistent | Labels are trustworthy |
| Moderate (0.4-0.8) | Some agreement | Labels have noise, larger samples needed |
| Low (<0.4) | Little agreement | Labels unreliable, fix process or accept subjectivity |
Cohen's Kappa
The Formula
$$\kappa = \frac{p_o - p_e}{1 - p_e}$$
Where:
- $p_o$ = observed agreement (proportion where raters agree)
- $p_e$ = expected agreement by chance
Implementation
import numpy as np
from collections import Counter
def cohens_kappa(rater1, rater2):
"""
Compute Cohen's Kappa for two raters.
Parameters:
-----------
rater1, rater2 : list
Ratings from each rater (same length, same items)
Returns:
--------
dict with kappa, observed/expected agreement, SE, and CI
"""
assert len(rater1) == len(rater2), "Raters must rate same items"
n = len(rater1)
# Categories
categories = sorted(set(rater1) | set(rater2))
# Observed agreement
p_o = sum(r1 == r2 for r1, r2 in zip(rater1, rater2)) / n
# Expected agreement by chance
p_e = 0
for cat in categories:
p1 = sum(r == cat for r in rater1) / n
p2 = sum(r == cat for r in rater2) / n
p_e += p1 * p2
# Kappa
if p_e == 1:
kappa = 1.0 if p_o == 1 else 0.0
else:
kappa = (p_o - p_e) / (1 - p_e)
# Standard error (large sample approximation)
se = np.sqrt((p_o * (1 - p_o)) / (n * (1 - p_e)**2))
# 95% CI
ci_lower = kappa - 1.96 * se
ci_upper = kappa + 1.96 * se
return {
'kappa': kappa,
'observed_agreement': p_o,
'expected_agreement': p_e,
'se': se,
'ci_lower': ci_lower,
'ci_upper': ci_upper,
'n': n,
'interpretation': interpret_kappa(kappa)
}
def interpret_kappa(kappa):
"""Landis & Koch interpretation."""
if kappa < 0:
return "Poor (worse than chance)"
elif kappa < 0.20:
return "Slight"
elif kappa < 0.40:
return "Fair"
elif kappa < 0.60:
return "Moderate"
elif kappa < 0.80:
return "Substantial"
else:
return "Almost perfect"
# Example: Quality ratings (bad, okay, good)
np.random.seed(42)
n = 200
# Simulate raters with moderate agreement
true_quality = np.random.choice(['bad', 'okay', 'good'], n, p=[0.2, 0.5, 0.3])
# Raters agree with truth 70% of the time
rater1 = [q if np.random.random() < 0.7 else np.random.choice(['bad', 'okay', 'good'])
for q in true_quality]
rater2 = [q if np.random.random() < 0.7 else np.random.choice(['bad', 'okay', 'good'])
for q in true_quality]
result = cohens_kappa(rater1, rater2)
print("Cohen's Kappa Analysis")
print("=" * 50)
print(f"Number of items: {result['n']}")
print(f"Observed agreement: {result['observed_agreement']:.1%}")
print(f"Expected by chance: {result['expected_agreement']:.1%}")
print(f"\nKappa: {result['kappa']:.3f}")
print(f"95% CI: ({result['ci_lower']:.3f}, {result['ci_upper']:.3f})")
print(f"Interpretation: {result['interpretation']}")
Weighted Kappa (Ordinal Data)
When categories are ordered, some disagreements are worse than others:
def weighted_kappa(rater1, rater2, weights='linear'):
"""
Weighted Cohen's Kappa for ordinal data.
weights: 'linear' or 'quadratic'
"""
# Convert to numeric
categories = sorted(set(rater1) | set(rater2))
cat_to_num = {cat: i for i, cat in enumerate(categories)}
r1_num = [cat_to_num[r] for r in rater1]
r2_num = [cat_to_num[r] for r in rater2]
n = len(rater1)
k = len(categories)
# Weight matrix
W = np.zeros((k, k))
for i in range(k):
for j in range(k):
if weights == 'linear':
W[i, j] = abs(i - j) / (k - 1) if k > 1 else 0
else: # quadratic
W[i, j] = ((i - j) / (k - 1))**2 if k > 1 else 0
# Observed agreement (weighted)
obs_counts = np.zeros((k, k))
for r1, r2 in zip(r1_num, r2_num):
obs_counts[r1, r2] += 1
obs_counts /= n
# Expected agreement (weighted)
marginals_1 = np.array([sum(r == i for r in r1_num) / n for i in range(k)])
marginals_2 = np.array([sum(r == i for r in r2_num) / n for i in range(k)])
exp_counts = np.outer(marginals_1, marginals_2)
# Weighted kappa
observed_disagreement = np.sum(W * obs_counts)
expected_disagreement = np.sum(W * exp_counts)
kappa_w = 1 - observed_disagreement / expected_disagreement if expected_disagreement > 0 else 1
return {
'weighted_kappa': kappa_w,
'weight_type': weights,
'interpretation': interpret_kappa(kappa_w)
}
# Example: 5-point scale (1=very bad to 5=very good)
np.random.seed(42)
n = 150
true_rating = np.random.randint(1, 6, n)
rater1 = np.clip(true_rating + np.random.randint(-1, 2, n), 1, 5)
rater2 = np.clip(true_rating + np.random.randint(-1, 2, n), 1, 5)
unweighted = cohens_kappa(list(rater1), list(rater2))
linear = weighted_kappa(list(rater1), list(rater2), 'linear')
quadratic = weighted_kappa(list(rater1), list(rater2), 'quadratic')
print("Weighted Kappa for Ordinal Ratings (1-5 scale)")
print("=" * 50)
print(f"Unweighted Kappa: {unweighted['kappa']:.3f}")
print(f"Linear Weighted: {linear['weighted_kappa']:.3f}")
print(f"Quadratic Weighted: {quadratic['weighted_kappa']:.3f}")
print("\nWeighted kappa is higher because it gives partial credit")
print("for 'close' disagreements (4 vs 5 is better than 1 vs 5)")
Krippendorff's Alpha
When to Use Alpha
- More than 2 raters
- Missing data (not all raters rate all items)
- Any measurement level (nominal, ordinal, interval, ratio)
Implementation
def krippendorff_alpha(data, level='nominal'):
"""
Krippendorff's Alpha for any number of raters.
Parameters:
-----------
data : array
Shape (n_raters, n_items), with np.nan for missing
level : str
'nominal', 'ordinal', or 'interval'
Returns:
--------
Alpha coefficient and interpretation
"""
data = np.array(data, dtype=float)
n_raters, n_items = data.shape
# Distance function
def distance(v1, v2, level):
if level == 'nominal':
return 0 if v1 == v2 else 1
elif level == 'interval':
return (v1 - v2) ** 2
else: # ordinal
return abs(v1 - v2)
# Observed disagreement (within items)
D_o = 0
n_pairs_obs = 0
for item in range(n_items):
ratings = data[:, item]
ratings = ratings[~np.isnan(ratings)]
if len(ratings) < 2:
continue
for i in range(len(ratings)):
for j in range(i + 1, len(ratings)):
D_o += distance(ratings[i], ratings[j], level)
n_pairs_obs += 1
if n_pairs_obs == 0:
return {'alpha': np.nan, 'error': 'Not enough overlapping ratings'}
D_o /= n_pairs_obs
# Expected disagreement (across all ratings)
all_ratings = data[~np.isnan(data)]
D_e = 0
n_pairs_exp = 0
for i in range(len(all_ratings)):
for j in range(i + 1, len(all_ratings)):
D_e += distance(all_ratings[i], all_ratings[j], level)
n_pairs_exp += 1
D_e /= n_pairs_exp if n_pairs_exp > 0 else 1
# Alpha
alpha = 1 - D_o / D_e if D_e > 0 else 1
return {
'alpha': alpha,
'observed_disagreement': D_o,
'expected_disagreement': D_e,
'level': level,
'n_items': n_items,
'n_raters': n_raters,
'interpretation': interpret_kappa(alpha)
}
# Example: 3 raters, some missing data
np.random.seed(42)
n_items = 100
n_raters = 3
# True values
true_values = np.random.choice([1, 2, 3, 4, 5], n_items)
# Raters with noise and missing data
data = np.zeros((n_raters, n_items))
for r in range(n_raters):
for i in range(n_items):
if np.random.random() < 0.9: # 10% missing
noise = np.random.choice([-1, 0, 0, 0, 1])
data[r, i] = np.clip(true_values[i] + noise, 1, 5)
else:
data[r, i] = np.nan
result = krippendorff_alpha(data, level='interval')
print("Krippendorff's Alpha Analysis")
print("=" * 50)
print(f"Raters: {result['n_raters']}")
print(f"Items: {result['n_items']}")
print(f"Measurement level: {result['level']}")
print(f"\nAlpha: {result['alpha']:.3f}")
print(f"Interpretation: {result['interpretation']}")
Diagnosing Low Agreement
Step 1: Check Category Distribution
def agreement_by_category(rater1, rater2):
"""
Break down agreement by category.
"""
categories = sorted(set(rater1) | set(rater2))
results = []
for cat in categories:
# Items where rater1 said this category
r1_cat = [i for i, r in enumerate(rater1) if r == cat]
# What did rater2 say for those items?
if r1_cat:
r2_ratings = [rater2[i] for i in r1_cat]
agree = sum(r == cat for r in r2_ratings)
results.append({
'category': cat,
'r1_count': len(r1_cat),
'agreement_when_r1': agree / len(r1_cat)
})
return results
# Example
np.random.seed(42)
n = 200
true_cat = np.random.choice(['bad', 'okay', 'good'], n, p=[0.15, 0.60, 0.25])
rater1 = [t if np.random.random() < 0.7 else np.random.choice(['bad', 'okay', 'good']) for t in true_cat]
rater2 = [t if np.random.random() < 0.6 else np.random.choice(['bad', 'okay', 'good']) for t in true_cat]
breakdown = agreement_by_category(rater1, rater2)
print("Agreement by Category")
print("=" * 50)
for row in breakdown:
print(f"{row['category']:<10} (n={row['r1_count']:<3}): {row['agreement_when_r1']:.1%} agreement")
Step 2: Find Systematic Disagreements
def confusion_matrix_raters(rater1, rater2):
"""
Confusion matrix between raters.
"""
categories = sorted(set(rater1) | set(rater2))
matrix = {c1: {c2: 0 for c2 in categories} for c1 in categories}
for r1, r2 in zip(rater1, rater2):
matrix[r1][r2] += 1
return matrix
cm = confusion_matrix_raters(rater1, rater2)
print("\nConfusion Matrix (Rater 1 rows, Rater 2 columns)")
print("=" * 50)
cats = ['bad', 'okay', 'good']
print(f"{'':>10}", end='')
for c in cats:
print(f"{c:>10}", end='')
print()
for r1 in cats:
print(f"{r1:>10}", end='')
for r2 in cats:
print(f"{cm[r1][r2]:>10}", end='')
print()
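If pandas is available, the same rater-vs-rater table can be built in one line with pd.crosstab (purely a convenience; nothing above depends on it):
import pandas as pd

# Same confusion table as above, via pandas
print(pd.crosstab(pd.Series(rater1, name='rater1'),
                  pd.Series(rater2, name='rater2')))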
Improving Reliability
Strategy 1: Better Guidelines
## Rating Guidelines for Response Quality
### "Good" (select if ANY of these apply):
- Directly answers the question
- Provides accurate information
- Is appropriately detailed
### "Okay" (select if):
- Partially addresses the question
- Has minor errors or omissions
- Reasonable but not ideal
### "Bad" (select if ANY of these apply):
- Doesn't answer the question
- Contains significant errors
- Is confusing or unhelpful
Strategy 2: Calibration Sessions
def calibration_check(gold_labels, rater_labels):
"""
Check rater against gold standard during calibration.
"""
n = len(gold_labels)
accuracy = sum(g == r for g, r in zip(gold_labels, rater_labels)) / n
# Tally which gold → rated confusions occur most often
errors = {}
for gold, rated in zip(gold_labels, rater_labels):
if gold != rated:
key = f"{gold} → {rated}"
errors[key] = errors.get(key, 0) + 1
return {
'accuracy': accuracy,
'common_errors': sorted(errors.items(), key=lambda x: -x[1])[:5]
}
# Example: Calibration feedback
gold = ['good', 'good', 'okay', 'bad', 'okay', 'good', 'okay', 'okay', 'good', 'bad']
rater = ['good', 'okay', 'okay', 'okay', 'okay', 'good', 'good', 'okay', 'good', 'bad']
result = calibration_check(gold, rater)
print("Calibration Results")
print("=" * 50)
print(f"Accuracy vs gold standard: {result['accuracy']:.1%}")
print("\nCommon errors:")
for error, count in result['common_errors']:
print(f" {error}: {count} times")
Strategy 3: Multiple Raters + Aggregation
def majority_vote(ratings):
"""Aggregate multiple raters via majority vote."""
from collections import Counter
counts = Counter(r for r in ratings if r is not None)
if counts:
return counts.most_common(1)[0][0]
return None
def agreement_with_majority(data):
"""
Measure each rater's agreement with majority.
"""
n_raters, n_items = data.shape
majorities = []
for item in range(n_items):
ratings = [data[r, item] for r in range(n_raters) if not np.isnan(data[r, item])]
majorities.append(majority_vote(ratings))
rater_agreement = []
for r in range(n_raters):
agree = sum(data[r, i] == majorities[i]
for i in range(n_items)
if not np.isnan(data[r, i]) and majorities[i] is not None)
total = sum(1 for i in range(n_items) if not np.isnan(data[r, i]))
rater_agreement.append(agree / total if total > 0 else np.nan)
return {
'rater_agreements': rater_agreement,
'mean_agreement': np.nanmean(rater_agreement)
}
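Neither helper is exercised above, so here is a small usage sketch that reuses the simulated 3-rater data matrix from the Krippendorff's Alpha example (illustrative output only):
# Apply to the raters x items matrix from the Krippendorff's Alpha example
agg = agreement_with_majority(data)
for r, a in enumerate(agg['rater_agreements']):
    print(f"Rater {r + 1} agreement with majority: {a:.1%}")
print(f"Mean agreement with majority: {agg['mean_agreement']:.1%}")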
R Implementation
library(irr)
# Cohen's Kappa (2 raters)
kappa2(data.frame(rater1, rater2))
# Weighted Kappa
kappa2(data.frame(rater1, rater2), weight = "squared")
# Krippendorff's Alpha (multiple raters)
kripp.alpha(t(ratings_matrix), method = "interval")
# Fleiss' Kappa (multiple raters, nominal)
kappam.fleiss(ratings_matrix)
Reporting Reliability
Template
## Inter-Rater Reliability
### Setup
- **Raters**: 3 trained annotators
- **Items**: 500 model responses
- **Rating scheme**: 3-point scale (bad, okay, good)
- **Training**: 2-hour calibration session, 50 gold-standard examples
### Results
- **Krippendorff's Alpha**: 0.67 (95% CI: 0.61-0.73)
- **Interpretation**: Substantial agreement
### Pairwise Agreement
| Pair | Kappa |
|------|-------|
| R1-R2 | 0.65 |
| R1-R3 | 0.68 |
| R2-R3 | 0.71 |
### Implications
Moderate-to-substantial agreement suggests labels are reliable
enough for comparative evaluation, though absolute quality
judgments should be interpreted with caution.
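The pairwise Kappa table in that template can be generated by running cohens_kappa over every pair of raters. A minimal sketch, reusing the cohens_kappa function and the simulated raters-by-items matrix data from earlier (NaN treated as missing):
from itertools import combinations

def pairwise_kappas(ratings):
    """Cohen's Kappa for every pair of raters in a raters x items matrix (NaN = missing)."""
    results = {}
    for i, j in combinations(range(ratings.shape[0]), 2):
        # Keep only the items both raters actually rated
        both = ~np.isnan(ratings[i]) & ~np.isnan(ratings[j])
        k = cohens_kappa(list(ratings[i][both]), list(ratings[j][both]))['kappa']
        results[f"R{i + 1}-R{j + 1}"] = round(k, 3)
    return results

print(pairwise_kappas(data))  # e.g. {'R1-R2': ..., 'R1-R3': ..., 'R2-R3': ...}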
Related Methods
- Model Evaluation (Pillar) - Complete evaluation framework
- Comparing Models: Win Rate - Using ratings for comparison
- Multiple Metrics False Discoveries - Multiple testing
- Meaningful vs. Significant - Interpreting results
Key Takeaway
Human ratings are only as good as rater agreement. Measure reliability before trusting labels—Kappa and Alpha correct for chance, giving you a true measure of consistency. Low agreement (<0.4) means your labels are noisy; any conclusions drawn from them will have high uncertainty. Improve reliability through clear guidelines, calibration sessions, and rater training. When inherent subjectivity limits agreement, acknowledge this in your analysis and increase sample sizes to compensate for the noise.
References
- Cohen, J. (1960). A coefficient of agreement for nominal scales. *Educational and Psychological Measurement*, 20(1), 37-46. https://doi.org/10.1177/001316446002000104
- Krippendorff, K. (2004). Reliability in content analysis: Some common misconceptions and recommendations. *Human Communication Research*, 30(3), 411-433. https://repository.upenn.edu/asc_papers/43/
- Landis, J. R., & Koch, G. G. (1977). The measurement of observer agreement for categorical data. *Biometrics*, 33(1), 159-174.
- https://doi.org/10.1037/1082-989X.7.1.105
Frequently Asked Questions
What's a good Kappa/Alpha value?
Above 0.8 is generally trustworthy; 0.6-0.8 is usually workable for comparative evaluation; below 0.4 the labels are too noisy to rely on without fixing the rating process. Krippendorff recommends requiring at least 0.667 for tentative conclusions and 0.8 for firm ones.
Why is my Kappa low even though raters agree 80% of the time?
Because chance agreement is high when one category dominates. If 90% of items fall into a single category, raters guessing from the base rates already agree about 82% of the time, so 80% observed agreement is no better than chance and Kappa ends up near (or below) zero.
Which should I use: Kappa or Alpha?
Cohen's Kappa for exactly two raters rating the same items; Krippendorff's Alpha for more than two raters, missing ratings, or ordinal/interval scales. With two raters and complete nominal data, the two give very similar values.