Relationship

Cohen's Kappa

Cohen's Kappa measures inter-rater agreement between two raters classifying items into categories. Use it to quantify how much two raters agree beyond what would be expected by chance.

Jan 293 min readRelationship Agreement Inter-Rater Reliability

Quick Hits

•Measures agreement between exactly two raters on categorical classifications
•Accounts for agreement expected by chance (unlike simple percent agreement)
•Ranges from -1 (systematic disagreement) through 0 (chance) to 1 (perfect agreement)
•Common benchmarks: < 0.2 slight, 0.2-0.4 fair, 0.4-0.6 moderate, 0.6-0.8 substantial, 0.8-1.0 near-perfect
•Requires the same two raters to classify the same set of items

The StatsTest Flow: Relationship or Prediction >> Relationship >> Agreement between raters >> Two raters

Not sure this is the right statistical method? Use the Choose Your StatsTest workflow to select the right method.

What is Cohen's Kappa?

Cohen's Kappa ( $\kappa$ ) is a statistic that measures the agreement between two raters who each classify items into mutually exclusive categories. Unlike simple percent agreement, Cohen's kappa accounts for the probability of agreement occurring by chance.

The formula is:

$\kappa = \frac{p_o - p_e}{1 - p_e}$

Where $p_o$ is the observed proportion of agreement and $p_e$ is the expected proportion of agreement by chance.

Cohen's Kappa is also called the Kappa Statistic, Kappa Coefficient, or Cohen's Kappa Coefficient.

Assumptions for Cohen's Kappa

The assumptions include:

Two Raters
Same Items Rated
Categorical Outcome
Mutually Exclusive Categories
Independent Ratings

Two Raters

Cohen's kappa is defined for exactly two raters. Both raters must classify the same set of items.

If you have three or more raters, use Krippendorff's Alpha or Fleiss' kappa instead.

Same Items Rated

Both raters must rate the same set of items. Missing ratings reduce the effective sample size.

Categorical Outcome

The rating must be categorical. For continuous ratings (e.g., scores on a 0-100 scale), use intraclass correlation (ICC) instead.

Mutually Exclusive Categories

Each item must be classified into exactly one category. Multi-label classifications require alternative approaches.

Independent Ratings

The two raters must make their judgments independently without consulting each other.

When to use Cohen's Kappa?

Two raters are classifying the same items
The outcome is categorical (two or more categories)
You want to measure agreement beyond chance
Ratings are independent

Common Applications

Data labeling quality: Do two annotators agree on sentiment labels?
Content moderation: Do two reviewers agree on policy violation classifications?
Clinical diagnosis: Do two physicians agree on disease staging?
Code review: Do two reviewers agree on severity classifications?

For measuring the strength of association between two categorical variables (not rater agreement), use Cramer's V or the Phi Coefficient.

Cohen's Kappa Example

Two content moderators independently classify 200 user posts as "safe," "borderline," or "violation."

	Rater B: Safe	Rater B: Borderline	Rater B: Violation
Rater A: Safe	120	8	2
Rater A: Borderline	10	30	5
Rater A: Violation	1	4	20

Observed agreement: (120 + 30 + 20) / 200 = 85%. Expected chance agreement: 52.5%. Cohen's kappa: $\kappa$ = (0.85 - 0.525) / (1 - 0.525) = 0.68.

Interpretation: $\kappa$ = 0.68 indicates substantial agreement. The raters agree well beyond what chance would predict, though there is room for improvement on borderline cases. Since the categories are ordinal, a weighted kappa would give partial credit for near-misses (e.g., "safe" vs. "borderline").

References

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3900058/
https://www.sciencedirect.com/topics/medicine-and-dentistry/cohens-kappa

Frequently Asked Questions

What is the difference between Cohen's kappa and percent agreement?

Percent agreement is the proportion of items where raters agree. Cohen's kappa adjusts for the probability that they would agree by chance alone. If categories are imbalanced (e.g., 90% positive), chance agreement is high and percent agreement will overstate true agreement. Kappa corrects for this.

When should I use weighted kappa?

Use weighted kappa when your categories are ordinal (e.g., mild/moderate/severe). Weighted kappa assigns partial credit for near-agreements: rating 'mild' when the other rater says 'moderate' is penalized less than rating 'mild' when they say 'severe'. For nominal categories without ordering, use unweighted kappa.

What if I have more than two raters?

Cohen's kappa is defined for exactly two raters. For three or more raters, use Krippendorff's Alpha or Fleiss' kappa instead.

Key Takeaway

Cohen's kappa is the standard measure of agreement between two raters on categorical labels, corrected for chance. It tells you whether your raters are consistently agreeing beyond what you would expect if they were guessing. Use it for quality control in labeling, content moderation, clinical diagnosis, and any task where two people classify the same items.

Send to a friend

Share this with someone who loves clean statistical work.

Facebook X Reddit LinkedIn Email