
Comparing Medians: Statistical Tests and Better Options

When you need to compare medians instead of means, standard tests often fall short. Learn about Mood's median test, quantile regression, and bootstrap methods for proper median comparison.


Quick Hits

  • Mann-Whitney doesn't test medians—it tests stochastic dominance
  • Mood's median test has low power but actually tests what it claims
  • Quantile regression is more flexible and extends to any percentile
  • Bootstrap is often the simplest approach for median confidence intervals

TL;DR

Comparing medians is harder than it looks. Mann-Whitney doesn't actually test medians (it tests stochastic dominance). Mood's median test actually tests medians but has low power. Quantile regression offers flexibility. Bootstrap provides straightforward confidence intervals. Choose based on your actual question and data.


Why Medians?

Medians matter when:

  1. Outliers dominate: A few extreme values pull the mean far from "typical"
  2. Skewed distributions: Mean and median differ substantially
  3. Ordinal data: Arithmetic mean isn't meaningful
  4. "Typical user" matters: You care about the 50th percentile experience

But consider: for many business metrics, totals matter. Total revenue = mean × users, not median × users. Medians can hide important variation.
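A two-line illustration of the outlier point: one whale moves the mean far from the typical order, while the median barely notices.

import numpy as np

# Nine typical orders plus one whale
revenue = np.array([20, 22, 25, 25, 27, 30, 31, 33, 35, 5000])
print(f"Mean:   {np.mean(revenue):.2f}")    # 524.80 -- pulled toward the whale
print(f"Median: {np.median(revenue):.2f}")  # 28.50 -- the "typical" order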


Method 1: Mood's Median Test

The only classic test that actually tests whether medians differ.

How It Works

  1. Find the grand median (median of all observations combined)
  2. Count how many observations in each group fall above/below the grand median
  3. Test whether these counts differ between groups (chi-square test)
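To make those steps concrete, here is a from-scratch sketch (it mirrors the base R version shown later):

import numpy as np
from scipy import stats

def moods_median_by_hand(group1, group2):
    # Step 1: grand median of the pooled data
    grand_median = np.median(np.concatenate([group1, group2]))
    # Step 2: above / at-or-below counts per group
    table = [
        [np.sum(group1 > grand_median), np.sum(group1 <= grand_median)],
        [np.sum(group2 > grand_median), np.sum(group2 <= grand_median)],
    ]
    # Step 3: chi-square test on the 2x2 table
    chi2, p, _, _ = stats.chi2_contingency(table)
    return chi2, p

SciPy packages the same procedure as stats.median_test: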
from scipy import stats
import numpy as np

def moods_median_test(group1, group2):
    """
    Mood's median test for comparing medians.
    """
    stat, p_value, grand_median, table = stats.median_test(group1, group2)

    return {
        'statistic': stat,
        'p_value': p_value,
        'grand_median': grand_median,
        'contingency_table': table,
        'group1_median': np.median(group1),
        'group2_median': np.median(group2)
    }


# Example
np.random.seed(42)
group1 = np.random.exponential(10, 100)
group2 = np.random.exponential(15, 100)

result = moods_median_test(group1, group2)
print(f"Group 1 median: {result['group1_median']:.2f}")
print(f"Group 2 median: {result['group2_median']:.2f}")
print(f"P-value: {result['p_value']:.4f}")

R Implementation

# Mood's median test in R
library(RVAideMemoire)
mood.medtest(value ~ group, data = df)

# Or using base R approach
grand_median <- median(c(group1, group2))
table <- matrix(c(
  sum(group1 > grand_median), sum(group1 <= grand_median),
  sum(group2 > grand_median), sum(group2 <= grand_median)
), nrow = 2, byrow = TRUE)
chisq.test(table)

Limitations

Low power: Mood's test only uses information about whether observations are above/below the median. It ignores how far above or below. This wastes information.

Ties at median: When observations equal the grand median, results depend on how ties are handled.

Large samples needed: Due to low power, you need substantial samples for reliable detection.
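The power gap is easy to see by simulation. A rough sketch under a pure location shift (where Mood's test and Mann-Whitney target the same hypothesis); exact rejection rates will vary with the settings:

import numpy as np
from scipy import stats

def rejection_rate(p_func, n_sims=1000, n=50, shift=0.5, alpha=0.05):
    """Fraction of simulated location-shift datasets a test rejects."""
    rng = np.random.default_rng(0)
    rejections = 0
    for _ in range(n_sims):
        x = rng.normal(0, 1, n)
        y = rng.normal(shift, 1, n)
        rejections += p_func(x, y) < alpha
    return rejections / n_sims

print(f"Mood's test power:  {rejection_rate(lambda x, y: stats.median_test(x, y)[1]):.2f}")
print(f"Mann-Whitney power: {rejection_rate(lambda x, y: stats.mannwhitneyu(x, y).pvalue):.2f}")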


Method 2: Bootstrap

The most straightforward approach: resample to estimate the median difference and its uncertainty.

Implementation

import numpy as np

def bootstrap_median_comparison(group1, group2, n_bootstrap=10000, alpha=0.05):
    """
    Bootstrap comparison of medians.
    """
    observed_diff = np.median(group2) - np.median(group1)

    # Bootstrap distribution of difference
    diffs = []
    for _ in range(n_bootstrap):
        boot1 = np.random.choice(group1, size=len(group1), replace=True)
        boot2 = np.random.choice(group2, size=len(group2), replace=True)
        diffs.append(np.median(boot2) - np.median(boot1))

    diffs = np.array(diffs)

    # Percentile confidence interval
    ci_lower = np.percentile(diffs, 100 * alpha / 2)
    ci_upper = np.percentile(diffs, 100 * (1 - alpha / 2))

    # Two-sided p-value via a permutation test (separate from the bootstrap CI)
    combined = np.concatenate([group1, group2])
    null_diffs = []
    for _ in range(n_bootstrap):
        np.random.shuffle(combined)
        perm1 = combined[:len(group1)]
        perm2 = combined[len(group1):]
        null_diffs.append(np.median(perm2) - np.median(perm1))

    null_diffs = np.array(null_diffs)
    p_value = np.mean(np.abs(null_diffs) >= np.abs(observed_diff))

    return {
        'median_group1': np.median(group1),
        'median_group2': np.median(group2),
        'median_difference': observed_diff,
        'ci_95': (ci_lower, ci_upper),
        'p_value': p_value
    }


# Example
result = bootstrap_median_comparison(group1, group2)
print(f"Median difference: {result['median_difference']:.2f}")
print(f"95% CI: [{result['ci_95'][0]:.2f}, {result['ci_95'][1]:.2f}]")
print(f"P-value: {result['p_value']:.4f}")

Advantages

  • Simple and intuitive: Directly estimates what you want
  • Honest uncertainty: CI width reflects actual sampling variability
  • Flexible: Works for any percentile, not just the median (see the sketch below)
  • No distributional assumptions: Works for any data

Considerations

  • Computational cost: Requires many resamples (10,000+)
  • Discrete medians: With discrete data, bootstrap medians can be jumpy
  • Small samples: Performance degrades with very small n
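Because the bootstrap is not tied to the median (the "Flexible" point above), the same recipe gives a confidence interval for any percentile. A minimal sketch for a single group:

import numpy as np

def bootstrap_quantile_ci(group, q=0.9, n_bootstrap=10000, alpha=0.05):
    """Percentile-method CI for a group's q-th quantile."""
    boots = [
        np.quantile(np.random.choice(group, size=len(group), replace=True), q)
        for _ in range(n_bootstrap)
    ]
    return np.percentile(boots, 100 * alpha / 2), np.percentile(boots, 100 * (1 - alpha / 2))


# 90th-percentile CI for group1 from the earlier example
print(bootstrap_quantile_ci(group1, q=0.9))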

Method 3: Quantile Regression

A regression framework for estimating conditional quantiles. It has more power than Mood's test and extends naturally to covariates.

Python Implementation

import statsmodels.formula.api as smf
import pandas as pd
import numpy as np

def quantile_regression_comparison(group1, group2, quantile=0.5):
    """
    Compare groups using quantile regression.
    """
    # Create dataframe
    df = pd.DataFrame({
        'value': np.concatenate([group1, group2]),
        'treatment': [0] * len(group1) + [1] * len(group2)
    })

    # Fit quantile regression
    model = smf.quantreg('value ~ treatment', df)
    result = model.fit(q=quantile)

    return {
        'quantile': quantile,
        'coefficient': result.params['treatment'],
        'std_error': result.bse['treatment'],
        'p_value': result.pvalues['treatment'],
        'ci_lower': result.conf_int().loc['treatment', 0],
        'ci_upper': result.conf_int().loc['treatment', 1]
    }


# Example: compare medians
result = quantile_regression_comparison(group1, group2, quantile=0.5)
print(f"Median difference estimate: {result['coefficient']:.2f}")
print(f"95% CI: [{result['ci_lower']:.2f}, {result['ci_upper']:.2f}]")
print(f"P-value: {result['p_value']:.4f}")

R Implementation

library(quantreg)

df <- data.frame(
  value = c(group1, group2),
  treatment = c(rep(0, length(group1)), rep(1, length(group2)))
)

# Median regression
fit <- rq(value ~ treatment, data = df, tau = 0.5)
summary(fit)
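
# Rank-based CIs are the default in summary(); bootstrap standard errors
# are also available in quantreg:
summary(fit, se = "boot")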

Advantages

  • More power than Mood's: Uses full information about values
  • Handles covariates: Can adjust for confounders (see the sketch below)
  • Any quantile: Estimate 25th, 75th percentile differences too
  • Proper inference: Standard errors and CIs
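Here is a minimal sketch of covariate adjustment (the "Handles covariates" point above). The baseline variable and its data-generating process are hypothetical, purely for illustration:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 200

# Hypothetical pre-treatment covariate and outcome
baseline = rng.exponential(10, 2 * n)
treatment = np.repeat([0, 1], n)
value = baseline + 5 * treatment + rng.normal(0, 5, 2 * n)

df = pd.DataFrame({'value': value, 'treatment': treatment, 'baseline': baseline})

# Median regression, adjusting for the covariate
fit = smf.quantreg('value ~ treatment + baseline', df).fit(q=0.5)
print(f"Adjusted median effect: {fit.params['treatment']:.2f} "
      f"(p = {fit.pvalues['treatment']:.4f})")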

Comparing Multiple Quantiles

def multi_quantile_comparison(group1, group2, quantiles=[0.25, 0.5, 0.75]):
    """
    Compare groups across multiple quantiles.
    """
    df = pd.DataFrame({
        'value': np.concatenate([group1, group2]),
        'treatment': [0] * len(group1) + [1] * len(group2)
    })

    results = []
    for q in quantiles:
        model = smf.quantreg('value ~ treatment', df)
        fit = model.fit(q=q)
        results.append({
            'quantile': q,
            'effect': fit.params['treatment'],
            'p_value': fit.pvalues['treatment']
        })

    return pd.DataFrame(results)


# Compare across distribution
comparison = multi_quantile_comparison(group1, group2)
print(comparison)

Why Mann-Whitney Doesn't Test Medians

A common misconception: "Mann-Whitney tests whether medians differ."

The Truth

Mann-Whitney tests whether P(X > Y) = 0.5. That is equivalent to a test of medians ONLY if you assume:

  • Both distributions have the same shape
  • One is just a shifted version of the other

Without this assumption, Mann-Whitney and median tests can disagree:

# Example where Mann-Whitney and median tests disagree
np.random.seed(123)

# Group 1: Normal(0, 1)
group1 = np.random.normal(0, 1, 200)

# Group 2: Same median (0), but different shape
group2 = np.concatenate([
    np.random.uniform(-3, 0, 100),   # Below median
    np.random.uniform(0, 0.5, 100)   # Above median but close
])

print(f"Group 1 median: {np.median(group1):.2f}")
print(f"Group 2 median: {np.median(group2):.2f}")

# Mood's median test: no difference
mood_stat, mood_p, _, _ = stats.median_test(group1, group2)
print(f"\nMood's test p-value: {mood_p:.4f}")

# Mann-Whitney: significant difference!
mw_stat, mw_p = stats.mannwhitneyu(group1, group2)
print(f"Mann-Whitney p-value: {mw_p:.4f}")

Medians are nearly equal, but Mann-Whitney detects that the distributions differ in stochastic ordering.


Decision Guide

  • Simple median comparison, large samples: Mood's median test or bootstrap
  • Median comparison with covariates: quantile regression
  • Want a confidence interval: bootstrap or quantile regression
  • Small samples: bootstrap
  • Multiple quantiles: quantile regression
  • Quick robustness check: Mann-Whitney (knowing its limitations)

Practical Example

# A/B test with revenue data
np.random.seed(42)

# Control: typical revenue distribution
control = np.concatenate([
    np.zeros(500),  # Non-purchasers
    np.random.exponential(50, 400),  # Regular purchasers
    np.random.exponential(500, 100)  # Whales
])

# Treatment: shifts some non-purchasers to purchasers
treatment = np.concatenate([
    np.zeros(400),  # Fewer non-purchasers
    np.random.exponential(50, 500),  # More regular purchasers
    np.random.exponential(500, 100)  # Same whales
])

print("Means:")
print(f"  Control: ${np.mean(control):.2f}")
print(f"  Treatment: ${np.mean(treatment):.2f}")

print("\nMedians:")
print(f"  Control: ${np.median(control):.2f}")
print(f"  Treatment: ${np.median(treatment):.2f}")

# Bootstrap comparison of medians
result = bootstrap_median_comparison(control, treatment)
print(f"\nMedian difference: ${result['median_difference']:.2f}")
print(f"95% CI: [${result['ci_95'][0]:.2f}, ${result['ci_95'][1]:.2f}]")


Key Takeaway

If you want to compare medians, use a method that actually tests medians: bootstrap for simplicity, quantile regression for flexibility and covariates, or Mood's median test for a basic significance test. Mann-Whitney tests something different (stochastic dominance), despite the common misconception that it compares medians. And always ask whether the median is the right summary in the first place; for many business metrics, the mean matters more.


References

  1. https://www.jstor.org/stable/2684360
  2. https://www.jstor.org/stable/2332641
  3. Koenker, R., & Bassett Jr, G. (1978). Regression quantiles. *Econometrica*, 46(1), 33-50.
  4. Mood, A. M. (1954). On the asymptotic efficiency of certain nonparametric two-sample tests. *The Annals of Mathematical Statistics*, 25(3), 514-522.
  5. Divine, G. W., Norton, H. J., Barón, A. E., & Juarez-Colunga, E. (2018). The Wilcoxon-Mann-Whitney procedure fails as a test of medians. *The American Statistician*, 72(3), 278-286.

Frequently Asked Questions

Doesn't Mann-Whitney test whether medians differ?
Only under the assumption that distributions have identical shape (just shifted). Without that assumption, Mann-Whitney tests stochastic dominance, which can differ from median comparison.
When should I compare medians instead of means?
When outliers heavily influence means, when data is ordinal, or when 'typical' value matters more than 'total.' For revenue, means often matter more (total = mean × n).
What's the best test for comparing medians?
Bootstrap for simplicity, quantile regression for flexibility and covariates, Mood's median test for a simple significance test. Mann-Whitney only if you're willing to assume equal shapes.

