Assumptions

Robust Statistics Toolbox: Trimmed Means, Winsorization, and Rank Methods

A practical guide to robust statistical methods that work without normality assumptions. Learn when to use trimmed means, Winsorization, M-estimators, and rank-based tests.

Jan 269 min readstatstest_flow Assumptions Supporting

Robust Statistics Toolbox: Trimmed Means, Winsorization, and Rank Methods

Quick Hits

•Robust methods sacrifice a little efficiency for much better protection against outliers
•Trimmed means ignore extreme values; Winsorized means replace them
•20% trimming is a common default that works well across situations
•Rank-based tests (Mann-Whitney, Wilcoxon) are robust but test different hypotheses

TL;DR

Robust statistics protect your analysis from outliers and non-normality. Trimmed means ignore a percentage of extreme values; Winsorized means replace them with boundary values. Both reduce outlier influence while maintaining reasonable efficiency. For inference, Yuen's test compares trimmed means, and bootstrap provides flexible confidence intervals. When data quality is uncertain, robust methods are cheap insurance.

Why Robust Methods?

The Outlier Problem

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

def demonstrate_outlier_impact():
    """
    Show how outliers distort standard statistics.
    """
    np.random.seed(42)

    # Clean data
    clean_data = np.random.normal(50, 10, 100)

    # Same data with outliers
    contaminated = clean_data.copy()
    contaminated[0] = 500  # One extreme outlier

    print("Impact of a Single Outlier:")
    print("-" * 50)
    print(f"{'Statistic':<20} {'Clean':>12} {'Contaminated':>12} {'Change':>10}")
    print("-" * 50)

    stats_compare = [
        ('Mean', np.mean, clean_data, contaminated),
        ('Median', np.median, clean_data, contaminated),
        ('Std Dev', lambda x: np.std(x, ddof=1), clean_data, contaminated),
        ('IQR', stats.iqr, clean_data, contaminated),
        ('Trimmed Mean (10%)', lambda x: stats.trim_mean(x, 0.1), clean_data, contaminated)
    ]

    for name, func, c, cont in stats_compare:
        clean_val = func(c)
        cont_val = func(cont)
        pct_change = (cont_val - clean_val) / clean_val * 100
        print(f"{name:<20} {clean_val:>12.2f} {cont_val:>12.2f} {pct_change:>+9.1f}%")


demonstrate_outlier_impact()

Breakdown Point

The breakdown point is the proportion of data that can be arbitrarily corrupted before a statistic becomes meaningless.

Higher breakdown point = more robust:

Mean: 0% — a single outlier can destroy it
Median: 50% — half the data can be corrupted
Trimmed mean (20%): 20%
Winsorized mean (20%): 20%
Huber M-estimator: ~50%

Trimmed Means

How They Work

Trimmed means remove a percentage from each tail before averaging.

from scipy.stats import trim_mean

def trimmed_mean_explanation(data, trim_proportion=0.2):
    """
    Step-by-step trimmed mean calculation.
    """
    n = len(data)
    k = int(n * trim_proportion)  # Number to trim from each side

    sorted_data = np.sort(data)
    trimmed = sorted_data[k:n-k]

    print(f"Trimmed Mean ({trim_proportion:.0%} trim):")
    print("-" * 40)
    print(f"Original n: {n}")
    print(f"Trim {k} from each tail")
    print(f"Remaining n: {len(trimmed)}")
    print()
    print(f"Values trimmed from bottom: {sorted_data[:k]}")
    print(f"Values trimmed from top: {sorted_data[n-k:]}")
    print()
    print(f"Standard mean: {np.mean(data):.2f}")
    print(f"Trimmed mean: {np.mean(trimmed):.2f}")

    return np.mean(trimmed)


# Example
np.random.seed(42)
data = np.concatenate([
    np.random.normal(50, 10, 90),
    np.array([150, 200, -50, 300, -30])  # Outliers
])

trimmed_mean_explanation(data, 0.1)

Choosing Trim Proportion

def trim_proportion_comparison(data):
    """
    Compare different trimming levels.
    """
    trim_levels = [0, 0.05, 0.10, 0.15, 0.20, 0.25]

    print("Trimmed Mean by Trim Level:")
    print("-" * 40)
    print(f"{'Trim %':>8} {'Trimmed Mean':>15} {'SE Estimate':>15}")
    print("-" * 40)

    for trim in trim_levels:
        tm = trim_mean(data, trim)
        # Approximate SE for trimmed mean
        n = len(data)
        k = int(n * trim)
        sorted_data = np.sort(data)
        trimmed = sorted_data[k:n-k] if k > 0 else sorted_data
        se = np.std(trimmed, ddof=1) / np.sqrt(len(trimmed))

        print(f"{trim:>7.0%} {tm:>15.2f} {se:>15.2f}")

    print()
    print("Common choices:")
    print("  10%: Mild protection, high efficiency")
    print("  20%: Good balance (recommended default)")
    print("  25%: Strong protection, lower efficiency")


np.random.seed(42)
data = np.concatenate([
    np.random.normal(100, 20, 80),
    np.random.exponential(100, 20)  # Add right skew
])
trim_proportion_comparison(data)

Winsorized Means

How They Work

Winsorized means replace extreme values instead of removing them.

from scipy.stats import mstats

def winsorized_mean_explanation(data, limits=0.1):
    """
    Step-by-step Winsorized mean calculation.
    """
    n = len(data)
    k = int(n * limits)

    sorted_data = np.sort(data)
    lower_bound = sorted_data[k]
    upper_bound = sorted_data[n-k-1]

    winsorized = np.clip(data, lower_bound, upper_bound)

    print(f"Winsorized Mean ({limits:.0%} each tail):")
    print("-" * 40)
    print(f"Original n: {n}")
    print(f"Lower bound (k={k}th value): {lower_bound:.2f}")
    print(f"Upper bound: {upper_bound:.2f}")
    print()
    print("Values below lower bound → replaced with lower bound")
    print("Values above upper bound → replaced with upper bound")
    print()
    print(f"Standard mean: {np.mean(data):.2f}")
    print(f"Winsorized mean: {np.mean(winsorized):.2f}")

    # Compare to trimmed
    print(f"Trimmed mean ({limits:.0%}): {trim_mean(data, limits):.2f}")

    return np.mean(winsorized)


# Example
np.random.seed(42)
data = np.concatenate([
    np.random.normal(50, 10, 95),
    np.array([150, 200, 250, -20, -40])  # Outliers
])

winsorized_mean_explanation(data, 0.05)

Winsorized Variance

def winsorized_variance(data, limits=0.1):
    """
    Calculate Winsorized variance for standard error estimation.
    """
    n = len(data)
    k = int(n * limits)

    sorted_data = np.sort(data)
    lower = sorted_data[k]
    upper = sorted_data[n-k-1]

    winsorized = np.clip(data, lower, upper)
    wvar = np.var(winsorized, ddof=1)

    # Standard error of Winsorized mean
    se = np.sqrt(wvar / n)

    return {
        'winsorized_mean': np.mean(winsorized),
        'winsorized_var': wvar,
        'standard_error': se
    }

Yuen's Test for Trimmed Means

Comparing Two Groups

def yuen_test(group1, group2, trim=0.2):
    """
    Yuen's test for comparing trimmed means.
    """
    n1, n2 = len(group1), len(group2)
    k1, k2 = int(n1 * trim), int(n2 * trim)

    # Trimmed means
    tm1 = trim_mean(group1, trim)
    tm2 = trim_mean(group2, trim)

    # Winsorized data for variance estimation
    sorted1 = np.sort(group1)
    sorted2 = np.sort(group2)

    w1 = np.clip(group1, sorted1[k1], sorted1[n1-k1-1])
    w2 = np.clip(group2, sorted2[k2], sorted2[n2-k2-1])

    # Winsorized variances
    wvar1 = np.var(w1, ddof=1)
    wvar2 = np.var(w2, ddof=1)

    # Effective sample sizes
    h1 = n1 - 2 * k1
    h2 = n2 - 2 * k2

    # Standard error of difference
    se = np.sqrt(wvar1 / h1 + wvar2 / h2)

    # Test statistic
    t_stat = (tm1 - tm2) / se

    # Approximate degrees of freedom (Welch-Satterthwaite)
    num = (wvar1/h1 + wvar2/h2)**2
    denom = (wvar1/h1)**2/(h1-1) + (wvar2/h2)**2/(h2-1)
    df = num / denom

    # P-value
    p_value = 2 * (1 - stats.t.cdf(abs(t_stat), df))

    return {
        'trimmed_mean_1': tm1,
        'trimmed_mean_2': tm2,
        'difference': tm1 - tm2,
        't_statistic': t_stat,
        'df': df,
        'p_value': p_value,
        'trim_proportion': trim
    }


# Example
np.random.seed(42)
control = np.concatenate([np.random.normal(50, 10, 40), [150, 200]])
treatment = np.concatenate([np.random.normal(55, 10, 40), [160, 180]])

# Standard t-test
t_standard, p_standard = stats.ttest_ind(control, treatment)

# Yuen's test
yuen_result = yuen_test(control, treatment, trim=0.2)

print("Comparison: Standard t-test vs. Yuen's Test")
print("-" * 50)
print(f"\nStandard t-test:")
print(f"  Means: {np.mean(control):.2f} vs {np.mean(treatment):.2f}")
print(f"  t = {t_standard:.3f}, p = {p_standard:.4f}")
print(f"\nYuen's test (20% trimmed):")
print(f"  Trimmed means: {yuen_result['trimmed_mean_1']:.2f} vs {yuen_result['trimmed_mean_2']:.2f}")
print(f"  t = {yuen_result['t_statistic']:.3f}, p = {yuen_result['p_value']:.4f}")

Rank-Based Methods

Overview

Mann-Whitney U — Compares two independent groups. Tests stochastic dominance (not just medians). Assumes independence and similar shape (for median interpretation).

Wilcoxon Signed-Rank — Compares paired differences. Tests symmetry around zero. Assumes independence of pairs and symmetric differences.

Kruskal-Wallis — Compares multiple independent groups. Tests identical distributions. Assumes independence and similar shape across groups.

def compare_parametric_nonparametric(group1, group2):
    """
    Compare parametric and rank-based tests.
    """
    # t-test
    t_stat, p_t = stats.ttest_ind(group1, group2)

    # Mann-Whitney
    u_stat, p_mw = stats.mannwhitneyu(group1, group2, alternative='two-sided')

    print("Parametric vs. Rank-Based:")
    print("-" * 40)
    print(f"t-test: t = {t_stat:.3f}, p = {p_t:.4f}")
    print(f"Mann-Whitney U: U = {u_stat:.0f}, p = {p_mw:.4f}")
    print()
    print("Descriptives:")
    print(f"  Group 1: mean = {np.mean(group1):.2f}, median = {np.median(group1):.2f}")
    print(f"  Group 2: mean = {np.mean(group2):.2f}, median = {np.median(group2):.2f}")


# Example with skewed data
np.random.seed(42)
g1 = np.random.exponential(10, 50)
g2 = np.random.exponential(15, 50)

compare_parametric_nonparametric(g1, g2)

When Rank Tests Are Better

def rank_test_advantages():
    """
    Demonstrate when rank tests outperform parametric tests.
    """
    np.random.seed(42)
    n_sims = 5000

    # Scenario: Heavy-tailed data (Cauchy)
    t_reject = 0
    mw_reject = 0

    for _ in range(n_sims):
        # Both from same Cauchy (null true)
        g1 = stats.cauchy.rvs(size=30)
        g2 = stats.cauchy.rvs(size=30)

        _, p_t = stats.ttest_ind(g1, g2)
        _, p_mw = stats.mannwhitneyu(g1, g2, alternative='two-sided')

        if p_t < 0.05:
            t_reject += 1
        if p_mw < 0.05:
            mw_reject += 1

    print("Type I Error with Heavy-Tailed Data (Cauchy):")
    print("-" * 40)
    print(f"t-test: {t_reject/n_sims:.3f} (nominal: 0.05)")
    print(f"Mann-Whitney: {mw_reject/n_sims:.3f}")
    print()
    print("The t-test's Type I error is inflated because")
    print("the Cauchy distribution has no finite mean/variance.")


rank_test_advantages()

M-Estimators (Huber)

The Huber Estimator

def huber_estimator(data, k=1.345, max_iter=100, tol=1e-6):
    """
    Huber M-estimator of location.
    """
    # Initial estimate (median)
    mu = np.median(data)
    mad = np.median(np.abs(data - mu))
    scale = mad * 1.4826  # Consistent estimate of sigma

    for iteration in range(max_iter):
        # Calculate weights
        residuals = (data - mu) / scale
        weights = np.where(np.abs(residuals) <= k, 1, k / np.abs(residuals))

        # Weighted mean
        mu_new = np.average(data, weights=weights)

        if abs(mu_new - mu) < tol:
            break
        mu = mu_new

    return {
        'estimate': mu,
        'iterations': iteration + 1,
        'scale': scale
    }


# Example
np.random.seed(42)
data = np.concatenate([
    np.random.normal(100, 10, 95),
    np.array([500, 600, -100, 700, -200])  # Severe outliers
])

huber_result = huber_estimator(data)

print("Huber M-Estimator:")
print("-" * 40)
print(f"Mean: {np.mean(data):.2f}")
print(f"Median: {np.median(data):.2f}")
print(f"Huber estimate: {huber_result['estimate']:.2f}")
print(f"Iterations: {huber_result['iterations']}")

Bootstrap for Robust Inference

Bootstrap Trimmed Mean

def bootstrap_trimmed_mean_ci(data, trim=0.2, n_bootstrap=5000, alpha=0.05):
    """
    Bootstrap confidence interval for trimmed mean.
    """
    n = len(data)
    boot_means = []

    for _ in range(n_bootstrap):
        boot_sample = np.random.choice(data, size=n, replace=True)
        boot_means.append(trim_mean(boot_sample, trim))

    # Percentile CI
    ci_low = np.percentile(boot_means, 100 * alpha / 2)
    ci_high = np.percentile(boot_means, 100 * (1 - alpha / 2))

    return {
        'trimmed_mean': trim_mean(data, trim),
        'ci_low': ci_low,
        'ci_high': ci_high,
        'bootstrap_se': np.std(boot_means, ddof=1)
    }


# Example
np.random.seed(42)
data = np.concatenate([
    np.random.exponential(50, 80),
    np.array([500, 600, 700])  # Outliers
])

boot_result = bootstrap_trimmed_mean_ci(data, trim=0.2)

print("Bootstrap CI for Trimmed Mean:")
print("-" * 40)
print(f"Standard mean: {np.mean(data):.2f}")
print(f"Trimmed mean (20%): {boot_result['trimmed_mean']:.2f}")
print(f"95% Bootstrap CI: [{boot_result['ci_low']:.2f}, {boot_result['ci_high']:.2f}]")
print(f"Bootstrap SE: {boot_result['bootstrap_se']:.2f}")

Practical Workflow

def robust_analysis_workflow(group1, group2, names=('Group 1', 'Group 2')):
    """
    Complete robust analysis workflow.
    """
    print("=" * 60)
    print("ROBUST ANALYSIS WORKFLOW")
    print("=" * 60)

    # 1. Descriptives with robust measures
    print("\n1. DESCRIPTIVE STATISTICS")
    print("-" * 40)
    for name, data in [(names[0], group1), (names[1], group2)]:
        print(f"\n{name}:")
        print(f"  n: {len(data)}")
        print(f"  Mean: {np.mean(data):.2f}")
        print(f"  Median: {np.median(data):.2f}")
        print(f"  Trimmed Mean (20%): {trim_mean(data, 0.2):.2f}")
        print(f"  SD: {np.std(data, ddof=1):.2f}")
        print(f"  MAD: {stats.median_abs_deviation(data):.2f}")

    # 2. Check for outliers
    print("\n\n2. OUTLIER CHECK")
    print("-" * 40)
    for name, data in [(names[0], group1), (names[1], group2)]:
        z_scores = np.abs(stats.zscore(data))
        n_outliers = np.sum(z_scores > 3)
        print(f"{name}: {n_outliers} potential outliers (|z| > 3)")

    # 3. Compare methods
    print("\n\n3. ANALYSIS COMPARISON")
    print("-" * 40)

    # Standard t-test
    t_stat, p_t = stats.ttest_ind(group1, group2)
    mean_diff = np.mean(group1) - np.mean(group2)

    # Yuen's test
    yuen = yuen_test(group1, group2, trim=0.2)

    # Mann-Whitney
    u_stat, p_mw = stats.mannwhitneyu(group1, group2, alternative='two-sided')

    print(f"\n{'Method':<25} {'Statistic':>12} {'p-value':>12}")
    print("-" * 50)
    print(f"{'Standard t-test':<25} {t_stat:>12.3f} {p_t:>12.4f}")
    print(f"{'Yuens test (20% trim)':<25} {yuen['t_statistic']:>12.3f} {yuen['p_value']:>12.4f}")
    print(f"{'Mann-Whitney U':<25} {u_stat:>12.0f} {p_mw:>12.4f}")

    # 4. Recommendation
    print("\n\n4. EFFECT SIZE")
    print("-" * 40)
    print(f"Mean difference: {mean_diff:.2f}")
    print(f"Trimmed mean difference: {yuen['difference']:.2f}")
    print(f"Median difference: {np.median(group1) - np.median(group2):.2f}")

    print("\n" + "=" * 60)


# Example
np.random.seed(42)
control = np.concatenate([np.random.normal(100, 15, 45), [250, 300, 280, 350, 400]])
treatment = np.concatenate([np.random.normal(110, 15, 45), [280, 320, 350, 290, 380]])

robust_analysis_workflow(control, treatment, ('Control', 'Treatment'))

R Implementation

# Robust analysis in R

library(WRS2)  # Wilcox Robust Statistics

robust_workflow <- function(group1, group2) {
  cat("ROBUST ANALYSIS\n")
  cat(rep("=", 50), "\n\n")

  # Descriptives
  cat("Descriptives:\n")
  cat(sprintf("Group 1: Mean=%.2f, Median=%.2f, Trimmed=%.2f\n",
              mean(group1), median(group1), mean(group1, trim=0.2)))
  cat(sprintf("Group 2: Mean=%.2f, Median=%.2f, Trimmed=%.2f\n",
              mean(group2), median(group2), mean(group2, trim=0.2)))

  # Standard t-test
  t_result <- t.test(group1, group2)

  # Yuen's test (from WRS2)
  df <- data.frame(
    value = c(group1, group2),
    group = factor(rep(1:2, c(length(group1), length(group2))))
  )
  yuen_result <- yuen(value ~ group, data = df, tr = 0.2)

  # Mann-Whitney
  mw_result <- wilcox.test(group1, group2)

  cat("\nResults:\n")
  cat(sprintf("t-test: t=%.3f, p=%.4f\n", t_result$statistic, t_result$p.value))
  cat(sprintf("Yuen's: t=%.3f, p=%.4f\n", yuen_result$test, yuen_result$p.value))
  cat(sprintf("Mann-Whitney: W=%.0f, p=%.4f\n",
              mw_result$statistic, mw_result$p.value))
}

# Usage:
# group1 <- c(rnorm(45, 100, 15), 250, 300, 280, 350, 400)
# group2 <- c(rnorm(45, 110, 15), 280, 320, 350, 290, 380)
# robust_workflow(group1, group2)

Assumption Checks Master Guide — The pillar article
Handling Outliers — Outlier strategies
Bootstrap CIs — Bootstrap methods
Non-Normal Metrics — Approaches for skewed data

Key Takeaway

Robust methods are cheap insurance against assumption violations. Trimmed means with 20% trimming provide excellent protection against outliers while maintaining high efficiency when data is clean. Yuen's test extends this to two-group comparisons. When in doubt, compare robust and standard results—if they agree, you can be more confident in your conclusions.

References

https://www.jstor.org/stable/2289615
https://psycnet.apa.org/record/2012-10073-001
Wilcox, R. R. (2017). *Introduction to Robust Estimation and Hypothesis Testing* (4th ed.). Academic Press.
Huber, P. J., & Ronchetti, E. M. (2009). *Robust Statistics* (2nd ed.). Wiley.
Keselman, H. J., Wilcox, R. R., Othman, A. R., & Fradette, K. (2002). Trimming, transforming statistics, and bootstrapping: Circumventing the biasing effects of heteroscedasticity and nonnormality. *Journal of Modern Applied Statistical Methods*, 1(2), 288-309.

Frequently Asked Questions

When should I use robust methods?

When you have outliers, heavy-tailed distributions, or small samples where normality can't be verified. Robust methods provide insurance against assumption violations with minimal cost when assumptions hold.

What's the difference between trimming and Winsorizing?

Trimming removes extreme values entirely. Winsorizing replaces them with the nearest non-extreme value. Both reduce outlier influence, but Winsorizing preserves sample size.

Do robust methods require special software?

Basic trimmed means are in most packages. For Yuen's test and bootstrap trimmed mean inference, you may need specialized packages (WRS2 in R, scipy.stats.trim_mean in Python).

Key Takeaway

Robust methods protect against outliers and non-normality with minimal cost when assumptions hold. Trimmed means (ignoring extremes) and Winsorized means (replacing extremes) are intuitive and effective. When in doubt, use robust methods—they rarely hurt and often help.

Send to a friend

Share this with someone who loves clean statistical work.

Facebook X Reddit LinkedIn Email