Bootstrap for Heavy-Tailed Metrics: Best Practices and Gotchas

How to use bootstrap correctly for revenue, engagement, and other heavy-tailed metrics. Learn about BCa intervals, when bootstrap fails, and how many resamples you actually need.

Quick Hits

  • Bootstrap isn't magic—it inherits problems from your sample
  • BCa (bias-corrected accelerated) intervals outperform percentile method for skewed data
  • Heavy tails need more resamples: 5000+ for CI bounds, 10000+ for p-values
  • Small samples + heavy tails = unreliable bootstrap (need n > 50-100)
  • Always check bootstrap distribution shape—if bimodal or weird, investigate

TL;DR

Bootstrap is the go-to method for inference on heavy-tailed metrics like revenue—it makes no distributional assumptions and handles skew naturally. But bootstrap has limitations: it requires your sample to represent the population, struggles with small samples and extreme tails, and needs careful implementation (BCa intervals, enough resamples). This guide covers best practices for bootstrapping heavy-tailed product metrics.


Bootstrap Basics Review

The Core Idea

  1. Treat your sample as the population
  2. Resample with replacement (same size as original)
  3. Compute your statistic on each resample
  4. Use the distribution of resampled statistics for inference

import numpy as np


def simple_bootstrap(data, statistic_fn, n_bootstrap=2000):
    """
    Basic bootstrap for any statistic.

    Parameters:
    -----------
    data : array-like
        Original sample
    statistic_fn : callable
        Function to compute statistic (e.g., np.mean)
    n_bootstrap : int
        Number of resamples

    Returns:
    --------
    array of bootstrap statistics
    """
    data = np.asarray(data)
    n = len(data)

    bootstrap_stats = []
    for _ in range(n_bootstrap):
        resample = np.random.choice(data, size=n, replace=True)
        bootstrap_stats.append(statistic_fn(resample))

    return np.array(bootstrap_stats)


# Example
np.random.seed(42)
revenue = np.random.lognormal(3, 1.5, 500)

boot_means = simple_bootstrap(revenue, np.mean, n_bootstrap=2000)

print(f"Original mean: ${np.mean(revenue):.2f}")
print(f"Bootstrap SE: ${np.std(boot_means):.2f}")
print(f"95% CI (percentile): (${np.percentile(boot_means, 2.5):.2f}, ${np.percentile(boot_means, 97.5):.2f})")

Why Bootstrap for Heavy Tails?

Standard Methods Fail

For heavy-tailed data (the quick check after this list makes the problem concrete):

  • CLT convergence is slow: At practical sample sizes, the sample mean's distribution is still noticeably skewed
  • Variance estimates are unstable: The SE depends heavily on a few extreme values
  • Parametric assumptions are wrong: The data isn't normal
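
To make the slow convergence concrete, here's a minimal sketch that estimates the coverage of the textbook t-based 95% CI on lognormal data (μ = 3, σ = 1.5, mirroring this post's examples; exact numbers vary by seed, but coverage comes in noticeably below the nominal 95%):

import numpy as np
from scipy import stats


def t_ci_coverage(mu=3.0, sigma=1.5, n=100, n_sims=2000, seed=0):
    """Fraction of t-based 95% CIs that cover the true lognormal mean."""
    rng = np.random.default_rng(seed)
    true_mean = np.exp(mu + sigma**2 / 2)  # analytical lognormal mean
    hits = 0
    for _ in range(n_sims):
        x = rng.lognormal(mu, sigma, n)
        half_width = stats.t.ppf(0.975, n - 1) * x.std(ddof=1) / np.sqrt(n)
        hits += (x.mean() - half_width) <= true_mean <= (x.mean() + half_width)
    return hits / n_sims


print(f"t-interval coverage at n=100 (nominal 0.95): {t_ci_coverage():.3f}")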

Bootstrap Advantages

  1. No distributional assumptions: Works with any shape
  2. Handles skew naturally: CI can be asymmetric
  3. Works for any statistic: Mean, median, quantiles, ratios
  4. Captures sampling variability: Including from rare extreme values

Bootstrap Limitations

  1. Needs representative sample: Can't create variability not in data
  2. Requires adequate sample size: Small n = unreliable
  3. Computationally intensive: Thousands of resamples needed
  4. Not magic: Inherits sample biases

Confidence Interval Methods

Percentile Method

Simplest approach: CI bounds are percentiles of bootstrap distribution.

def percentile_ci(bootstrap_stats, alpha=0.05):
    """Simple percentile confidence interval."""
    lower = np.percentile(bootstrap_stats, 100 * alpha / 2)
    upper = np.percentile(bootstrap_stats, 100 * (1 - alpha / 2))
    return lower, upper

Pros: Simple, intuitive.
Cons: Can have poor coverage for skewed statistics.

BCa (Bias-Corrected and Accelerated)

Corrects for:

  • Bias: When bootstrap distribution center ≠ original estimate
  • Acceleration: When SE varies with parameter value

from scipy import stats


def bca_ci(data, statistic_fn, n_bootstrap=2000, alpha=0.05):
    """
    BCa bootstrap confidence interval.

    More accurate than percentile method for skewed distributions.
    """
    data = np.asarray(data)
    n = len(data)

    # Original estimate
    theta_hat = statistic_fn(data)

    # Bootstrap distribution
    theta_boot = []
    for _ in range(n_bootstrap):
        resample = np.random.choice(data, size=n, replace=True)
        theta_boot.append(statistic_fn(resample))
    theta_boot = np.array(theta_boot)

    # Bias correction: z0
    prop_less = np.mean(theta_boot < theta_hat)
    z0 = stats.norm.ppf(prop_less) if 0 < prop_less < 1 else 0

    # Acceleration: a (from jackknife)
    theta_jack = np.array([
        statistic_fn(np.delete(data, i)) for i in range(n)
    ])
    theta_jack_mean = theta_jack.mean()
    num = np.sum((theta_jack_mean - theta_jack) ** 3)
    denom = 6 * (np.sum((theta_jack_mean - theta_jack) ** 2) ** 1.5)
    a = num / denom if denom != 0 else 0

    # Adjusted percentiles
    z_alpha_lower = stats.norm.ppf(alpha / 2)
    z_alpha_upper = stats.norm.ppf(1 - alpha / 2)

    def adjusted_percentile(z_alpha):
        numerator = z0 + z_alpha
        denominator = 1 - a * numerator
        return stats.norm.cdf(z0 + numerator / denominator)

    p_lower = adjusted_percentile(z_alpha_lower)
    p_upper = adjusted_percentile(z_alpha_upper)

    # Bounds from bootstrap distribution
    ci_lower = np.percentile(theta_boot, 100 * p_lower)
    ci_upper = np.percentile(theta_boot, 100 * p_upper)

    return {
        'estimate': theta_hat,
        'ci_lower': ci_lower,
        'ci_upper': ci_upper,
        'bias_correction': z0,
        'acceleration': a,
        'bootstrap_se': np.std(theta_boot)
    }


# Compare methods on skewed data
np.random.seed(42)
revenue = np.random.lognormal(3, 1.5, 200)

# Percentile
boot_stats = simple_bootstrap(revenue, np.mean, n_bootstrap=5000)
pct_ci = percentile_ci(boot_stats)

# BCa
bca_result = bca_ci(revenue, np.mean, n_bootstrap=5000)

print("Confidence Interval Comparison")
print("=" * 50)
print(f"Sample mean: ${np.mean(revenue):.2f}")
print(f"\nPercentile 95% CI: (${pct_ci[0]:.2f}, ${pct_ci[1]:.2f})")
print(f"  Width: ${pct_ci[1] - pct_ci[0]:.2f}")
print(f"\nBCa 95% CI: (${bca_result['ci_lower']:.2f}, ${bca_result['ci_upper']:.2f})")
print(f"  Width: ${bca_result['ci_upper'] - bca_result['ci_lower']:.2f}")
print(f"  Bias correction (z0): {bca_result['bias_correction']:.3f}")
print(f"  Acceleration (a): {bca_result['acceleration']:.4f}")

How Many Bootstrap Samples?

Guidelines

Purpose           Minimum    Recommended    Heavy Tails
Standard error    200        500            1000
95% CI bounds     1000       2000           5000
99% CI bounds     2000       5000           10000
p-values          5000       10000          20000

Monte Carlo Error

Bootstrap estimates are themselves random: rerun the same bootstrap with a different seed and the CI bounds shift slightly. This Monte Carlo error shrinks roughly like 1/√B, so quadrupling the number of resamples halves it.

def bootstrap_stability_check(data, statistic_fn, resample_sizes=[500, 1000, 2000, 5000, 10000]):
    """
    Check how stable bootstrap CI is across different resample sizes.
    """
    results = []

    for B in resample_sizes:
        # Run 10 times to see variability
        cis = []
        for _ in range(10):
            boot_stats = simple_bootstrap(data, statistic_fn, n_bootstrap=B)
            ci = percentile_ci(boot_stats)
            cis.append(ci)

        cis = np.array(cis)
        lower_std = cis[:, 0].std()
        upper_std = cis[:, 1].std()

        results.append({
            'B': B,
            'lower_bound_sd': lower_std,
            'upper_bound_sd': upper_std
        })

    return results


# Example: Heavy-tailed revenue
np.random.seed(42)
revenue = np.random.lognormal(3, 1.5, 300)

stability = bootstrap_stability_check(revenue, np.mean)
print("Bootstrap CI Stability (SD of bounds across 10 runs)")
print("=" * 50)
print(f"{'B':<10} {'Lower SD':<15} {'Upper SD':<15}")
print("-" * 50)
for r in stability:
    print(f"{r['B']:<10} ${r['lower_bound_sd']:<14.2f} ${r['upper_bound_sd']:<14.2f}")

When Bootstrap Fails

Problem 1: Sample Doesn't Represent Population

If your sample misses rare extreme values, bootstrap can't recover them.

def demonstrate_missing_extremes(true_sigma=2.0, sample_size=100, n_simulations=500):
    """
    Show how bootstrap misses extreme values in small samples.
    """
    np.random.seed(42)

    coverage = []
    ci_widths = []

    for _ in range(n_simulations):
        # Generate heavy-tailed data
        sample = np.random.lognormal(3, true_sigma, sample_size)

        # True population mean (analytical)
        true_mean = np.exp(3 + true_sigma**2 / 2)

        # Bootstrap CI
        result = bca_ci(sample, np.mean, n_bootstrap=2000)

        # Check coverage
        covered = result['ci_lower'] <= true_mean <= result['ci_upper']
        coverage.append(covered)
        ci_widths.append(result['ci_upper'] - result['ci_lower'])

    return {
        'coverage': np.mean(coverage) * 100,
        'mean_width': np.mean(ci_widths),
        'true_mean': np.exp(3 + true_sigma**2 / 2)
    }


# Compare coverage at different sample sizes
print("Bootstrap Coverage by Sample Size (Heavy-Tailed Data)")
print("=" * 60)
print(f"{'n':<10} {'Coverage':<15} {'Mean CI Width':<20}")
print("-" * 60)

for n in [50, 100, 200, 500, 1000]:
    result = demonstrate_missing_extremes(true_sigma=2.0, sample_size=n, n_simulations=300)
    print(f"{n:<10} {result['coverage']:.1f}%{'':<10} ${result['mean_width']:.2f}")

print(f"\nTrue mean: ${np.exp(3 + 2**2/2):.2f}")
print("Note: 95% CI should cover 95% of the time. Undercoverage indicates bootstrap failure.")

Problem 2: Non-Smooth Statistics

Bootstrap struggles with:

  • Maximum/minimum: Discrete bootstrap distribution
  • Extreme quantiles with small n: Few observations in region
  • Discontinuous statistics: Mode, indicators

# Example: Bootstrap for max is problematic
np.random.seed(42)
data = np.random.lognormal(3, 1, 100)

boot_max = simple_bootstrap(data, np.max, n_bootstrap=2000)

print("Bootstrap for Maximum (Problematic)")
print("=" * 50)
print(f"Sample max: ${np.max(data):.2f}")
print(f"Bootstrap distribution of max:")
print(f"  Unique values: {len(np.unique(boot_max))}")
print(f"  (Should be continuous for valid CI)")

Problem 3: Very Small Samples

Rule of thumb: n > 50 for basic bootstrap, n > 100 for heavy tails.
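
To put numbers on that rule, here's a minimal sketch that extends the coverage check from Problem 1 down to smaller samples, using the faster percentile CI (settings are illustrative; exact coverage varies by seed):

def percentile_coverage(n, mu=3.0, sigma=1.5, n_sims=500, n_bootstrap=2000, seed=1):
    """Fraction of percentile 95% CIs that cover the true lognormal mean."""
    rng = np.random.default_rng(seed)
    true_mean = np.exp(mu + sigma**2 / 2)
    hits = 0
    for _ in range(n_sims):
        sample = rng.lognormal(mu, sigma, n)
        boot_means = np.array([
            rng.choice(sample, size=n, replace=True).mean()
            for _ in range(n_bootstrap)
        ])
        lo, hi = np.percentile(boot_means, [2.5, 97.5])
        hits += lo <= true_mean <= hi
    return hits / n_sims


for n in [20, 50, 200]:
    print(f"n = {n:>3}: empirical coverage {percentile_coverage(n):.2f}")

Expect coverage to fall well short of 95% at n = 20 and improve steadily as n grows.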


Comparing Groups with Bootstrap

Two-Sample Bootstrap

def bootstrap_two_sample(group1, group2, statistic='mean_diff', n_bootstrap=5000):
    """
    Bootstrap comparison of two groups.

    Parameters:
    -----------
    group1, group2 : arrays
        Data for each group
    statistic : str
        'mean_diff', 'ratio', or 'lift'
    n_bootstrap : int
        Number of resamples

    Returns:
    --------
    dict with estimate, CI, and p-value
    """
    g1 = np.asarray(group1)
    g2 = np.asarray(group2)

    # Point estimates
    if statistic == 'mean_diff':
        point_est = np.mean(g2) - np.mean(g1)
    elif statistic == 'ratio':
        point_est = np.mean(g2) / np.mean(g1)
    elif statistic == 'lift':
        point_est = (np.mean(g2) - np.mean(g1)) / np.mean(g1)
    else:
        raise ValueError(f"Unknown statistic: {statistic}")

    # Bootstrap
    boot_stats = []
    for _ in range(n_bootstrap):
        boot_g1 = np.random.choice(g1, len(g1), replace=True)
        boot_g2 = np.random.choice(g2, len(g2), replace=True)

        if statistic == 'mean_diff':
            boot_stat = np.mean(boot_g2) - np.mean(boot_g1)
        elif statistic == 'ratio':
            boot_stat = np.mean(boot_g2) / np.mean(boot_g1) if np.mean(boot_g1) > 0 else np.nan
        elif statistic == 'lift':
            boot_stat = (np.mean(boot_g2) - np.mean(boot_g1)) / np.mean(boot_g1) if np.mean(boot_g1) > 0 else np.nan

        if not np.isnan(boot_stat):
            boot_stats.append(boot_stat)

    boot_stats = np.array(boot_stats)

    # CI
    ci = np.percentile(boot_stats, [2.5, 97.5])

    # Approximate two-sided p-value by CI inversion: how extreme is the null
    # value (0 for mean_diff/lift, 1 for ratio) within the bootstrap distribution?
    null_value = 0 if statistic in ['mean_diff', 'lift'] else 1
    p_value = 2 * min(
        np.mean(boot_stats <= null_value),
        np.mean(boot_stats >= null_value)
    )

    return {
        'estimate': point_est,
        'ci_lower': ci[0],
        'ci_upper': ci[1],
        'p_value': min(p_value, 1.0),
        'bootstrap_se': np.std(boot_stats)
    }


# Example: Revenue comparison
np.random.seed(42)

control = np.random.lognormal(3, 1.2, 1500)
treatment = np.random.lognormal(3.1, 1.2, 1500)  # 10% higher geometric mean

print("Bootstrap Two-Sample Comparison")
print("=" * 60)

for stat in ['mean_diff', 'ratio', 'lift']:
    result = bootstrap_two_sample(control, treatment, statistic=stat, n_bootstrap=5000)
    print(f"\n{stat.replace('_', ' ').title()}:")
    if stat == 'mean_diff':
        print(f"  Estimate: ${result['estimate']:.2f}")
        print(f"  95% CI: (${result['ci_lower']:.2f}, ${result['ci_upper']:.2f})")
    elif stat == 'ratio':
        print(f"  Estimate: {result['estimate']:.3f}")
        print(f"  95% CI: ({result['ci_lower']:.3f}, {result['ci_upper']:.3f})")
    else:
        print(f"  Estimate: {result['estimate']:.1%}")
        print(f"  95% CI: ({result['ci_lower']:.1%}, {result['ci_upper']:.1%})")
    print(f"  p-value: {result['p_value']:.4f}")

Visualizing Bootstrap Results

import matplotlib.pyplot as plt


def plot_bootstrap_diagnostics(data, statistic_fn, statistic_name='Statistic', n_bootstrap=5000):
    """
    Diagnostic plots for bootstrap.
    """
    boot_stats = simple_bootstrap(data, statistic_fn, n_bootstrap)
    point_est = statistic_fn(data)

    fig, axes = plt.subplots(1, 3, figsize=(15, 4))

    # 1. Bootstrap distribution
    ax1 = axes[0]
    ax1.hist(boot_stats, bins=50, edgecolor='white', alpha=0.7)
    ax1.axvline(point_est, color='red', linestyle='--', label='Original estimate')
    ax1.axvline(np.mean(boot_stats), color='blue', linestyle=':', label='Bootstrap mean')

    ci = np.percentile(boot_stats, [2.5, 97.5])
    ax1.axvline(ci[0], color='green', linestyle='-', label='95% CI')
    ax1.axvline(ci[1], color='green', linestyle='-')

    ax1.set_xlabel(statistic_name)
    ax1.set_ylabel('Count')
    ax1.set_title('Bootstrap Distribution')
    ax1.legend()

    # 2. Q-Q plot against normal
    ax2 = axes[1]
    from scipy import stats
    stats.probplot(boot_stats, dist="norm", plot=ax2)
    ax2.set_title('Q-Q Plot (vs. Normal)')

    # 3. Running CI stability
    ax3 = axes[2]
    cumulative_lower = []
    cumulative_upper = []
    check_points = range(100, n_bootstrap + 1, 100)

    for b in check_points:
        ci = np.percentile(boot_stats[:b], [2.5, 97.5])
        cumulative_lower.append(ci[0])
        cumulative_upper.append(ci[1])

    ax3.plot(check_points, cumulative_lower, label='Lower bound')
    ax3.plot(check_points, cumulative_upper, label='Upper bound')
    ax3.axhline(np.percentile(boot_stats, 2.5), color='grey', linestyle='--', alpha=0.5)
    ax3.axhline(np.percentile(boot_stats, 97.5), color='grey', linestyle='--', alpha=0.5)
    ax3.set_xlabel('Number of Bootstrap Samples')
    ax3.set_ylabel('CI Bound')
    ax3.set_title('CI Stability')
    ax3.legend()

    plt.tight_layout()
    return fig


# Example
np.random.seed(42)
revenue = np.random.lognormal(3, 1.5, 500)

fig = plot_bootstrap_diagnostics(revenue, np.mean, 'Mean Revenue ($)', n_bootstrap=5000)
plt.show()

Best Practices Checklist

Before Bootstrapping

  • Sample size adequate (n > 50, ideally > 100)
  • Sample represents population (no selection bias)
  • Statistic is smooth (not max/min/mode)
  • Data is independent (or account for clustering)

During Bootstrapping

  • Use enough resamples (5000+ for CI, 10000+ for p-values)
  • Use BCa for skewed statistics
  • Resample at the correct level (users, not observations, if clustered; see the sketch below)
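
Here's a minimal sketch of that last point: a user-level (cluster) bootstrap. The values and user_ids arrays are hypothetical stand-ins for event-level data (one entry per event, with repeated user_ids for repeat users):

def cluster_bootstrap_mean(values, user_ids, n_bootstrap=5000, seed=0):
    """Bootstrap the mean by resampling whole users, not individual events."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values)
    user_ids = np.asarray(user_ids)
    unique_users = np.unique(user_ids)
    # Index each user's events once, up front
    groups = {u: values[user_ids == u] for u in unique_users}
    boot_means = np.empty(n_bootstrap)
    for b in range(n_bootstrap):
        sampled = rng.choice(unique_users, size=len(unique_users), replace=True)
        boot_means[b] = np.concatenate([groups[u] for u in sampled]).mean()
    return boot_means

Resampling whole users preserves the within-user correlation that observation-level resampling would break, which keeps the SE honest when heavy spenders contribute many events.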

After Bootstrapping

  • Visualize bootstrap distribution (should be smooth)
  • Check CI stability across resample sizes
  • Compare with other methods as a sanity check (quick example below)
  • Report bootstrap SE and CI method used
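
For the sanity-check item, one quick comparison (assuming the revenue sample and bca_result from earlier in this post) is the classical t-interval; rough agreement is reassuring, while large divergence signals that the tails matter:

from scipy import stats

sem = revenue.std(ddof=1) / np.sqrt(len(revenue))
t_ci = stats.t.interval(0.95, len(revenue) - 1, loc=revenue.mean(), scale=sem)
print(f"t-interval:    (${t_ci[0]:.2f}, ${t_ci[1]:.2f})")
print(f"BCa bootstrap: (${bca_result['ci_lower']:.2f}, ${bca_result['ci_upper']:.2f})")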

R Implementation

library(boot)

# Basic bootstrap
boot_fn <- function(data, indices) {
    mean(data[indices])
}

boot_result <- boot(revenue, boot_fn, R = 5000)

# BCa confidence interval
boot.ci(boot_result, type = "bca")

# Two-sample comparison
boot_diff <- function(data, indices) {
    d <- data[indices, ]
    mean(d$revenue[d$group == "treatment"]) -
        mean(d$revenue[d$group == "control"])
}

combined <- data.frame(
    revenue = c(control, treatment),
    group = rep(c("control", "treatment"), c(length(control), length(treatment)))
)

boot_result <- boot(combined, boot_diff, R = 5000)
boot.ci(boot_result, type = "bca")


Key Takeaway

Bootstrap is essential for heavy-tailed metrics—it handles skew naturally and makes no distributional assumptions. But bootstrap isn't magic: it needs adequate sample sizes, enough resamples (5000+ for heavy tails), and appropriate CI methods (BCa for skewed statistics). Always visualize your bootstrap distribution and check stability. When your sample is small or might miss rare extremes, bootstrap CIs will be too narrow—consider this when interpreting results. Used properly, bootstrap provides reliable inference for the difficult metrics that standard methods can't handle.



Frequently Asked Questions

How many bootstrap samples do I need?

For standard errors: 200-500. For 95% CI bounds: 2000-5000. For p-values: 10000+. Heavy-tailed data benefits from more resamples because the bootstrap distribution is itself variable. When in doubt, use more.

Does bootstrap work for any distribution?

Bootstrap requires the sample to represent the population. For heavy-tailed data with small samples, your sample may miss rare large values, making bootstrap CIs too narrow. It also struggles when the statistic isn't smooth (like max or certain quantiles with small n).

What's the difference between percentile and BCa bootstrap?

Percentile method just takes quantiles of the bootstrap distribution. BCa corrects for bias (when bootstrap mean ≠ original estimate) and acceleration (when SE varies with parameter value). BCa has better coverage for skewed distributions but requires more computation.

