Bootstrap for Heavy-Tailed Metrics: Best Practices and Gotchas

How to use bootstrap correctly for revenue, engagement, and other heavy-tailed metrics. Learn about BCa intervals, when bootstrap fails, and how many resamples you actually need.

Quick Hits

  • Bootstrap isn't magic—it inherits problems from your sample
  • BCa (bias-corrected accelerated) intervals outperform percentile method for skewed data
  • Heavy tails need more resamples: 5000+ for CI bounds, 10000+ for p-values
  • Small samples + heavy tails = unreliable bootstrap (need n > 50-100)
  • Always check bootstrap distribution shape—if bimodal or weird, investigate

TL;DR

Bootstrap is the go-to method for inference on heavy-tailed metrics like revenue—it makes no distributional assumptions and handles skew naturally. But bootstrap has limitations: it requires your sample to represent the population, struggles with small samples and extreme tails, and needs careful implementation (BCa intervals, enough resamples). This guide covers best practices for bootstrapping heavy-tailed product metrics.


Bootstrap Basics Review

The Core Idea

  1. Treat your sample as the population
  2. Resample with replacement (same size as original)
  3. Compute your statistic on each resample
  4. Use the distribution of resampled statistics for inference

import numpy as np


def simple_bootstrap(data, statistic_fn, n_bootstrap=2000):
    """
    Basic bootstrap for any statistic.

    Parameters:
    -----------
    data : array-like
        Original sample
    statistic_fn : callable
        Function to compute statistic (e.g., np.mean)
    n_bootstrap : int
        Number of resamples

    Returns:
    --------
    array of bootstrap statistics
    """
    data = np.asarray(data)
    n = len(data)

    bootstrap_stats = []
    for _ in range(n_bootstrap):
        resample = np.random.choice(data, size=n, replace=True)
        bootstrap_stats.append(statistic_fn(resample))

    return np.array(bootstrap_stats)


# Example
np.random.seed(42)
revenue = np.random.lognormal(3, 1.5, 500)

boot_means = simple_bootstrap(revenue, np.mean, n_bootstrap=2000)

print(f"Original mean: ${np.mean(revenue):.2f}")
print(f"Bootstrap SE: ${np.std(boot_means):.2f}")
print(f"95% CI (percentile): (${np.percentile(boot_means, 2.5):.2f}, ${np.percentile(boot_means, 97.5):.2f})")

Why Bootstrap for Heavy Tails?

Standard Methods Fail

For heavy-tailed data (the quick check after this list makes the problem concrete):

  • CLT convergence is slow: At practical sample sizes, the sample mean's distribution is still noticeably skewed
  • Variance estimates are unstable: The SE depends heavily on a few extreme values
  • Parametric assumptions are wrong: The data isn't normal
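
To make the slow convergence concrete, here's a minimal sketch that estimates the coverage of the textbook t-based 95% CI on lognormal data (μ = 3, σ = 1.5, mirroring this post's examples; exact numbers vary by seed, but coverage comes in noticeably below the nominal 95%):

import numpy as np
from scipy import stats


def t_ci_coverage(mu=3.0, sigma=1.5, n=100, n_sims=2000, seed=0):
    """Fraction of t-based 95% CIs that cover the true lognormal mean."""
    rng = np.random.default_rng(seed)
    true_mean = np.exp(mu + sigma**2 / 2)  # analytical lognormal mean
    hits = 0
    for _ in range(n_sims):
        x = rng.lognormal(mu, sigma, n)
        half_width = stats.t.ppf(0.975, n - 1) * x.std(ddof=1) / np.sqrt(n)
        hits += (x.mean() - half_width) <= true_mean <= (x.mean() + half_width)
    return hits / n_sims


print(f"t-interval coverage at n=100 (nominal 0.95): {t_ci_coverage():.3f}")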

Bootstrap Advantages

  1. No distributional assumptions: Works with any shape
  2. Handles skew naturally: CI can be asymmetric
  3. Works for any statistic: Mean, median, quantiles, ratios
  4. Captures sampling variability: Including from rare extreme values

Bootstrap Limitations

  1. Needs representative sample: Can't create variability not in data
  2. Requires adequate sample size: Small n = unreliable
  3. Computationally intensive: Thousands of resamples needed
  4. Not magic: Inherits sample biases

Confidence Interval Methods

Percentile Method

Simplest approach: CI bounds are percentiles of bootstrap distribution.

def percentile_ci(bootstrap_stats, alpha=0.05):
    """Simple percentile confidence interval."""
    lower = np.percentile(bootstrap_stats, 100 * alpha / 2)
    upper = np.percentile(bootstrap_stats, 100 * (1 - alpha / 2))
    return lower, upper

Pros: Simple, intuitive.
Cons: Can have poor coverage for skewed statistics.

BCa (Bias-Corrected and Accelerated)

Corrects for:

  • Bias: When bootstrap distribution center ≠ original estimate
  • Acceleration: When SE varies with parameter value

from scipy import stats


def bca_ci(data, statistic_fn, n_bootstrap=2000, alpha=0.05):
    """
    BCa bootstrap confidence interval.

    More accurate than percentile method for skewed distributions.
    """
    data = np.asarray(data)
    n = len(data)

    # Original estimate
    theta_hat = statistic_fn(data)

    # Bootstrap distribution
    theta_boot = []
    for _ in range(n_bootstrap):
        resample = np.random.choice(data, size=n, replace=True)
        theta_boot.append(statistic_fn(resample))
    theta_boot = np.array(theta_boot)

    # Bias correction: z0
    prop_less = np.mean(theta_boot < theta_hat)
    z0 = stats.norm.ppf(prop_less) if 0 < prop_less < 1 else 0

    # Acceleration: a (from jackknife)
    theta_jack = np.array([
        statistic_fn(np.delete(data, i)) for i in range(n)
    ])
    theta_jack_mean = theta_jack.mean()
    num = np.sum((theta_jack_mean - theta_jack) ** 3)
    denom = 6 * (np.sum((theta_jack_mean - theta_jack) ** 2) ** 1.5)
    a = num / denom if denom != 0 else 0

    # Adjusted percentiles
    z_alpha_lower = stats.norm.ppf(alpha / 2)
    z_alpha_upper = stats.norm.ppf(1 - alpha / 2)

    def adjusted_percentile(z_alpha):
        numerator = z0 + z_alpha
        denominator = 1 - a * numerator
        return stats.norm.cdf(z0 + numerator / denominator)

    p_lower = adjusted_percentile(z_alpha_lower)
    p_upper = adjusted_percentile(z_alpha_upper)

    # Bounds from bootstrap distribution
    ci_lower = np.percentile(theta_boot, 100 * p_lower)
    ci_upper = np.percentile(theta_boot, 100 * p_upper)

    return {
        'estimate': theta_hat,
        'ci_lower': ci_lower,
        'ci_upper': ci_upper,
        'bias_correction': z0,
        'acceleration': a,
        'bootstrap_se': np.std(theta_boot)
    }


# Compare methods on skewed data
np.random.seed(42)
revenue = np.random.lognormal(3, 1.5, 200)

# Percentile
boot_stats = simple_bootstrap(revenue, np.mean, n_bootstrap=5000)
pct_ci = percentile_ci(boot_stats)

# BCa
bca_result = bca_ci(revenue, np.mean, n_bootstrap=5000)

print("Confidence Interval Comparison")
print("=" * 50)
print(f"Sample mean: ${np.mean(revenue):.2f}")
print(f"\nPercentile 95% CI: (${pct_ci[0]:.2f}, ${pct_ci[1]:.2f})")
print(f"  Width: ${pct_ci[1] - pct_ci[0]:.2f}")
print(f"\nBCa 95% CI: (${bca_result['ci_lower']:.2f}, ${bca_result['ci_upper']:.2f})")
print(f"  Width: ${bca_result['ci_upper'] - bca_result['ci_lower']:.2f}")
print(f"  Bias correction (z0): {bca_result['bias_correction']:.3f}")
print(f"  Acceleration (a): {bca_result['acceleration']:.4f}")

How Many Bootstrap Samples?

Guidelines

Purpose           Minimum    Recommended    Heavy Tails
Standard error    200        500            1000
95% CI bounds     1000       2000           5000
99% CI bounds     2000       5000           10000
p-values          5000       10000          20000

Monte Carlo Error

Bootstrap estimates are themselves random: rerun the same bootstrap with a different seed and the CI bounds shift slightly. This Monte Carlo error shrinks roughly like 1/√B, so quadrupling the number of resamples halves it.

def bootstrap_stability_check(data, statistic_fn, resample_sizes=[500, 1000, 2000, 5000, 10000]):
    """
    Check how stable bootstrap CI is across different resample sizes.
    """
    results = []

    for B in resample_sizes:
        # Run 10 times to see variability
        cis = []
        for _ in range(10):
            boot_stats = simple_bootstrap(data, statistic_fn, n_bootstrap=B)
            ci = percentile_ci(boot_stats)
            cis.append(ci)

        cis = np.array(cis)
        lower_std = cis[:, 0].std()
        upper_std = cis[:, 1].std()

        results.append({
            'B': B,
            'lower_bound_sd': lower_std,
            'upper_bound_sd': upper_std
        })

    return results


# Example: Heavy-tailed revenue
np.random.seed(42)
revenue = np.random.lognormal(3, 1.5, 300)

stability = bootstrap_stability_check(revenue, np.mean)
print("Bootstrap CI Stability (SD of bounds across 10 runs)")
print("=" * 50)
print(f"{'B':<10} {'Lower SD':<15} {'Upper SD':<15}")
print("-" * 50)
for r in stability:
    print(f"{r['B']:<10} ${r['lower_bound_sd']:<14.2f} ${r['upper_bound_sd']:<14.2f}")

When Bootstrap Fails

Problem 1: Sample Doesn't Represent Population

If your sample misses rare extreme values, bootstrap can't recover them.

def demonstrate_missing_extremes(true_sigma=2.0, sample_size=100, n_simulations=500):
    """
    Show how bootstrap misses extreme values in small samples.
    """
    np.random.seed(42)

    coverage = []
    ci_widths = []

    for _ in range(n_simulations):
        # Generate heavy-tailed data
        sample = np.random.lognormal(3, true_sigma, sample_size)

        # True population mean (analytical)
        true_mean = np.exp(3 + true_sigma**2 / 2)

        # Bootstrap CI
        result = bca_ci(sample, np.mean, n_bootstrap=2000)

        # Check coverage
        covered = result['ci_lower'] <= true_mean <= result['ci_upper']
        coverage.append(covered)
        ci_widths.append(result['ci_upper'] - result['ci_lower'])

    return {
        'coverage': np.mean(coverage) * 100,
        'mean_width': np.mean(ci_widths),
        'true_mean': np.exp(3 + true_sigma**2 / 2)
    }


# Compare coverage at different sample sizes
print("Bootstrap Coverage by Sample Size (Heavy-Tailed Data)")
print("=" * 60)
print(f"{'n':<10} {'Coverage':<15} {'Mean CI Width':<20}")
print("-" * 60)

for n in [50, 100, 200, 500, 1000]:
    result = demonstrate_missing_extremes(true_sigma=2.0, sample_size=n, n_simulations=300)
    print(f"{n:<10} {result['coverage']:.1f}%{'':<10} ${result['mean_width']:.2f}")

print(f"\nTrue mean: ${np.exp(3 + 2**2/2):.2f}")
print("Note: 95% CI should cover 95% of the time. Undercoverage indicates bootstrap failure.")

Problem 2: Non-Smooth Statistics

Bootstrap struggles with:

  • Maximum/minimum: Discrete bootstrap distribution
  • Extreme quantiles with small n: Few observations in region
  • Discontinuous statistics: Mode, indicators

# Example: Bootstrap for max is problematic
np.random.seed(42)
data = np.random.lognormal(3, 1, 100)

boot_max = simple_bootstrap(data, np.max, n_bootstrap=2000)

print("Bootstrap for Maximum (Problematic)")
print("=" * 50)
print(f"Sample max: ${np.max(data):.2f}")
print(f"Bootstrap distribution of max:")
print(f"  Unique values: {len(np.unique(boot_max))}")
print(f"  (Should be continuous for valid CI)")

Problem 3: Very Small Samples

Rule of thumb: n > 50 for basic bootstrap, n > 100 for heavy tails.
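
To put numbers on that rule, here's a minimal sketch that extends the coverage check from Problem 1 down to smaller samples, using the faster percentile CI (settings are illustrative; exact coverage varies by seed):

def percentile_coverage(n, mu=3.0, sigma=1.5, n_sims=500, n_bootstrap=2000, seed=1):
    """Fraction of percentile 95% CIs that cover the true lognormal mean."""
    rng = np.random.default_rng(seed)
    true_mean = np.exp(mu + sigma**2 / 2)
    hits = 0
    for _ in range(n_sims):
        sample = rng.lognormal(mu, sigma, n)
        boot_means = np.array([
            rng.choice(sample, size=n, replace=True).mean()
            for _ in range(n_bootstrap)
        ])
        lo, hi = np.percentile(boot_means, [2.5, 97.5])
        hits += lo <= true_mean <= hi
    return hits / n_sims


for n in [20, 50, 200]:
    print(f"n = {n:>3}: empirical coverage {percentile_coverage(n):.2f}")

Expect coverage to fall well short of 95% at n = 20 and improve steadily as n grows.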


Comparing Groups with Bootstrap

Two-Sample Bootstrap

def bootstrap_two_sample(group1, group2, statistic='mean_diff', n_bootstrap=5000):
    """
    Bootstrap comparison of two groups.

    Parameters:
    -----------
    group1, group2 : arrays
        Data for each group
    statistic : str
        'mean_diff', 'ratio', or 'lift'
    n_bootstrap : int
        Number of resamples

    Returns:
    --------
    dict with estimate, CI, and p-value
    """
    g1 = np.asarray(group1)
    g2 = np.asarray(group2)

    # Point estimates
    if statistic == 'mean_diff':
        point_est = np.mean(g2) - np.mean(g1)
    elif statistic == 'ratio':
        point_est = np.mean(g2) / np.mean(g1)
    elif statistic == 'lift':
        point_est = (np.mean(g2) - np.mean(g1)) / np.mean(g1)
    else:
        raise ValueError(f"Unknown statistic: {statistic}")

    # Bootstrap
    boot_stats = []
    for _ in range(n_bootstrap):
        boot_g1 = np.random.choice(g1, len(g1), replace=True)
        boot_g2 = np.random.choice(g2, len(g2), replace=True)

        if statistic == 'mean_diff':
            boot_stat = np.mean(boot_g2) - np.mean(boot_g1)
        elif statistic == 'ratio':
            boot_stat = np.mean(boot_g2) / np.mean(boot_g1) if np.mean(boot_g1) > 0 else np.nan
        elif statistic == 'lift':
            boot_stat = (np.mean(boot_g2) - np.mean(boot_g1)) / np.mean(boot_g1) if np.mean(boot_g1) > 0 else np.nan

        if not np.isnan(boot_stat):
            boot_stats.append(boot_stat)

    boot_stats = np.array(boot_stats)

    # CI
    ci = np.percentile(boot_stats, [2.5, 97.5])

    # Approximate two-sided p-value by CI inversion: how extreme is the null
    # value (0 for mean_diff/lift, 1 for ratio) within the bootstrap distribution?
    null_value = 0 if statistic in ['mean_diff', 'lift'] else 1
    p_value = 2 * min(
        np.mean(boot_stats <= null_value),
        np.mean(boot_stats >= null_value)
    )

    return {
        'estimate': point_est,
        'ci_lower': ci[0],
        'ci_upper': ci[1],
        'p_value': min(p_value, 1.0),
        'bootstrap_se': np.std(boot_stats)
    }


# Example: Revenue comparison
np.random.seed(42)

control = np.random.lognormal(3, 1.2, 1500)
treatment = np.random.lognormal(3.1, 1.2, 1500)  # 10% higher geometric mean

print("Bootstrap Two-Sample Comparison")
print("=" * 60)

for stat in ['mean_diff', 'ratio', 'lift']:
    result = bootstrap_two_sample(control, treatment, statistic=stat, n_bootstrap=5000)
    print(f"\n{stat.replace('_', ' ').title()}:")
    if stat == 'mean_diff':
        print(f"  Estimate: ${result['estimate']:.2f}")
        print(f"  95% CI: (${result['ci_lower']:.2f}, ${result['ci_upper']:.2f})")
    elif stat == 'ratio':
        print(f"  Estimate: {result['estimate']:.3f}")
        print(f"  95% CI: ({result['ci_lower']:.3f}, {result['ci_upper']:.3f})")
    else:
        print(f"  Estimate: {result['estimate']:.1%}")
        print(f"  95% CI: ({result['ci_lower']:.1%}, {result['ci_upper']:.1%})")
    print(f"  p-value: {result['p_value']:.4f}")

Visualizing Bootstrap Results

import matplotlib.pyplot as plt


def plot_bootstrap_diagnostics(data, statistic_fn, statistic_name='Statistic', n_bootstrap=5000):
    """
    Diagnostic plots for bootstrap.
    """
    boot_stats = simple_bootstrap(data, statistic_fn, n_bootstrap)
    point_est = statistic_fn(data)

    fig, axes = plt.subplots(1, 3, figsize=(15, 4))

    # 1. Bootstrap distribution
    ax1 = axes[0]
    ax1.hist(boot_stats, bins=50, edgecolor='white', alpha=0.7)
    ax1.axvline(point_est, color='red', linestyle='--', label='Original estimate')
    ax1.axvline(np.mean(boot_stats), color='blue', linestyle=':', label='Bootstrap mean')

    ci = np.percentile(boot_stats, [2.5, 97.5])
    ax1.axvline(ci[0], color='green', linestyle='-', label='95% CI')
    ax1.axvline(ci[1], color='green', linestyle='-')

    ax1.set_xlabel(statistic_name)
    ax1.set_ylabel('Count')
    ax1.set_title('Bootstrap Distribution')
    ax1.legend()

    # 2. Q-Q plot against normal
    ax2 = axes[1]
    from scipy import stats
    stats.probplot(boot_stats, dist="norm", plot=ax2)
    ax2.set_title('Q-Q Plot (vs. Normal)')

    # 3. Running CI stability
    ax3 = axes[2]
    cumulative_lower = []
    cumulative_upper = []
    check_points = range(100, n_bootstrap + 1, 100)

    for b in check_points:
        ci = np.percentile(boot_stats[:b], [2.5, 97.5])
        cumulative_lower.append(ci[0])
        cumulative_upper.append(ci[1])

    ax3.plot(check_points, cumulative_lower, label='Lower bound')
    ax3.plot(check_points, cumulative_upper, label='Upper bound')
    ax3.axhline(np.percentile(boot_stats, 2.5), color='grey', linestyle='--', alpha=0.5)
    ax3.axhline(np.percentile(boot_stats, 97.5), color='grey', linestyle='--', alpha=0.5)
    ax3.set_xlabel('Number of Bootstrap Samples')
    ax3.set_ylabel('CI Bound')
    ax3.set_title('CI Stability')
    ax3.legend()

    plt.tight_layout()
    return fig


# Example
np.random.seed(42)
revenue = np.random.lognormal(3, 1.5, 500)

fig = plot_bootstrap_diagnostics(revenue, np.mean, 'Mean Revenue ($)', n_bootstrap=5000)
plt.show()

Best Practices Checklist

Before Bootstrapping

  • Sample size adequate (n > 50, ideally > 100)
  • Sample represents population (no selection bias)
  • Statistic is smooth (not max/min/mode)
  • Data is independent (or account for clustering)

During Bootstrapping

  • Use enough resamples (5000+ for CI, 10000+ for p-values)
  • Use BCa for skewed statistics
  • Resample at the correct level (users, not observations, if clustered; see the sketch below)
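
Here's a minimal sketch of that last point: a user-level (cluster) bootstrap. The values and user_ids arrays are hypothetical stand-ins for event-level data (one entry per event, with repeated user_ids for repeat users):

def cluster_bootstrap_mean(values, user_ids, n_bootstrap=5000, seed=0):
    """Bootstrap the mean by resampling whole users, not individual events."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values)
    user_ids = np.asarray(user_ids)
    unique_users = np.unique(user_ids)
    # Index each user's events once, up front
    groups = {u: values[user_ids == u] for u in unique_users}
    boot_means = np.empty(n_bootstrap)
    for b in range(n_bootstrap):
        sampled = rng.choice(unique_users, size=len(unique_users), replace=True)
        boot_means[b] = np.concatenate([groups[u] for u in sampled]).mean()
    return boot_means

Resampling whole users preserves the within-user correlation that observation-level resampling would break, which keeps the SE honest when heavy spenders contribute many events.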

After Bootstrapping

  • Visualize bootstrap distribution (should be smooth)
  • Check CI stability across resample sizes
  • Compare with other methods as a sanity check (quick example below)
  • Report bootstrap SE and CI method used
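
For the sanity-check item, one quick comparison (assuming the revenue sample and bca_result from earlier in this post) is the classical t-interval; rough agreement is reassuring, while large divergence signals that the tails matter:

from scipy import stats

sem = revenue.std(ddof=1) / np.sqrt(len(revenue))
t_ci = stats.t.interval(0.95, len(revenue) - 1, loc=revenue.mean(), scale=sem)
print(f"t-interval:    (${t_ci[0]:.2f}, ${t_ci[1]:.2f})")
print(f"BCa bootstrap: (${bca_result['ci_lower']:.2f}, ${bca_result['ci_upper']:.2f})")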

R Implementation

library(boot)

# Basic bootstrap
boot_fn <- function(data, indices) {
    mean(data[indices])
}

boot_result <- boot(revenue, boot_fn, R = 5000)

# BCa confidence interval
boot.ci(boot_result, type = "bca")

# Two-sample comparison
boot_diff <- function(data, indices) {
    d <- data[indices, ]
    mean(d$revenue[d$group == "treatment"]) -
        mean(d$revenue[d$group == "control"])
}

combined <- data.frame(
    revenue = c(control, treatment),
    group = rep(c("control", "treatment"), c(length(control), length(treatment)))
)

boot_result <- boot(combined, boot_diff, R = 5000)
boot.ci(boot_result, type = "bca")


Key Takeaway

Bootstrap is essential for heavy-tailed metrics—it handles skew naturally and makes no distributional assumptions. But bootstrap isn't magic: it needs adequate sample sizes, enough resamples (5000+ for heavy tails), and appropriate CI methods (BCa for skewed statistics). Always visualize your bootstrap distribution and check stability. When your sample is small or might miss rare extremes, bootstrap CIs will be too narrow—consider this when interpreting results. Used properly, bootstrap provides reliable inference for the difficult metrics that standard methods can't handle.



Frequently Asked Questions

How many bootstrap samples do I need?

For standard errors: 200-500. For 95% CI bounds: 2000-5000. For p-values: 10000+. Heavy-tailed data benefits from more resamples because the bootstrap distribution is itself variable. When in doubt, use more.

Does bootstrap work for any distribution?

Bootstrap requires the sample to represent the population. For heavy-tailed data with small samples, your sample may miss rare large values, making bootstrap CIs too narrow. It also struggles when the statistic isn't smooth (like max or certain quantiles with small n).

What's the difference between percentile and BCa bootstrap?

Percentile method just takes quantiles of the bootstrap distribution. BCa corrects for bias (when bootstrap mean ≠ original estimate) and acceleration (when SE varies with parameter value). BCa has better coverage for skewed distributions but requires more computation.

