Bootstrap for Heavy-Tailed Metrics: Best Practices and Gotchas
How to use bootstrap correctly for revenue, engagement, and other heavy-tailed metrics. Learn about BCa intervals, when bootstrap fails, and how many resamples you actually need.
Quick Hits
- Bootstrap isn't magic: it inherits problems from your sample
- BCa (bias-corrected and accelerated) intervals outperform the percentile method for skewed data
- Heavy tails need more resamples: 5000+ for CI bounds, 10000+ for p-values
- Small samples plus heavy tails make bootstrap unreliable (need n > 50-100)
- Always check the bootstrap distribution's shape: if it's bimodal or otherwise weird, investigate
TL;DR
Bootstrap is the go-to method for inference on heavy-tailed metrics like revenue—it makes no distributional assumptions and handles skew naturally. But bootstrap has limitations: it requires your sample to represent the population, struggles with small samples and extreme tails, and needs careful implementation (BCa intervals, enough resamples). This guide covers best practices for bootstrapping heavy-tailed product metrics.
Bootstrap Basics Review
The Core Idea
- Treat your sample as the population
- Resample with replacement (same size as original)
- Compute your statistic on each resample
- Use the distribution of resampled statistics for inference
import numpy as np
def simple_bootstrap(data, statistic_fn, n_bootstrap=2000):
"""
Basic bootstrap for any statistic.
Parameters:
-----------
data : array-like
Original sample
statistic_fn : callable
Function to compute statistic (e.g., np.mean)
n_bootstrap : int
Number of resamples
Returns:
--------
array of bootstrap statistics
"""
data = np.asarray(data)
n = len(data)
bootstrap_stats = []
for _ in range(n_bootstrap):
resample = np.random.choice(data, size=n, replace=True)
bootstrap_stats.append(statistic_fn(resample))
return np.array(bootstrap_stats)
# Example
np.random.seed(42)
revenue = np.random.lognormal(3, 1.5, 500)
boot_means = simple_bootstrap(revenue, np.mean, n_bootstrap=2000)
print(f"Original mean: ${np.mean(revenue):.2f}")
print(f"Bootstrap SE: ${np.std(boot_means):.2f}")
print(f"95% CI (percentile): (${np.percentile(boot_means, 2.5):.2f}, ${np.percentile(boot_means, 97.5):.2f})")
Why Bootstrap for Heavy Tails?
Standard Methods Fail
For heavy-tailed data:
- CLT convergence is slow: at realistic sample sizes, the sampling distribution of the mean is still visibly skewed (see the simulation sketch after this list)
- Variance estimates are unstable: the SE depends heavily on a few extreme values
- Parametric assumptions are wrong: the data isn't remotely normal
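To make the first point concrete, here's a minimal simulation sketch (assuming lognormal(3, 1.5) revenue, a typical heavy-tailed shape) that checks how often a nominal 95% t-interval actually covers the true mean:
import numpy as np
from scipy import stats
def t_interval_coverage(n, sigma=1.5, n_sims=1000, seed=0):
    """Empirical coverage of a nominal 95% t-interval under lognormal data."""
    rng = np.random.default_rng(seed)
    true_mean = np.exp(3 + sigma**2 / 2)  # analytical mean of lognormal(3, sigma)
    covered = 0
    for _ in range(n_sims):
        sample = rng.lognormal(3, sigma, n)
        half_width = stats.t.ppf(0.975, df=n - 1) * sample.std(ddof=1) / np.sqrt(n)
        covered += (sample.mean() - half_width) <= true_mean <= (sample.mean() + half_width)
    return covered / n_sims
for n in [50, 200, 1000]:
    print(f"n={n}: t-interval coverage = {t_interval_coverage(n):.1%}")
Coverage meaningfully below 95% at small n is the slow CLT at work; the bootstrap methods below are the workaround.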
Bootstrap Advantages
- No distributional assumptions: Works with any shape
- Handles skew naturally: CI can be asymmetric
- Works for any statistic: Mean, median, quantiles, ratios
- Captures sampling variability: Including from rare extreme values
Bootstrap Limitations
- Needs representative sample: Can't create variability not in data
- Requires adequate sample size: Small n = unreliable
- Computationally intensive: Thousands of resamples needed
- Not magic: Inherits sample biases
Confidence Interval Methods
Percentile Method
Simplest approach: CI bounds are percentiles of bootstrap distribution.
def percentile_ci(bootstrap_stats, alpha=0.05):
"""Simple percentile confidence interval."""
lower = np.percentile(bootstrap_stats, 100 * alpha / 2)
upper = np.percentile(bootstrap_stats, 100 * (1 - alpha / 2))
return lower, upper
Pros: Simple, intuitive
Cons: Can have poor coverage for skewed statistics
BCa (Bias-Corrected and Accelerated)
Corrects for:
- Bias: When bootstrap distribution center ≠ original estimate
- Acceleration: When SE varies with parameter value
from scipy import stats
def bca_ci(data, statistic_fn, n_bootstrap=2000, alpha=0.05):
"""
BCa bootstrap confidence interval.
More accurate than percentile method for skewed distributions.
"""
data = np.asarray(data)
n = len(data)
# Original estimate
theta_hat = statistic_fn(data)
# Bootstrap distribution
theta_boot = []
for _ in range(n_bootstrap):
resample = np.random.choice(data, size=n, replace=True)
theta_boot.append(statistic_fn(resample))
theta_boot = np.array(theta_boot)
# Bias correction: z0
prop_less = np.mean(theta_boot < theta_hat)
z0 = stats.norm.ppf(prop_less) if 0 < prop_less < 1 else 0
# Acceleration: a (from jackknife)
theta_jack = np.array([
statistic_fn(np.delete(data, i)) for i in range(n)
])
theta_jack_mean = theta_jack.mean()
num = np.sum((theta_jack_mean - theta_jack) ** 3)
denom = 6 * (np.sum((theta_jack_mean - theta_jack) ** 2) ** 1.5)
a = num / denom if denom != 0 else 0
# Adjusted percentiles
z_alpha_lower = stats.norm.ppf(alpha / 2)
z_alpha_upper = stats.norm.ppf(1 - alpha / 2)
def adjusted_percentile(z_alpha):
numerator = z0 + z_alpha
denominator = 1 - a * numerator
return stats.norm.cdf(z0 + numerator / denominator)
p_lower = adjusted_percentile(z_alpha_lower)
p_upper = adjusted_percentile(z_alpha_upper)
# Bounds from bootstrap distribution
ci_lower = np.percentile(theta_boot, 100 * p_lower)
ci_upper = np.percentile(theta_boot, 100 * p_upper)
return {
'estimate': theta_hat,
'ci_lower': ci_lower,
'ci_upper': ci_upper,
'bias_correction': z0,
'acceleration': a,
'bootstrap_se': np.std(theta_boot)
}
# Compare methods on skewed data
np.random.seed(42)
revenue = np.random.lognormal(3, 1.5, 200)
# Percentile
boot_stats = simple_bootstrap(revenue, np.mean, n_bootstrap=5000)
pct_ci = percentile_ci(boot_stats)
# BCa
bca_result = bca_ci(revenue, np.mean, n_bootstrap=5000)
print("Confidence Interval Comparison")
print("=" * 50)
print(f"Sample mean: ${np.mean(revenue):.2f}")
print(f"\nPercentile 95% CI: (${pct_ci[0]:.2f}, ${pct_ci[1]:.2f})")
print(f" Width: ${pct_ci[1] - pct_ci[0]:.2f}")
print(f"\nBCa 95% CI: (${bca_result['ci_lower']:.2f}, ${bca_result['ci_upper']:.2f})")
print(f" Width: ${bca_result['ci_upper'] - bca_result['ci_lower']:.2f}")
print(f" Bias correction (z0): {bca_result['bias_correction']:.3f}")
print(f" Acceleration (a): {bca_result['acceleration']:.4f}")
How Many Bootstrap Samples?
Guidelines
| Purpose | Minimum | Recommended | Heavy Tails |
|---|---|---|---|
| Standard error | 200 | 500 | 1000 |
| 95% CI bounds | 1000 | 2000 | 5000 |
| 99% CI bounds | 2000 | 5000 | 10000 |
| p-values | 5000 | 10000 | 20000 |
Monte Carlo Error
Bootstrap estimates are random. More resamples = less Monte Carlo error.
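A back-of-envelope way to see this: the tail proportion behind a percentile bound is a binomial proportion, so its Monte Carlo SE is roughly sqrt(alpha * (1 - alpha) / B) and shrinks like 1/sqrt(B). A quick sketch:
import numpy as np
# Monte Carlo SE of the estimated tail proportion behind a CI bound.
# Quadrupling B halves the error.
alpha = 0.025  # one tail of a 95% CI
for B in [1000, 2000, 5000, 10000]:
    print(f"B={B}: MC SE of tail proportion ~ {np.sqrt(alpha * (1 - alpha) / B):.4f}")
For the CI bounds themselves (in dollars), an empirical stability check is more direct: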
def bootstrap_stability_check(data, statistic_fn, resample_sizes=(500, 1000, 2000, 5000, 10000)):
"""
Check how stable bootstrap CI is across different resample sizes.
"""
results = []
for B in resample_sizes:
# Run 10 times to see variability
cis = []
for _ in range(10):
boot_stats = simple_bootstrap(data, statistic_fn, n_bootstrap=B)
ci = percentile_ci(boot_stats)
cis.append(ci)
cis = np.array(cis)
lower_std = cis[:, 0].std()
upper_std = cis[:, 1].std()
results.append({
'B': B,
'lower_bound_sd': lower_std,
'upper_bound_sd': upper_std
})
return results
# Example: Heavy-tailed revenue
np.random.seed(42)
revenue = np.random.lognormal(3, 1.5, 300)
stability = bootstrap_stability_check(revenue, np.mean)
print("Bootstrap CI Stability (SD of bounds across 10 runs)")
print("=" * 50)
print(f"{'B':<10} {'Lower SD':<15} {'Upper SD':<15}")
print("-" * 50)
for r in stability:
print(f"{r['B']:<10} ${r['lower_bound_sd']:<14.2f} ${r['upper_bound_sd']:<14.2f}")
When Bootstrap Fails
Problem 1: Sample Doesn't Represent Population
If your sample misses rare extreme values, bootstrap can't recover them.
def demonstrate_missing_extremes(true_sigma=2.0, sample_size=100, n_simulations=500):
"""
Show how bootstrap misses extreme values in small samples.
"""
np.random.seed(42)
coverage = []
ci_widths = []
for _ in range(n_simulations):
# Generate heavy-tailed data
sample = np.random.lognormal(3, true_sigma, sample_size)
# True population mean (analytical)
true_mean = np.exp(3 + true_sigma**2 / 2)
# Bootstrap CI
result = bca_ci(sample, np.mean, n_bootstrap=2000)
# Check coverage
covered = result['ci_lower'] <= true_mean <= result['ci_upper']
coverage.append(covered)
ci_widths.append(result['ci_upper'] - result['ci_lower'])
return {
'coverage': np.mean(coverage) * 100,
'mean_width': np.mean(ci_widths),
'true_mean': np.exp(3 + true_sigma**2 / 2)
}
# Compare coverage at different sample sizes
print("Bootstrap Coverage by Sample Size (Heavy-Tailed Data)")
print("=" * 60)
print(f"{'n':<10} {'Coverage':<15} {'Mean CI Width':<20}")
print("-" * 60)
for n in [50, 100, 200, 500, 1000]:
result = demonstrate_missing_extremes(true_sigma=2.0, sample_size=n, n_simulations=300)
print(f"{n:<10} {result['coverage']:.1f}%{'':<10} ${result['mean_width']:.2f}")
print(f"\nTrue mean: ${np.exp(3 + 2**2/2):.2f}")
print("Note: 95% CI should cover 95% of the time. Undercoverage indicates bootstrap failure.")
Problem 2: Non-Smooth Statistics
Bootstrap struggles with:
- Maximum/minimum: Discrete bootstrap distribution
- Extreme quantiles with small n: Few observations in region
- Discontinuous statistics: Mode, indicators
# Example: Bootstrap for max is problematic
np.random.seed(42)
data = np.random.lognormal(3, 1, 100)
boot_max = simple_bootstrap(data, np.max, n_bootstrap=2000)
print("Bootstrap for Maximum (Problematic)")
print("=" * 50)
print(f"Sample max: ${np.max(data):.2f}")
print(f"Bootstrap distribution of max:")
print(f" Unique values: {len(np.unique(boot_max))}")
print(f" (Should be continuous for valid CI)")
Problem 3: Very Small Samples
Rule of thumb: n > 50 for basic bootstrap, n > 100 for heavy tails.
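A minimal sketch of why, assuming lognormal(3, 2) data (the helper below is illustrative, not from any library): with tiny samples the typical draw misses the tail, so the average bootstrap SE sits well below the true SE of the mean.
import numpy as np
def bootstrap_se_vs_truth(n, mu=3.0, sigma=2.0, n_sims=300, B=1000, seed=1):
    """Average bootstrap SE vs. the analytical SE of the mean for small n."""
    rng = np.random.default_rng(seed)
    # Analytical SD of a lognormal(mu, sigma)
    true_sd = np.sqrt((np.exp(sigma**2) - 1) * np.exp(2 * mu + sigma**2))
    boot_ses = []
    for _ in range(n_sims):
        sample = rng.lognormal(mu, sigma, n)
        boot_means = [rng.choice(sample, n, replace=True).mean() for _ in range(B)]
        boot_ses.append(np.std(boot_means))
    return np.mean(boot_ses), true_sd / np.sqrt(n)
for n in [20, 50, 100]:
    avg_boot_se, true_se = bootstrap_se_vs_truth(n)
    print(f"n={n}: avg bootstrap SE ${avg_boot_se:.0f} vs. true SE ${true_se:.0f}")
The gap shrinks as n grows, which is the rationale for the rule of thumb above.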
Comparing Groups with Bootstrap
Two-Sample Bootstrap
def bootstrap_two_sample(group1, group2, statistic='mean_diff', n_bootstrap=5000):
"""
Bootstrap comparison of two groups.
Parameters:
-----------
group1, group2 : arrays
Data for each group
statistic : str
'mean_diff', 'ratio', or 'lift'
n_bootstrap : int
Number of resamples
Returns:
--------
dict with estimate, CI, and p-value
"""
g1 = np.asarray(group1)
g2 = np.asarray(group2)
# Point estimates
    if statistic == 'mean_diff':
        point_est = np.mean(g2) - np.mean(g1)
    elif statistic == 'ratio':
        point_est = np.mean(g2) / np.mean(g1)
    elif statistic == 'lift':
        point_est = (np.mean(g2) - np.mean(g1)) / np.mean(g1)
    else:
        raise ValueError(f"statistic must be 'mean_diff', 'ratio', or 'lift', got {statistic!r}")
# Bootstrap
boot_stats = []
for _ in range(n_bootstrap):
boot_g1 = np.random.choice(g1, len(g1), replace=True)
boot_g2 = np.random.choice(g2, len(g2), replace=True)
if statistic == 'mean_diff':
boot_stat = np.mean(boot_g2) - np.mean(boot_g1)
elif statistic == 'ratio':
boot_stat = np.mean(boot_g2) / np.mean(boot_g1) if np.mean(boot_g1) > 0 else np.nan
elif statistic == 'lift':
boot_stat = (np.mean(boot_g2) - np.mean(boot_g1)) / np.mean(boot_g1) if np.mean(boot_g1) > 0 else np.nan
if not np.isnan(boot_stat):
boot_stats.append(boot_stat)
boot_stats = np.array(boot_stats)
# CI
ci = np.percentile(boot_stats, [2.5, 97.5])
# P-value (two-sided test against null of 0 difference or ratio of 1)
null_value = 0 if statistic in ['mean_diff', 'lift'] else 1
p_value = 2 * min(
np.mean(boot_stats <= null_value),
np.mean(boot_stats >= null_value)
)
return {
'estimate': point_est,
'ci_lower': ci[0],
'ci_upper': ci[1],
'p_value': min(p_value, 1.0),
'bootstrap_se': np.std(boot_stats)
}
# Example: Revenue comparison
np.random.seed(42)
control = np.random.lognormal(3, 1.2, 1500)
treatment = np.random.lognormal(3.1, 1.2, 1500) # 10% higher geometric mean
print("Bootstrap Two-Sample Comparison")
print("=" * 60)
for stat in ['mean_diff', 'ratio', 'lift']:
result = bootstrap_two_sample(control, treatment, statistic=stat, n_bootstrap=5000)
print(f"\n{stat.replace('_', ' ').title()}:")
if stat == 'mean_diff':
print(f" Estimate: ${result['estimate']:.2f}")
print(f" 95% CI: (${result['ci_lower']:.2f}, ${result['ci_upper']:.2f})")
elif stat == 'ratio':
print(f" Estimate: {result['estimate']:.3f}")
print(f" 95% CI: ({result['ci_lower']:.3f}, {result['ci_upper']:.3f})")
else:
print(f" Estimate: {result['estimate']:.1%}")
print(f" 95% CI: ({result['ci_lower']:.1%}, {result['ci_upper']:.1%})")
print(f" p-value: {result['p_value']:.4f}")
Visualizing Bootstrap Results
import matplotlib.pyplot as plt
def plot_bootstrap_diagnostics(data, statistic_fn, statistic_name='Statistic', n_bootstrap=5000):
"""
Diagnostic plots for bootstrap.
"""
boot_stats = simple_bootstrap(data, statistic_fn, n_bootstrap)
point_est = statistic_fn(data)
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
# 1. Bootstrap distribution
ax1 = axes[0]
ax1.hist(boot_stats, bins=50, edgecolor='white', alpha=0.7)
ax1.axvline(point_est, color='red', linestyle='--', label='Original estimate')
ax1.axvline(np.mean(boot_stats), color='blue', linestyle=':', label='Bootstrap mean')
ci = np.percentile(boot_stats, [2.5, 97.5])
ax1.axvline(ci[0], color='green', linestyle='-', label='95% CI')
ax1.axvline(ci[1], color='green', linestyle='-')
ax1.set_xlabel(statistic_name)
ax1.set_ylabel('Count')
ax1.set_title('Bootstrap Distribution')
ax1.legend()
# 2. Q-Q plot against normal
ax2 = axes[1]
from scipy import stats
stats.probplot(boot_stats, dist="norm", plot=ax2)
ax2.set_title('Q-Q Plot (vs. Normal)')
# 3. Running CI stability
ax3 = axes[2]
cumulative_lower = []
cumulative_upper = []
check_points = range(100, n_bootstrap + 1, 100)
for b in check_points:
ci = np.percentile(boot_stats[:b], [2.5, 97.5])
cumulative_lower.append(ci[0])
cumulative_upper.append(ci[1])
ax3.plot(check_points, cumulative_lower, label='Lower bound')
ax3.plot(check_points, cumulative_upper, label='Upper bound')
ax3.axhline(np.percentile(boot_stats, 2.5), color='grey', linestyle='--', alpha=0.5)
ax3.axhline(np.percentile(boot_stats, 97.5), color='grey', linestyle='--', alpha=0.5)
ax3.set_xlabel('Number of Bootstrap Samples')
ax3.set_ylabel('CI Bound')
ax3.set_title('CI Stability')
ax3.legend()
plt.tight_layout()
return fig
# Example
np.random.seed(42)
revenue = np.random.lognormal(3, 1.5, 500)
fig = plot_bootstrap_diagnostics(revenue, np.mean, 'Mean Revenue ($)', n_bootstrap=5000)
plt.show()
Best Practices Checklist
Before Bootstrapping
- Sample size adequate (n > 50, ideally > 100)
- Sample represents population (no selection bias)
- Statistic is smooth (not max/min/mode)
- Data is independent (or account for clustering)
During Bootstrapping
- Use enough resamples (5000+ for CI, 10000+ for p-values)
- Use BCa for skewed statistics
- Resample at the correct level (users, not individual observations, when data are clustered; see the sketch after this list)
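A minimal sketch of user-level (cluster) resampling, assuming a hypothetical events table with user_id and revenue columns (both names are illustrative):
import numpy as np
import pandas as pd
def cluster_bootstrap_mean(df, cluster_col, value_col, n_bootstrap=5000, seed=0):
    """Bootstrap the mean by resampling whole clusters (e.g., users),
    preserving the within-user correlation that observation-level
    resampling would break."""
    rng = np.random.default_rng(seed)
    groups = {uid: g[value_col].to_numpy() for uid, g in df.groupby(cluster_col)}
    ids = np.array(list(groups.keys()))
    boot_means = np.empty(n_bootstrap)
    for b in range(n_bootstrap):
        sampled = rng.choice(ids, size=len(ids), replace=True)
        boot_means[b] = np.concatenate([groups[uid] for uid in sampled]).mean()
    return boot_means
# Hypothetical usage: one row per event, clustered by user
# boot = cluster_bootstrap_mean(events, 'user_id', 'revenue')
# ci = np.percentile(boot, [2.5, 97.5])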
After Bootstrapping
- Visualize bootstrap distribution (should be smooth)
- Check CI stability across resample sizes
- Compare with other methods as a sanity check (a sketch follows this list)
- Report bootstrap SE and CI method used
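As one such sanity check, here's a sketch comparing the percentile bootstrap CI with a plain normal-approximation CI on the same sample (it reuses simple_bootstrap from above); large disagreement flags skew worth a closer look:
import numpy as np
from scipy import stats
np.random.seed(42)
revenue = np.random.lognormal(3, 1.5, 500)
boot_means = simple_bootstrap(revenue, np.mean, n_bootstrap=5000)
boot_ci = np.percentile(boot_means, [2.5, 97.5])
# Normal approximation: mean +/- z * SE
se = revenue.std(ddof=1) / np.sqrt(len(revenue))
norm_ci = revenue.mean() + np.array([-1, 1]) * stats.norm.ppf(0.975) * se
print(f"Percentile bootstrap 95% CI: (${boot_ci[0]:.2f}, ${boot_ci[1]:.2f})")
print(f"Normal-approximation 95% CI: (${norm_ci[0]:.2f}, ${norm_ci[1]:.2f})")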
R Implementation
library(boot)
# Basic bootstrap
boot_fn <- function(data, indices) {
mean(data[indices])
}
boot_result <- boot(revenue, boot_fn, R = 5000)
# BCa confidence interval
boot.ci(boot_result, type = "bca")
# Two-sample comparison
boot_diff <- function(data, indices) {
d <- data[indices, ]
mean(d$revenue[d$group == "treatment"]) -
mean(d$revenue[d$group == "control"])
}
combined <- data.frame(
revenue = c(control, treatment),
group = rep(c("control", "treatment"), c(length(control), length(treatment)))
)
boot_result <- boot(combined, boot_diff, R = 5000)
boot.ci(boot_result, type = "bca")
Related Methods
- Metric Distributions (Pillar) - Full distributions overview
- Winsorization and Trimming - Outlier handling
- Delta Method vs. Bootstrap - When to use which
- Confidence Intervals for Non-Normal Metrics - CI deep dive
Key Takeaway
Bootstrap is essential for heavy-tailed metrics—it handles skew naturally and makes no distributional assumptions. But bootstrap isn't magic: it needs adequate sample sizes, enough resamples (5000+ for heavy tails), and appropriate CI methods (BCa for skewed statistics). Always visualize your bootstrap distribution and check stability. When your sample is small or might miss rare extremes, bootstrap CIs will be too narrow—consider this when interpreting results. Used properly, bootstrap provides reliable inference for the difficult metrics that standard methods can't handle.
Frequently Asked Questions
How many bootstrap samples do I need?
Around 1000-2000 resamples are enough for standard errors and 95% CIs on well-behaved data. For heavy-tailed metrics, use 5000+ for CI bounds and 10000+ for p-values (see the guidelines table above).
Does bootstrap work for any distribution?
It makes no parametric assumptions, but it still fails for non-smooth statistics (max, min, mode), for extreme quantiles with small samples, and whenever the sample is too small or unrepresentative to contain the tail behavior.
What's the difference between percentile and BCa bootstrap?
The percentile method takes raw quantiles of the bootstrap distribution. BCa adjusts those quantiles for bias (bootstrap center vs. original estimate) and acceleration (SE changing with the parameter), which gives better coverage for skewed statistics.