
Power Analysis Without Cargo Culting: Traps and Practical Heuristics

A practical guide to statistical power analysis that avoids common pitfalls. Learn when standard power calculations mislead, how to think about sample size decisions, and practical heuristics for real-world experimentation.


Quick Hits

  • Power = probability of detecting an effect that actually exists
  • 80% power means 20% chance of missing a real effect - that's often too high
  • The hardest part is choosing the effect size - everything else is arithmetic
  • Post-hoc power analysis is meaningless - don't do it
  • Consider precision (CI width) as an alternative planning framework

TL;DR

Power analysis helps you determine sample sizes needed to detect effects of a given size. But standard power calculations are often cargo-culted—performed ritualistically without understanding what they mean. The critical input is choosing an effect size that matters for your decision, not using conventions like "small effect = 0.2." Post-hoc power is meaningless. Consider precision-based planning as an alternative. This article provides practical guidance for sample size decisions in real experiments.


What Power Actually Means

Statistical power is the probability of detecting an effect when one truly exists.

If your experiment has 80% power:

  • There's an 80% chance you'll get p < 0.05 if the true effect equals your assumed effect size (and a higher chance if it's larger)
  • There's a 20% chance you'll "miss" the effect (Type II error / false negative)

The power equation has four connected parameters:

  1. Effect size: How big is the true effect?
  2. Sample size: How many observations?
  3. Alpha level: Your false positive threshold (usually 0.05)
  4. Power: Your target true positive rate (usually 0.80)

Fix any three, and the fourth is determined.
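
As a quick illustration, statsmodels makes this trade-off explicit: leave exactly one of the four quantities unspecified and solve_power returns it. A minimal sketch (the numbers here are illustrative):

from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Fix effect size, alpha, and power -> solve for sample size per group
n_per_group = analysis.solve_power(effect_size=0.2, alpha=0.05, power=0.80)

# Fix sample size, alpha, and power -> solve for the detectable effect size
detectable_d = analysis.solve_power(effect_size=None, nobs1=400, alpha=0.05, power=0.80)

print(round(n_per_group), round(detectable_d, 3))  # roughly 394 per group, d of about 0.198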


The Core Problem: Choosing Effect Size

Power calculations are straightforward arithmetic. The hard part—and where most analyses go wrong—is choosing the effect size to plan for.

Bad Approach: Cohen's Conventions

Cohen (1988) suggested "small" (d=0.2), "medium" (d=0.5), and "large" (d=0.8) effect sizes. These were meant as rough guidelines when you have no domain knowledge.

Why this fails:

  • A d=0.2 effect on revenue might be worth millions
  • A d=0.5 effect on a vanity metric might be worthless
  • Effect sizes don't come labeled—you need to define what matters

Better Approach: Minimum Detectable Effect (MDE)

Instead of asking "what effect do I expect?", ask:

"What's the smallest effect that would change my decision?"

This is your MDE—the practical significance threshold below which you wouldn't take action anyway.

Example:

  • Implementing the feature costs $500K in engineering time
  • Need at least $600K annual revenue lift to justify
  • Current revenue: $10M
  • MDE: 6% revenue increase
  • Plan your experiment to detect 6% with high confidence

Common Power Analysis Traps

Trap 1: Using Expected Effect Size

Mistake: "We expect a 10% lift, so we'll power for 10%"

Problem: If you're wrong about your expectation, you won't detect smaller real effects.

Fix: Power for the smallest effect worth detecting, not your best guess of what will happen.

Trap 2: Post-Hoc Power Analysis

Mistake: After a non-significant result, calculating "we only had 30% power"

Problem: Post-hoc power is a deterministic function of your p-value. If p=0.05, observed power is always ~50%. It adds zero information.

What post-hoc power tells you:

  • If p is high, observed power is low
  • If p is low, observed power is high
  • This is mathematical tautology, not insight

Fix: Report confidence intervals instead. "We can rule out effects larger than X."

# Post-hoc power is meaningless - here's why
import scipy.stats as stats

# Suppose you observe exactly p = 0.05 (two-sided), i.e. z_obs = 1.96
z_crit = 1.96  # critical value at alpha = 0.05
z_obs = 1.96   # observed test statistic
# "Observed power" plugs the observed estimate back in as the true effect
observed_power = 1 - stats.norm.cdf(z_crit - z_obs)  # always 0.5 when p = alpha
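
For contrast, here is what reporting the interval looks like. The numbers below are purely illustrative (a non-significant difference of 0.8 with a standard error of 0.6):

# Report what the data rule out instead of post-hoc power (illustrative numbers)
from scipy import stats

diff = 0.8  # observed, non-significant difference
se = 0.6    # its standard error
z = stats.norm.ppf(0.975)
ci_low, ci_high = diff - z * se, diff + z * se
print(f"95% CI: [{ci_low:.2f}, {ci_high:.2f}]")
# -> roughly [-0.38, 1.98]: effects larger than ~2 are inconsistent with the data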

Trap 3: Treating 80% as Sacred

Mistake: Always using 80% power because it's conventional.

Problem: 80% power = 20% miss rate. For important decisions, that's often too high.

Consider:

  • High-stakes decision → 90-95% power
  • Exploratory analysis → 80% may be fine
  • Expensive intervention → power for practical significance threshold, not just statistical

Trap 4: Ignoring Precision

Mistake: Only thinking about statistical significance, not estimation precision.

Problem: You might detect an effect is non-zero without knowing if it's meaningful.

Fix: Plan for confidence interval width, not just p < 0.05.

Trap 5: One-Time Calculation

Mistake: Running power analysis once before the experiment and never revisiting.

Problem: Real experiments have noise, dropouts, and surprises.

Fix: Monitor effective sample size during the experiment. Build in buffer (10-20% extra).


Practical Power Analysis

Step 1: Define Your MDE in Business Terms

Ask:

  • What decision will this experiment inform?
  • What's the cost of implementing vs. not implementing?
  • Below what effect size would we not take action?

Example calculation:

Implementation cost: $200K
Required ROI: 2x in first year
Annual baseline: $5M revenue
MDE needed: $400K / $5M = 8% revenue lift
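
A quick sanity check of that arithmetic, using the example's numbers only:

implementation_cost = 200_000
required_return = 2 * implementation_cost  # 2x ROI in the first year
annual_baseline = 5_000_000

mde = required_return / annual_baseline
print(f"MDE: {mde:.0%}")  # 8% revenue lift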

Step 2: Convert to Standardized Effect Size (If Needed)

For comparing means, Cohen's d relates to your raw effect:

$$d = \frac{\text{MDE}}{\text{Standard Deviation}}$$

Example:

  • MDE: 5% lift on a metric with mean=100
  • Raw MDE: 5 units
  • Historical SD: 25 units
  • d = 5/25 = 0.2

Step 3: Calculate Sample Size

For a two-sample t-test at α=0.05:

Power   n per group (d=0.2)   n per group (d=0.5)   n per group (d=0.8)
80%     393                   64                    26
90%     527                   86                    34
95%     651                   105                   42
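
If you want to reproduce the table yourself, a short statsmodels sketch like this gets you there (results can differ from the table by ±1 depending on whether a normal or t approximation is used):

import numpy as np
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
for power in (0.80, 0.90, 0.95):
    row = [
        int(np.ceil(analysis.solve_power(effect_size=d, alpha=0.05, power=power)))
        for d in (0.2, 0.5, 0.8)
    ]
    print(f"{power:.0%}: {row}")  # per-group sample sizes for d = 0.2, 0.5, 0.8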

Step 4: Reality Check

  • Can you actually get this sample?
  • How long will it take?
  • Is the experiment worth running at this sample size?
  • What happens if the true effect is half your MDE?

Code: Power Analysis

Python

import numpy as np
from scipy import stats
from statsmodels.stats.power import TTestIndPower, NormalIndPower


def sample_size_for_means(
    mde: float,
    baseline_sd: float,
    alpha: float = 0.05,
    power: float = 0.80,
    ratio: float = 1.0
) -> dict:
    """
    Calculate sample size needed to detect a difference in means.

    Parameters:
    -----------
    mde : float
        Minimum detectable effect in raw units
    baseline_sd : float
        Standard deviation of the metric
    alpha : float
        Significance level (default 0.05)
    power : float
        Desired power (default 0.80)
    ratio : float
        Ratio of treatment to control size (default 1.0 = equal)

    Returns:
    --------
    dict with sample sizes and effect size
    """
    effect_size = mde / baseline_sd  # Cohen's d

    analysis = TTestIndPower()
    n_per_group = analysis.solve_power(
        effect_size=effect_size,
        alpha=alpha,
        power=power,
        ratio=ratio,
        alternative='two-sided'
    )

    return {
        'effect_size_d': effect_size,
        'n_control': int(np.ceil(n_per_group)),
        'n_treatment': int(np.ceil(n_per_group * ratio)),
        'n_total': int(np.ceil(n_per_group * (1 + ratio))),
        'power': power,
        'alpha': alpha
    }


def sample_size_for_proportions(
    baseline_rate: float,
    mde_absolute: float,
    alpha: float = 0.05,
    power: float = 0.80
) -> dict:
    """
    Calculate sample size for detecting a difference in proportions.

    Parameters:
    -----------
    baseline_rate : float
        Current conversion rate (e.g., 0.05 for 5%)
    mde_absolute : float
        Minimum detectable effect in absolute terms (e.g., 0.01 for 1pp)
    alpha : float
        Significance level
    power : float
        Desired power

    Returns:
    --------
    dict with sample sizes
    """
    p1 = baseline_rate
    p2 = baseline_rate + mde_absolute

    # Effect size for proportions (Cohen's h)
    h = 2 * np.arcsin(np.sqrt(p2)) - 2 * np.arcsin(np.sqrt(p1))

    analysis = NormalIndPower()
    n_per_group = analysis.solve_power(
        effect_size=h,
        alpha=alpha,
        power=power,
        alternative='two-sided'
    )

    return {
        'baseline_rate': p1,
        'target_rate': p2,
        'mde_relative': (p2 - p1) / p1 * 100,
        'effect_size_h': h,
        'n_per_group': int(np.ceil(n_per_group)),
        'n_total': int(np.ceil(n_per_group * 2)),
        'power': power,
        'alpha': alpha
    }


def power_curve(
    n_per_group: int,
    alpha: float = 0.05,
    effect_sizes: np.ndarray = None
) -> dict:
    """
    Calculate power across a range of effect sizes.

    Useful for understanding sensitivity of your design.
    """
    if effect_sizes is None:
        effect_sizes = np.linspace(0.05, 1.0, 20)

    analysis = TTestIndPower()
    powers = [
        analysis.power(effect_size=d, nobs1=n_per_group, alpha=alpha, ratio=1.0)
        for d in effect_sizes
    ]

    return {
        'effect_sizes': effect_sizes,
        'powers': powers,
        'mde_at_80': effect_sizes[np.argmin(np.abs(np.array(powers) - 0.8))]
    }


# Example: Planning an A/B test
if __name__ == "__main__":
    # Scenario: Testing new checkout flow
    # Baseline conversion: 3.2%
    # Want to detect 0.5 percentage point increase (to 3.7%)
    # That's a ~15% relative lift

    result = sample_size_for_proportions(
        baseline_rate=0.032,
        mde_absolute=0.005,  # 0.5 percentage points
        alpha=0.05,
        power=0.80
    )

    print("Sample Size Calculation")
    print("=" * 40)
    print(f"Baseline rate: {result['baseline_rate']:.1%}")
    print(f"Target rate: {result['target_rate']:.1%}")
    print(f"Relative lift: {result['mde_relative']:.1f}%")
    print(f"Effect size (h): {result['effect_size_h']:.3f}")
    print(f"Sample per group: {result['n_per_group']:,}")
    print(f"Total sample: {result['n_total']:,}")

R

library(pwr)

sample_size_for_means <- function(
    mde,
    baseline_sd,
    alpha = 0.05,
    power = 0.80,
    ratio = 1.0
) {
    #' Calculate sample size for detecting difference in means

    effect_size <- mde / baseline_sd

    # pwr.t.test solves for equal allocation; unequal allocation is approximated below
    if (ratio == 1.0) {
        result <- pwr.t.test(
            d = effect_size,
            sig.level = alpha,
            power = power,
            type = "two.sample"
        )
        n_per_group <- ceiling(result$n)
    } else {
        # Approximation: solve for equal allocation, then scale the treatment
        # arm by the allocation ratio (not an exact unequal-n solve)
        result <- pwr.t.test(
            d = effect_size,
            sig.level = alpha,
            power = power,
            type = "two.sample"
        )
        n_control <- ceiling(result$n)
        n_treatment <- ceiling(result$n * ratio)
        n_per_group <- n_control
    }

    list(
        effect_size_d = effect_size,
        n_per_group = n_per_group,
        n_total = n_per_group * (1 + ratio),
        power = power,
        alpha = alpha
    )
}


sample_size_for_proportions <- function(
    baseline_rate,
    mde_absolute,
    alpha = 0.05,
    power = 0.80
) {
    #' Calculate sample size for detecting difference in proportions

    p1 <- baseline_rate
    p2 <- baseline_rate + mde_absolute

    # Cohen's h effect size for proportions
    h <- 2 * asin(sqrt(p2)) - 2 * asin(sqrt(p1))

    result <- pwr.2p.test(
        h = h,
        sig.level = alpha,
        power = power
    )

    list(
        baseline_rate = p1,
        target_rate = p2,
        mde_relative = (p2 - p1) / p1 * 100,
        effect_size_h = h,
        n_per_group = ceiling(result$n),
        n_total = ceiling(result$n) * 2,
        power = power,
        alpha = alpha
    )
}


# Precision-based planning
sample_size_for_precision <- function(
    baseline_sd,
    desired_ci_width,
    confidence = 0.95
) {
    #' Calculate sample size to achieve desired CI width

    z <- qnorm(1 - (1 - confidence) / 2)
    margin <- desired_ci_width / 2

    # For difference of two means: SE = SD * sqrt(2/n)
    # margin = z * SD * sqrt(2/n)
    # n = 2 * (z * SD / margin)^2

    n_per_group <- ceiling(2 * (z * baseline_sd / margin)^2)

    list(
        n_per_group = n_per_group,
        n_total = n_per_group * 2,
        expected_ci_width = desired_ci_width
    )
}


# Example usage
cat("Sample Size for Proportions\n")
cat(strrep("=", 40), "\n")

result <- sample_size_for_proportions(
    baseline_rate = 0.032,
    mde_absolute = 0.005,
    alpha = 0.05,
    power = 0.80
)

cat(sprintf("Baseline rate: %.1f%%\n", result$baseline_rate * 100))
cat(sprintf("Target rate: %.1f%%\n", result$target_rate * 100))
cat(sprintf("Relative lift: %.1f%%\n", result$mde_relative))
cat(sprintf("Sample per group: %d\n", result$n_per_group))
cat(sprintf("Total sample: %d\n", result$n_total))

Precision-Based Planning Alternative

Instead of "Can I detect p < 0.05?", ask "How precise will my estimate be?"

The Framework

Traditional power question: "What sample size do I need to have 80% chance of p < 0.05 if the effect is at least d=0.2?"

Precision question: "What sample size do I need so my 95% CI for the effect is no wider than ±2 percentage points?"

Advantages of Precision Planning

  1. Decision-relevant: Stakeholders care about "is the effect between 5% and 15%?" not "is p < 0.05?"
  2. Works for null results: Narrow CI around zero is informative
  3. Avoids dichotomous thinking: No artificial significant/not-significant boundary
  4. Intuitive: "We'll know the effect to within ±2%" is easier to understand than "80% power at d=0.3"

Calculation

For estimating a mean difference with 95% CI:

$$n = 2 \times \left(\frac{1.96 \times \sigma}{\text{desired margin}}\right)^2$$

Example:

  • SD = 25 units
  • Want 95% CI within ±3 units
  • n = 2 × (1.96 × 25 / 3)² = 534 per group

import numpy as np
from scipy import stats


def sample_size_for_precision(
    baseline_sd: float,
    desired_ci_width: float,
    confidence: float = 0.95
) -> dict:
    """
    Calculate sample size to achieve desired confidence interval width.

    Parameters:
    -----------
    baseline_sd : float
        Standard deviation of the metric
    desired_ci_width : float
        Total CI width (e.g., for ±3, enter 6)
    confidence : float
        Confidence level (default 0.95)

    Returns:
    --------
    dict with sample size and expected precision
    """
    z = stats.norm.ppf(1 - (1 - confidence) / 2)
    margin = desired_ci_width / 2

    # For difference of two means
    n_per_group = int(np.ceil(2 * (z * baseline_sd / margin) ** 2))

    return {
        'n_per_group': n_per_group,
        'n_total': n_per_group * 2,
        'expected_ci_width': desired_ci_width,
        'expected_margin': margin
    }

Practical Heuristics

Heuristic 1: The "10% Buffer" Rule

Always collect 10-20% more sample than your calculation suggests:

  • Dropouts and invalid data happen
  • Variance estimates are uncertain
  • Better to have extra data than be underpowered

Heuristic 2: The "Half-Effect" Check

After calculating sample size for your MDE:

  • What's your power if the true effect is half your MDE?
  • If it's below 50%, you may be overconfident in your detection ability

Heuristic 3: The "Worth Running" Test

Before committing to an experiment:

  • Calculate how long it takes to get required sample
  • Calculate cost of running that long
  • Ask: "Is the decision worth this investment if the answer could be null?"

Heuristic 4: The "Precision Reasonableness" Check

Calculate what your CI width will be with your planned sample:

  • Is that precision useful for your decision?
  • Can you distinguish "meaningful positive" from "meaningful negative"?
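
A rough illustration of this check, assuming a two-group comparison of means (the planned sample size and SD below are illustrative):

import numpy as np

n_per_group = 400  # planned sample size
sd = 25            # historical standard deviation of the metric

# Expected 95% CI half-width for a difference of two means
margin = 1.96 * sd * np.sqrt(2 / n_per_group)
print(f"Expected 95% CI: estimate ± {margin:.1f} units")  # about ±3.5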

Heuristic 5: The "Worst Case" Planning

Plan for your MDE, but check:

  • What's the power for effects 50% larger? (Should be >90%)
  • What's the power for effects 50% smaller? (If <50%, you'll likely miss smaller real effects)
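
A minimal sketch of this check, assuming a standard two-sample t-test planned with statsmodels (the MDE and sample size are illustrative):

from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
mde_d = 0.2        # standardized MDE you planned for
n_per_group = 394  # sample size from the planning step

for multiplier in (0.5, 1.0, 1.5):
    p = analysis.power(effect_size=mde_d * multiplier, nobs1=n_per_group,
                       alpha=0.05, ratio=1.0)
    print(f"True effect = {multiplier:.1f}x MDE -> power {p:.0%}")
# Roughly: 0.5x -> ~29%, 1.0x -> ~80%, 1.5x -> ~99%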

When Power Analysis Breaks Down

Case 1: Sequential Testing

Traditional power analysis assumes fixed sample size. If you're doing sequential testing (checking results as data arrives), you need different calculations that account for multiple looks.

Case 2: Multiple Comparisons

Power for detecting "at least one" effect with multiple comparisons is different from power for any single comparison. Adjust accordingly.
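
One way to see the impact is to sketch it with a simple Bonferroni adjustment (the number of comparisons and sample size here are hypothetical):

from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
n_per_group, d, n_comparisons = 394, 0.2, 5

power_single = analysis.power(effect_size=d, nobs1=n_per_group, alpha=0.05)
power_adjusted = analysis.power(effect_size=d, nobs1=n_per_group,
                                alpha=0.05 / n_comparisons)
print(f"Unadjusted: {power_single:.0%}, Bonferroni-adjusted: {power_adjusted:.0%}")
# Roughly 80% unadjusted vs ~59% per comparison after adjustment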

Case 3: Complex Designs

Clustered experiments, stratified designs, and CUPED-adjusted analyses have different power characteristics. Use simulation when formulas don't exist.

Case 4: Highly Variable Metrics

For heavy-tailed metrics (revenue, etc.), standard power calculations may be optimistic. Consider:

  • Using robust methods (trimmed means)
  • Transforming the metric
  • Running simulations with realistic distributions

Case 5: Small True Effects

If your MDE is very small relative to variance, you may need enormous samples. Consider:

  • Is this effect worth detecting?
  • Can you reduce variance (CUPED, stratification)?
  • Should you focus on a subset with larger expected effect?

Simulation-Based Power Analysis

When formulas don't fit, simulate:

import numpy as np
from scipy import stats


def simulate_power(
    effect_size: float,
    n_per_group: int,
    sd: float = 1.0,
    alpha: float = 0.05,
    n_simulations: int = 10000,
    test_func=None
) -> float:
    """
    Estimate power via simulation.

    Useful for complex scenarios where formulas don't exist.
    """
    if test_func is None:
        test_func = lambda x, y: stats.ttest_ind(x, y).pvalue

    rejections = 0

    for _ in range(n_simulations):
        control = np.random.normal(0, sd, n_per_group)
        treatment = np.random.normal(effect_size, sd, n_per_group)

        p_value = test_func(control, treatment)
        if p_value < alpha:
            rejections += 1

    return rejections / n_simulations


# Example: Power for trimmed mean comparison
from scipy.stats import trim_mean


def trimmed_mean_test(x, y, proportiontocut=0.1):
    """Bootstrap test for trimmed mean difference."""
    n_boot = 1000
    diffs = []

    for _ in range(n_boot):
        x_boot = np.random.choice(x, len(x), replace=True)
        y_boot = np.random.choice(y, len(y), replace=True)
        diff = trim_mean(x_boot, proportiontocut) - trim_mean(y_boot, proportiontocut)
        diffs.append(diff)

    # Two-sided p-value: how far zero sits in the tails of the bootstrap
    # distribution of the trimmed-mean difference (percentile-CI inversion)
    diffs = np.array(diffs)
    p_value = 2 * min(np.mean(diffs <= 0), np.mean(diffs >= 0))

    return p_value


# Compare power: standard t-test vs trimmed mean
# for contaminated normal (heavy tails)
def contaminated_normal(n, contamination=0.1, scale=10):
    """Generate contaminated normal data (heavy tails)."""
    n_outliers = int(n * contamination)
    n_normal = n - n_outliers
    normal = np.random.normal(0, 1, n_normal)
    outliers = np.random.normal(0, scale, n_outliers)
    return np.concatenate([normal, outliers])


# In simulations like the sketch below, the trimmed-mean test typically has
# higher power than the standard t-test when the data are contaminated
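
Here is one way that comparison could be wired up with the functions above (my own illustrative setup, not a definitive benchmark); the nested bootstrap makes it slow, so keep n_simulations modest:

def compare_power_contaminated(effect, n_per_group, alpha=0.05, n_simulations=500):
    """Estimate power of the t-test vs the trimmed-mean test on heavy-tailed data."""
    reject_t = reject_trim = 0
    for _ in range(n_simulations):
        control = contaminated_normal(n_per_group)
        treatment = contaminated_normal(n_per_group) + effect
        if stats.ttest_ind(control, treatment).pvalue < alpha:
            reject_t += 1
        if trimmed_mean_test(control, treatment) < alpha:
            reject_trim += 1
    return reject_t / n_simulations, reject_trim / n_simulations


# Example (slow):
# t_power, trim_power = compare_power_contaminated(effect=0.5, n_per_group=100)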

Summary Table: Power Analysis Decisions

Situation                 Recommendation
Planning new experiment   Use MDE (smallest meaningful effect), not expected effect
Standard A/B test         Formula-based calculation is fine
Complex design            Use simulation
After null result         Report CI bounds, NOT post-hoc power
High-stakes decision      Power at 90%+, not just 80%
Exploratory analysis      80% power acceptable
Expensive intervention    Consider precision planning instead
Sequential testing        Use group sequential methods
Multiple comparisons      Account for family-wise error


Key Takeaway

Power analysis is a tool for planning, not a ritual for legitimizing sample sizes. The core question is simple: what effect would change your decision, and how confident do you need to be in detecting it? Base your MDE on business value, not statistical conventions. Don't perform post-hoc power calculations—they're mathematically meaningless. Consider precision-based planning as an alternative that focuses on estimation quality rather than significance testing. And remember: an underpowered experiment isn't just statistically weak—it's potentially a waste of resources.


References

  1. https://rpsychologist.com/d3/nhst/
  2. https://journals.sagepub.com/doi/10.1177/0956797613504966
  3. https://statmodeling.stat.columbia.edu/2018/09/24/dont-calculate-post-hoc-power-using-observed-estimate-effect-size/
  4. https://www.sciencedirect.com/science/article/pii/S0022103117307746
  5. Lakens, D. (2022). Sample size justification. *Collabra: Psychology*, 8(1), 33267.
  6. Gelman, A. (2018). Don't calculate post-hoc power using the observed estimate of effect size. *Statistical Modeling, Causal Inference, and Social Science* (blog).
  7. Simonsohn, U. (2015). Small telescopes: Detectability and the evaluation of replication results. *Psychological Science*, 26(5), 559-569.
  8. Maxwell, S. E., Kelley, K., & Rausch, J. R. (2008). Sample size planning for statistical power and accuracy in parameter estimation. *Annual Review of Psychology*, 59, 537-563.

Frequently Asked Questions

Why is 80% power the standard?
It's mostly convention from Jacob Cohen's work in the 1960s, chosen somewhat arbitrarily as a 'reasonable' balance. For important decisions, 80% may be too low—you're accepting a 20% chance of missing real effects. Consider 90% for high-stakes decisions.

Should I do a post-hoc power analysis after a non-significant result?
No. Post-hoc (observed) power is mathematically determined by your p-value and adds no information. If p = 0.05, observed power is always ~50% by definition. Instead, report confidence intervals showing what effect sizes you can rule out.

What effect size should I use for planning?
Use the smallest effect that would be practically meaningful to detect—your MDE. Don't use 'small/medium/large' conventions without relating them to business impact. Base it on what would change your decision, not what you expect to find.

Key Takeaway

Power analysis is a planning tool, not a ritual. Focus on the practical question: what's the smallest effect worth detecting, and how confident do you need to be in detecting it? Base sample size decisions on business value, not convention.
