
Minimum Detectable Effect and Sample Size: A Practical Guide

Learn how to calculate the minimum detectable effect for your A/B test, determine required sample sizes, and understand the tradeoffs between statistical power and practical constraints.


Quick Hits

  • Sample size scales with the inverse square of the MDE: halving your MDE requires 4x the sample
  • Calculate your achievable MDE before running a test, not after
  • An underpowered test is worse than no test—you'll get inconclusive results and waste time
  • 80% power is standard, but consider 90% for high-stakes decisions

TL;DR

Before running any A/B test, you need to know two things: how many users you'll have, and what's the smallest effect you could reliably detect with that many users. That smallest detectable effect is your Minimum Detectable Effect (MDE). If your MDE is larger than any realistic impact of your change, the test isn't worth running.


The Fundamental Relationship

Four quantities govern experiment design:

  1. Baseline rate (p₀): Your current conversion rate or metric mean
  2. Minimum Detectable Effect (MDE): The smallest effect you want to detect
  3. Significance level (α): False positive rate, typically 0.05
  4. Power (1-β): Probability of detecting a real effect, typically 0.80

Given any three, you can calculate the fourth (a quick sketch follows this list). In practice:

  • α and power are typically fixed at 0.05 and 0.80
  • Either you have a target MDE and need to find sample size, or
  • You have a fixed sample and need to find your achievable MDE
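
If you use statsmodels, its power classes make this concrete: leave exactly one of effect size, sample size, alpha, or power unset and solve_power fills it in. A minimal sketch using the 5% → 5.5% example from later in this guide:

from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

# Standardized effect size (Cohen's h) for a 5% -> 5.5% conversion change
effect_size = proportion_effectsize(0.05, 0.055)

analysis = NormalIndPower()

# Fix effect size, alpha, and power; solve for sample size per group
n = analysis.solve_power(effect_size=effect_size, alpha=0.05, power=0.80,
                         alternative='two-sided')

# Fix effect size, alpha, and sample size; solve for achieved power
achieved_power = analysis.solve_power(effect_size=effect_size, alpha=0.05,
                                      nobs1=10000, alternative='two-sided')

print(f"n per group: {n:,.0f}; power at n=10,000 per group: {achieved_power:.0%}")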

The Sample Size Formula (for proportions)

For comparing two proportions with equal group sizes:

$$n = \frac{2 \cdot (z_{1-\alpha/2} + z_{1-\beta})^2 \cdot p(1-p)}{(p_1 - p_0)^2}$$

Where:

  • $n$ = sample size per group
  • $z_{1-\alpha/2}$ ≈ 1.96 for α = 0.05 (two-sided)
  • $z_{1-\beta}$ ≈ 0.84 for 80% power
  • $p$ = pooled proportion ≈ $(p_0 + p_1)/2$
  • $(p_1 - p_0)$ = absolute effect (your MDE)

The Critical Insight: Quadratic Relationship

Sample size scales with the square of precision. Detecting half the effect requires four times the sample:

Relative MDE  | Sample Multiplier
1x (baseline) | 1x
0.5x          | 4x
0.25x         | 16x
0.1x          | 100x

This quadratic relationship is why small effects require enormous samples.
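
Both the formula and the quadratic scaling are easy to verify directly. The sketch below implements the closed-form calculation above (normal approximation with the pooled proportion); the library-based versions in the next section give nearly identical numbers.

import numpy as np
from scipy import stats

def n_per_group(p0, mde, alpha=0.05, power=0.80):
    """Closed-form per-group sample size for comparing two proportions."""
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    z_beta = stats.norm.ppf(power)
    p_bar = p0 + mde / 2  # pooled proportion (p0 + p1) / 2
    return int(np.ceil(2 * (z_alpha + z_beta) ** 2 * p_bar * (1 - p_bar) / mde ** 2))

# Halving the MDE roughly quadruples the required sample
print(n_per_group(0.05, 0.005))   # detect 5% -> 5.5%:  ~31,000 per group
print(n_per_group(0.05, 0.0025))  # detect 5% -> 5.25%: roughly 4x as many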


Calculating Sample Size

For Conversion Rates (Binary Metrics)

import numpy as np
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

def sample_size_for_proportions(baseline_rate, relative_lift, alpha=0.05, power=0.80):
    """
    Calculate required sample size per group for a proportion test.

    Args:
        baseline_rate: Current conversion rate (e.g., 0.05 for 5%)
        relative_lift: Relative change to detect (e.g., 0.10 for 10% lift)
        alpha: Significance level (default 0.05)
        power: Statistical power (default 0.80)

    Returns:
        Required sample size per group
    """
    # Calculate absolute change
    absolute_lift = baseline_rate * relative_lift
    new_rate = baseline_rate + absolute_lift

    # Cohen's h effect size for proportions
    effect_size = proportion_effectsize(baseline_rate, new_rate)

    # Calculate sample size
    analysis = NormalIndPower()
    n = analysis.solve_power(
        effect_size=effect_size,
        alpha=alpha,
        power=power,
        alternative='two-sided'
    )

    return int(np.ceil(n))


# Example: 5% baseline, want to detect 10% relative lift
baseline = 0.05
relative_lift = 0.10  # detect 5% → 5.5%

n_per_group = sample_size_for_proportions(baseline, relative_lift)
print(f"Sample needed per group: {n_per_group:,}")
# Output: Sample needed per group: 31,234

For Continuous Metrics (Revenue, Time)

import numpy as np
from statsmodels.stats.power import TTestIndPower

def sample_size_for_means(baseline_mean, baseline_std, relative_lift, alpha=0.05, power=0.80):
    """
    Calculate required sample size per group for a t-test.

    Args:
        baseline_mean: Current metric mean
        baseline_std: Current metric standard deviation
        relative_lift: Relative change to detect
        alpha: Significance level
        power: Statistical power

    Returns:
        Required sample size per group
    """
    absolute_lift = baseline_mean * relative_lift
    effect_size = absolute_lift / baseline_std  # Cohen's d

    analysis = TTestIndPower()
    n = analysis.solve_power(
        effect_size=effect_size,
        alpha=alpha,
        power=power,
        alternative='two-sided'
    )

    return int(np.ceil(n))


# Example: mean revenue $50, std $100, want to detect 5% lift
n_per_group = sample_size_for_means(
    baseline_mean=50,
    baseline_std=100,
    relative_lift=0.05
)
print(f"Sample needed per group: {n_per_group:,}")

R Implementation

# For proportions
library(pwr)

baseline <- 0.05
new_rate <- 0.055  # 10% relative lift

# Cohen's h effect size
h <- 2 * asin(sqrt(new_rate)) - 2 * asin(sqrt(baseline))

result <- pwr.2p.test(h = h, sig.level = 0.05, power = 0.80)
print(ceiling(result$n))

# For means
effect_size <- 2.5 / 100  # $2.50 lift / $100 std = Cohen's d of 0.025
result <- pwr.t.test(d = effect_size, sig.level = 0.05, power = 0.80, type = "two.sample")
print(ceiling(result$n))

Calculating Achievable MDE

More often, you have a fixed traffic budget and need to know what you can detect:

import numpy as np
from scipy import stats

def achievable_mde_proportions(baseline_rate, n_per_group, alpha=0.05, power=0.80):
    """
    Calculate the minimum detectable effect given a fixed sample size.

    Args:
        baseline_rate: Current conversion rate
        n_per_group: Available sample per group
        alpha: Significance level
        power: Statistical power

    Returns:
        Absolute MDE (add to baseline for new rate)
    """
    z_alpha = stats.norm.ppf(1 - alpha/2)
    z_beta = stats.norm.ppf(power)

    # Approximate formula for MDE
    se = np.sqrt(2 * baseline_rate * (1 - baseline_rate) / n_per_group)
    mde = (z_alpha + z_beta) * se

    return mde


# Example: 5% baseline, 10,000 users per group
baseline = 0.05
n = 10000

mde = achievable_mde_proportions(baseline, n)
relative_mde = mde / baseline

print(f"Achievable MDE: {mde:.4f} ({relative_mde:.1%} relative)")
# Output: Achievable MDE: 0.0086 (17.3% relative)

Common Sample Size Scenarios

Here's a reference table of approximate per-group sample sizes for conversion rate tests (α = 0.05, power = 80%), computed with the pooled-variance formula above; a snippet for regenerating it with your own numbers follows the key observations below.

Baseline Rate | 5% Relative Lift | 10% Relative Lift | 20% Relative Lift
1%            | 637,000          | 163,000           | 42,700
2%            | 315,000          | 80,700            | 21,100
5%            | 122,000          | 31,200            | 8,200
10%           | 57,800           | 14,800            | 3,800
20%           | 25,600           | 6,500             | 1,700
50%           | 6,300            | 1,600             | 390

Key observations:

  • Lower baseline rates require dramatically more sample
  • A 5% relative lift requires 4x the sample of a 10% relative lift
  • High-conversion metrics (50% baseline) are much easier to test
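
To regenerate this table for your own baselines or different lifts, loop over the sample_size_for_proportions helper defined earlier. Exact values differ slightly between calculators depending on whether they use Cohen's h or the pooled-variance formula, but not by enough to change any decision.

# Rebuild the reference table (per-group sample sizes) with the helper above
for baseline in [0.01, 0.02, 0.05, 0.10, 0.20, 0.50]:
    row = [sample_size_for_proportions(baseline, lift) for lift in (0.05, 0.10, 0.20)]
    print(f"{baseline:>4.0%}: " + "  ".join(f"{n:>9,}" for n in row))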

The Pre-Test Checklist

Before launching any experiment:

1. Know Your Traffic

How many eligible users will enter the experiment per day/week? Be conservative—actual traffic is often lower than expected due to:

  • Eligibility criteria
  • Technical issues
  • Seasonal variation

2. Set Your Runtime

How long can you run the test? Consider:

  • Product cycles and launch pressure
  • Weekly and seasonal patterns (run at least one full week)
  • Stakeholder patience (how long they will wait for an answer)

3. Calculate Your MDE

With traffic × runtime = total sample, what's your achievable MDE?

# Example calculation
daily_traffic = 5000  # users per day entering experiment
test_duration_days = 14  # two weeks
allocation = 0.50  # 50% to each variant

total_sample = daily_traffic * test_duration_days
n_per_group = total_sample * allocation

mde = achievable_mde_proportions(baseline_rate=0.05, n_per_group=n_per_group)
print(f"With {n_per_group:,.0f} users per group, MDE is {mde/0.05:.1%} relative")

4. Sanity Check Against Expected Effect

Is your MDE smaller than the effect you plausibly expect? If you're changing a button color and your achievable MDE is a 20% relative lift, but no button-color change in your history has ever moved the needle by more than 5%, don't run the test.

5. Document Your Decision

If you proceed: pre-register your MDE, runtime, and decision criteria. If you don't: document why, and what would need to change.


Adjusting When Power Is Low

If your achievable MDE is too large, you have several options:

Increase Sample Size

  • Run longer: More days = more users
  • Broader targeting: Test on more pages or user segments
  • Higher allocation: send more of your eligible traffic into the experiment, and keep the split at 50/50, which maximizes power for a fixed total sample

Reduce Variance (CUPED)

Variance reduction techniques like CUPED can cut required sample size by 50% or more when you have strong pre-experiment covariates. See CUPED and Variance Reduction.
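
As a rough planning heuristic, CUPED reduces metric variance by a factor of (1 − ρ²), where ρ is the correlation between the pre-experiment covariate and the experiment metric, so the required sample shrinks by roughly the same factor. A back-of-the-envelope sketch (the 0.7 correlation is an assumed value for illustration):

# Planning heuristic: CUPED cuts variance (and required sample) by ~(1 - rho^2)
n_without_cuped = 31_234   # per-group sample from the earlier 5% -> 5.5% example
rho = 0.7                  # assumed covariate/metric correlation (illustrative)

n_with_cuped = n_without_cuped * (1 - rho ** 2)
print(f"~{n_with_cuped:,.0f} per group instead of {n_without_cuped:,}")
# A correlation of 0.7 cuts the requirement roughly in half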

Accept Lower Power

Sometimes 70% power is acceptable, especially for low-stakes decisions. Just be explicit about the tradeoff.

Accept Larger MDE

If you truly believe the effect—if real—would be large, a larger MDE might be fine. A feature expected to move conversion by 30% doesn't need to detect 5% effects.

Use Composite Metrics

Combining multiple metrics into a single index can increase power by capturing more of the treatment effect.
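
One common construction (an illustrative sketch, not the only option) is to z-score each component metric and average them into a single index per user, then analyze the index like any other continuous metric:

import numpy as np

def composite_index(metrics):
    """Average of z-scored component metrics; metrics has shape (n_users, n_metrics)."""
    metrics = np.asarray(metrics, dtype=float)
    z = (metrics - metrics.mean(axis=0)) / metrics.std(axis=0)
    return z.mean(axis=1)  # one composite score per user

# Illustrative only: two engagement metrics (sessions, seconds on site) for five users
scores = composite_index([[1, 200], [0, 180], [2, 260], [1, 210], [3, 300]])
print(scores)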

Don't Run the Test

Sometimes the right answer is "this isn't testable with current traffic." Ship based on qualitative reasoning, or find a higher-traffic surface.


Post-Test: Interpreting Non-Significant Results

A non-significant result doesn't prove no effect exists. It means one of:

  1. No effect exists
  2. An effect exists but your test was underpowered to detect it

To distinguish, look at your confidence interval:

from statsmodels.stats.proportion import confint_proportions_2indep

# Example: 520/10,000 conversions in treatment vs 510/10,000 in control.
# The default compare='diff' returns a CI for p1 - p2, so pass the
# treatment group first to get a CI for the lift (treatment - control).
ci_low, ci_high = confint_proportions_2indep(
    count1=520, nobs1=10000,   # treatment
    count2=510, nobs2=10000    # control
)

print(f"95% CI for lift: [{ci_low:.4f}, {ci_high:.4f}]")
# If CI is [-0.8%, +1.0%], you can rule out large effects
# If CI is [-5%, +10%], you've learned almost nothing

A tight CI around zero is informative: no large effect exists. A wide CI that includes both large positive and negative effects means your test was underpowered.



Frequently Asked Questions

Q: What power level should I use? A: 80% power is the industry standard. For high-stakes decisions, consider 90% (requires ~30% more sample). Below 70%, you're likely wasting time with inconclusive results.

Q: Should I use one-sided or two-sided tests for sample size? A: Use two-sided. You should care if a change makes things worse, and one-sided calculations give misleadingly optimistic sample sizes.

Q: How do I account for multiple variants? A: With k treatment variants plus a control, total sample scales with the number of groups, and correcting for multiple comparisons (e.g., Bonferroni) raises the per-group requirement further. Consider whether you really need all those variants.

Q: What about ratio metrics (like revenue per user)? A: Ratio metrics require special treatment—their variance isn't straightforward to calculate. Use bootstrap or delta method for sample size estimation.

Q: My baseline rate changes over time. Which should I use? A: Use a recent, stable estimate. If rates are trending, your test results will be confounded anyway—consider whether it's the right time to test.

Q: Can I run a test with low power just to see? A: Generally a bad idea. Underpowered tests often show no significant effect even when a real effect exists, which creates false confidence that the change doesn't matter.


Key Takeaway

Before running any A/B test, calculate your achievable MDE given your traffic. If that MDE is larger than any plausible effect of your change, don't run the test—find a higher-traffic surface, use variance reduction techniques, or accept that the change isn't testable. An underpowered test wastes everyone's time with inconclusive results.


References

  1. https://www.evanmiller.org/ab-testing/sample-size.html
  2. https://www.stat.ubc.ca/~rollin/stats/ssize/b2.html
  3. Cohen, J. (1988). *Statistical Power Analysis for the Behavioral Sciences* (2nd ed.). Lawrence Erlbaum Associates.
  4. Deng, A., Xu, Y., Kohavi, R., & Walker, T. (2013). Improving the Sensitivity of Online Controlled Experiments by Utilizing Pre-Experiment Data. *WSDM '13*.
  5. van Belle, G. (2008). *Statistical Rules of Thumb* (2nd ed.). Wiley.

