Minimum Detectable Effect and Sample Size: A Practical Guide
Learn how to calculate the minimum detectable effect for your A/B test, determine required sample sizes, and understand the tradeoffs between statistical power and practical constraints.
Quick Hits
- MDE and sample size are inversely related—halving your MDE requires 4x the sample
- Calculate your achievable MDE before running a test, not after
- An underpowered test is worse than no test—you'll get inconclusive results and waste time
- 80% power is standard, but consider 90% for high-stakes decisions
TL;DR
Before running any A/B test, you need to know two things: how many users you'll have, and what's the smallest effect you could reliably detect with that many users. That smallest detectable effect is your Minimum Detectable Effect (MDE). If your MDE is larger than any realistic impact of your change, the test isn't worth running.
The Fundamental Relationship
Four quantities govern experiment design:
- Baseline rate (p₀): Your current conversion rate or metric mean
- Minimum Detectable Effect (MDE): The smallest effect you want to detect
- Significance level (α): False positive rate, typically 0.05
- Power (1-β): Probability of detecting a real effect, typically 0.80
Given any three, you can calculate the fourth. In practice:
- α and power are typically fixed at 0.05 and 0.80
- Either you have a target MDE and need to find sample size, or
- You have a fixed sample and need to find your achievable MDE
The Sample Size Formula (for proportions)
For comparing two proportions with equal group sizes:
$$n = \frac{2 \cdot (z_{1-\alpha/2} + z_{1-\beta})^2 \cdot p(1-p)}{(p_1 - p_0)^2}$$
Where:
- $n$ = sample size per group
- $z_{1-\alpha/2}$ ≈ 1.96 for α = 0.05 (two-sided)
- $z_{1-\beta}$ ≈ 0.84 for 80% power
- $p$ = pooled proportion ≈ $(p_0 + p_1)/2$
- $(p_1 - p_0)$ = absolute effect (your MDE)
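For example, detecting a 10% relative lift on a 5% baseline ($p_0 = 0.05$, $p_1 = 0.055$, so $p \approx 0.0525$) works out to:
$$n = \frac{2 \cdot (1.96 + 0.84)^2 \cdot 0.0525 \cdot (1 - 0.0525)}{(0.055 - 0.05)^2} \approx 31{,}200 \text{ per group}$$
which is consistent with the code example later in this guide.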
The Critical Insight: Quadratic Relationship
Sample size scales with the square of precision. Detecting half the effect requires four times the sample:
| Relative MDE | Sample Multiplier |
|---|---|
| 1x (baseline) | 1x |
| 0.5x | 4x |
| 0.25x | 16x |
| 0.1x | 100x |
This quadratic relationship is why small effects require enormous samples.
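A quick numerical check of this scaling, using the closed-form pooled-variance formula above (the small helper function here is just for illustration):
```python
# Illustrative check of the quadratic relationship: halving the MDE
# roughly quadruples the required sample per group.
from scipy import stats

def n_required(p0, p1, alpha=0.05, power=0.80):
    z = stats.norm.ppf(1 - alpha / 2) + stats.norm.ppf(power)
    p_bar = (p0 + p1) / 2
    return 2 * z**2 * p_bar * (1 - p_bar) / (p1 - p0) ** 2

print(round(n_required(0.05, 0.060)))  # ~8,200 per group  (MDE = 1.0 percentage point)
print(round(n_required(0.05, 0.055)))  # ~31,200 per group (MDE = 0.5 pp, roughly 4x)
```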
Calculating Sample Size
For Conversion Rates (Binary Metrics)
```python
import numpy as np
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

def sample_size_for_proportions(baseline_rate, relative_lift, alpha=0.05, power=0.80):
    """
    Calculate required sample size per group for a proportion test.

    Args:
        baseline_rate: Current conversion rate (e.g., 0.05 for 5%)
        relative_lift: Relative change to detect (e.g., 0.10 for 10% lift)
        alpha: Significance level (default 0.05)
        power: Statistical power (default 0.80)

    Returns:
        Required sample size per group
    """
    # Calculate absolute change
    absolute_lift = baseline_rate * relative_lift
    new_rate = baseline_rate + absolute_lift

    # Cohen's h effect size for proportions
    effect_size = proportion_effectsize(baseline_rate, new_rate)

    # Calculate sample size
    analysis = NormalIndPower()
    n = analysis.solve_power(
        effect_size=effect_size,
        alpha=alpha,
        power=power,
        alternative='two-sided'
    )
    return int(np.ceil(n))

# Example: 5% baseline, want to detect 10% relative lift
baseline = 0.05
relative_lift = 0.10  # detect 5% → 5.5%
n_per_group = sample_size_for_proportions(baseline, relative_lift)
print(f"Sample needed per group: {n_per_group:,}")
# Output: Sample needed per group: 31,234
```
For Continuous Metrics (Revenue, Time)
```python
import numpy as np
from statsmodels.stats.power import TTestIndPower

def sample_size_for_means(baseline_mean, baseline_std, relative_lift, alpha=0.05, power=0.80):
    """
    Calculate required sample size per group for a t-test.

    Args:
        baseline_mean: Current metric mean
        baseline_std: Current metric standard deviation
        relative_lift: Relative change to detect
        alpha: Significance level
        power: Statistical power

    Returns:
        Required sample size per group
    """
    absolute_lift = baseline_mean * relative_lift
    effect_size = absolute_lift / baseline_std  # Cohen's d

    analysis = TTestIndPower()
    n = analysis.solve_power(
        effect_size=effect_size,
        alpha=alpha,
        power=power,
        alternative='two-sided'
    )
    return int(np.ceil(n))

# Example: mean revenue $50, std $100, want to detect 5% lift
n_per_group = sample_size_for_means(
    baseline_mean=50,
    baseline_std=100,
    relative_lift=0.05
)
print(f"Sample needed per group: {n_per_group:,}")
```
R Implementation
```r
# For proportions
library(pwr)
baseline <- 0.05
new_rate <- 0.055 # 10% relative lift
# Cohen's h effect size
h <- 2 * asin(sqrt(new_rate)) - 2 * asin(sqrt(baseline))
result <- pwr.2p.test(h = h, sig.level = 0.05, power = 0.80)
print(ceiling(result$n))
# For means
effect_size <- 2.5 / 100 # $2.50 lift / $100 std = Cohen's d of 0.025
result <- pwr.t.test(d = effect_size, sig.level = 0.05, power = 0.80, type = "two.sample")
print(ceiling(result$n))
```
Calculating Achievable MDE
More often, you have a fixed traffic budget and need to know what you can detect:
```python
import numpy as np
from scipy import stats

def achievable_mde_proportions(baseline_rate, n_per_group, alpha=0.05, power=0.80):
    """
    Calculate the minimum detectable effect given a fixed sample size.

    Args:
        baseline_rate: Current conversion rate
        n_per_group: Available sample per group
        alpha: Significance level
        power: Statistical power

    Returns:
        Absolute MDE (add to baseline for new rate)
    """
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    z_beta = stats.norm.ppf(power)

    # Approximate formula for MDE
    se = np.sqrt(2 * baseline_rate * (1 - baseline_rate) / n_per_group)
    mde = (z_alpha + z_beta) * se
    return mde

# Example: 5% baseline, 10,000 users per group
baseline = 0.05
n = 10000
mde = achievable_mde_proportions(baseline, n)
relative_mde = mde / baseline
print(f"Achievable MDE: {mde:.4f} ({relative_mde:.1%} relative)")
# Output: Achievable MDE: 0.0086 (17.3% relative)
```
Common Sample Size Scenarios
Here's a reference table of approximate required sample sizes per group for conversion rate tests (α = 0.05, power = 80%), computed from the formula above:
| Baseline Rate | 5% Relative Lift | 10% Relative Lift | 20% Relative Lift |
|---|---|---|---|
| 1% | ~637,000 | ~163,100 | ~42,700 |
| 2% | ~315,200 | ~80,700 | ~21,100 |
| 5% | ~122,100 | ~31,200 | ~8,200 |
| 10% | ~57,800 | ~14,800 | ~3,800 |
| 20% | ~25,600 | ~6,500 | ~1,700 |
| 50% | ~6,300 | ~1,600 | ~390 |
Key observations:
- Lower baseline rates require dramatically more sample
- A 5% relative lift requires roughly 4x the sample of a 10% relative lift
- High-conversion metrics (50% baseline) are much easier to test
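If you need a baseline or lift not shown in the table, a short sketch along these lines regenerates it with statsmodels (exact values shift slightly depending on whether you use the arcsine or pooled-variance approximation):
```python
# Sketch: regenerate the reference table above with statsmodels.
import numpy as np
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

analysis = NormalIndPower()
for p0 in [0.01, 0.02, 0.05, 0.10, 0.20, 0.50]:
    row = []
    for lift in [0.05, 0.10, 0.20]:
        h = abs(proportion_effectsize(p0, p0 * (1 + lift)))  # magnitude of Cohen's h
        n = analysis.solve_power(effect_size=h, alpha=0.05, power=0.80,
                                 alternative='two-sided')
        row.append(f"{int(np.ceil(n)):,}")
    print(f"{p0:.0%}: {row}")
```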
The Pre-Test Checklist
Before launching any experiment:
1. Know Your Traffic
How many eligible users will enter the experiment per day/week? Be conservative—actual traffic is often lower than expected due to:
- Eligibility criteria
- Technical issues
- Seasonal variation
2. Set Your Runtime
How long can you run the test? Consider:
- Product cycles and launch pressure
- Weekly and seasonal patterns (run at least one full week)
- Stakeholder patience (how long decision-makers will wait for results)
3. Calculate Your MDE
With traffic × runtime = total sample, what's your achievable MDE?
```python
# Example calculation
daily_traffic = 5000 # users per day entering experiment
test_duration_days = 14 # two weeks
allocation = 0.50 # 50% to each variant
total_sample = daily_traffic * test_duration_days
n_per_group = total_sample * allocation
mde = achievable_mde_proportions(baseline_rate=0.05, n_per_group=n_per_group)
print(f"With {n_per_group:,.0f} users per group, MDE is {mde/0.05:.1%} relative")
4. Sanity Check Against Expected Effect
Is your MDE smaller than the effect you plausibly expect? If you're changing a button color and your achievable MDE is a 20% relative lift, but no button color change in your history has moved the needle by more than 5%... don't run the test.
5. Document Your Decision
If you proceed: pre-register your MDE, runtime, and decision criteria. If you don't: document why, and what would need to change.
Adjusting When Power Is Low
If your achievable MDE is too large, you have several options:
Increase Sample Size
- Run longer: More days = more users
- Broader targeting: Test on more pages or user segments
- Higher allocation: Put more eligible traffic into the experiment; within it, a 50/50 split between control and treatment is most powerful
Reduce Variance (CUPED)
Variance reduction techniques like CUPED can cut required sample size by 50% or more when you have strong pre-experiment covariates. See CUPED and Variance Reduction.
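The core adjustment is simple enough to sketch (following Deng et al. 2013; the simulated data below is purely illustrative): subtract θ(X − X̄) from each user's metric Y, where X is a pre-experiment covariate and θ = Cov(X, Y) / Var(X). The treatment effect is unchanged, but the variance shrinks by a factor of 1 − corr(X, Y)².
```python
# Minimal CUPED sketch with simulated data.
import numpy as np

def cuped_adjust(y, x):
    theta = np.cov(x, y, bias=True)[0, 1] / np.var(x)
    return y - theta * (x - np.mean(x))

rng = np.random.default_rng(0)
x = rng.normal(50, 20, size=10_000)           # pre-experiment covariate (e.g., prior spend)
y = 0.8 * x + rng.normal(0, 10, size=10_000)  # in-experiment metric, correlated with x
print(np.var(y), np.var(cuped_adjust(y, x)))  # adjusted variance is much smaller
```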
Accept Lower Power
Sometimes 70% power is acceptable, especially for low-stakes decisions. Just be explicit about the tradeoff.
Accept Larger MDE
If you truly believe the effect—if real—would be large, a larger MDE might be fine. A feature expected to move conversion by 30% doesn't need to detect 5% effects.
Use Composite Metrics
Combining multiple metrics into a single index can increase power by capturing more of the treatment effect.
Don't Run the Test
Sometimes the right answer is "this isn't testable with current traffic." Ship based on qualitative reasoning, or find a higher-traffic surface.
Post-Test: Interpreting Non-Significant Results
A non-significant result doesn't prove no effect exists. It means one of:
- No effect exists
- An effect exists but your test was underpowered to detect it
To distinguish, look at your confidence interval:
```python
from statsmodels.stats.proportion import confint_proportions_2indep

# Example: 510/10,000 conversions in control vs 520/10,000 in treatment.
# Pass the treatment group first so the interval is for treatment minus control (the lift).
ci_low, ci_high = confint_proportions_2indep(
    count1=520, nobs1=10000,  # treatment
    count2=510, nobs2=10000   # control
)
print(f"95% CI for lift: [{ci_low:.4f}, {ci_high:.4f}]")
# If CI is [-0.8%, +1.0%], you can rule out large effects
# If CI is [-5%, +10%], you've learned almost nothing
```
A tight CI around zero is informative: no large effect exists. A wide CI that includes both large positive and negative effects means your test was underpowered.
Related Methods
- A/B Testing Statistical Methods for Product Teams — The complete guide to A/B testing statistics
- CUPED and Variance Reduction — How to detect smaller effects with the same sample
- Power Analysis Without Cargo Culting — Deeper dive into power analysis
Frequently Asked Questions
Q: What power level should I use? A: 80% power is the industry standard. For high-stakes decisions, consider 90% (which requires roughly a third more sample). Below 70%, you're likely wasting time with inconclusive results.
Q: Should I use one-sided or two-sided tests for sample size? A: Use two-sided. You should care if a change makes things worse, and one-sided calculations give misleadingly optimistic sample sizes.
Q: How do I account for multiple variants? A: With k variants, you need roughly k times the total sample to maintain the same power for detecting the best variant vs. control. Consider whether you really need all those variants.
Q: What about ratio metrics (like revenue per user)? A: Ratio metrics require special treatment—their variance isn't straightforward to calculate. Use bootstrap or delta method for sample size estimation.
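As a rough sketch of the delta-method route (the function and per-user arrays below are illustrative, not from a specific library): compute an "effective" per-user standard deviation for the ratio, then plug it into a means-based calculation like sample_size_for_means above.
```python
# Hypothetical sketch: delta-method effective std for a ratio metric such as
# revenue per session, where both numerator and denominator vary per user.
import numpy as np

def ratio_metric_effective_std(x, y):
    """Per-user effective std of sum(x)/sum(y); x = numerator, y = denominator."""
    mx, my = np.mean(x), np.mean(y)
    r = mx / my
    cov_xy = np.cov(x, y, bias=True)[0, 1]
    var = (np.var(x) - 2 * r * cov_xy + r**2 * np.var(y)) / my**2
    return np.sqrt(var)
```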
Q: My baseline rate changes over time. Which should I use? A: Use a recent, stable estimate. If rates are trending, your test results will be confounded anyway—consider whether it's the right time to test.
Key Takeaway
Before running any A/B test, calculate your achievable MDE given your traffic. If that MDE is larger than any plausible effect of your change, don't run the test—find a higher-traffic surface, use variance reduction techniques, or accept that the change isn't testable. An underpowered test wastes everyone's time with inconclusive results.
References
- https://www.evanmiller.org/ab-testing/sample-size.html
- https://www.stat.ubc.ca/~rollin/stats/ssize/b2.html
- Cohen, J. (1988). *Statistical Power Analysis for the Behavioral Sciences* (2nd ed.). Lawrence Erlbaum Associates.
- Deng, A., Xu, Y., Kohavi, R., & Walker, T. (2013). Improving the Sensitivity of Online Controlled Experiments by Utilizing Pre-Experiment Data. *WSDM '13*.
- van Belle, G. (2008). *Statistical Rules of Thumb* (2nd ed.). Wiley.