Bayesian A/B Testing: Posterior Probabilities for Ship Decisions
How to run Bayesian A/B tests that give you the probability a variant wins. Practical guide with Python code for conversion rates and revenue metrics.
Quick Hits
- Bayesian A/B tests output the probability a variant is best -- not a p-value
- You can check results at any time without inflating false positive rates
- The Beta-Binomial model handles conversion rates with just four numbers: successes and trials per variant
- Decision rules like 'ship if P(better) > 95%' are intuitive and customizable to your risk tolerance
- Expected loss quantifies the cost of a wrong decision in business terms
TL;DR
Bayesian A/B testing gives you the direct probability that a variant is better than control. Instead of "p < 0.05," you say "93% chance Variant B lifts conversion." This guide covers the Beta-Binomial model for conversion rates, Normal models for revenue, and decision rules based on expected loss.
Why Bayesian A/B Testing
The Frequentist Limitation
A standard A/B test tells you: "If there were no difference, the probability of seeing data at least this extreme is 3%." That is the p-value. It does not tell you:
- The probability that B is better than A
- How much better B likely is
- Whether the improvement exceeds your minimum business threshold
The Bayesian Alternative
A Bayesian A/B test tells you directly:
- P(B > A) = 94% -- the probability B has a higher conversion rate
- Expected lift = 2.1% -- the average improvement under the posterior
- P(lift > 1%) = 88% -- the chance the lift exceeds your minimum threshold
These are the numbers product teams actually need to make ship decisions.
Beta-Binomial Model for Conversion Rates
How It Works
For binary outcomes (converted vs. did not convert), the Beta-Binomial model is the standard Bayesian approach:
- Prior: Beta(alpha, beta) distribution for each variant's conversion rate
- Data: Observe successes and trials
- Posterior: Beta(alpha + successes, beta + failures) -- conjugate, no MCMC needed
import numpy as np
from scipy import stats

def bayesian_ab_test_binary(control_successes, control_trials,
                            treatment_successes, treatment_trials,
                            prior_alpha=1, prior_beta=1,
                            n_samples=100000):
    """
    Bayesian A/B test for binary outcomes (conversion rates).

    Parameters
    ----------
    control_successes, control_trials : int
        Control group results
    treatment_successes, treatment_trials : int
        Treatment group results
    prior_alpha, prior_beta : float
        Beta prior parameters (1, 1 = uniform)

    Returns
    -------
    dict with posterior summaries and decision metrics
    """
    # Posterior distributions (Beta-Binomial conjugacy)
    control_post = stats.beta(
        prior_alpha + control_successes,
        prior_beta + control_trials - control_successes
    )
    treatment_post = stats.beta(
        prior_alpha + treatment_successes,
        prior_beta + treatment_trials - treatment_successes
    )

    # Monte Carlo samples from each posterior
    control_samples = control_post.rvs(n_samples)
    treatment_samples = treatment_post.rvs(n_samples)
    diff_samples = treatment_samples - control_samples
    relative_diff = diff_samples / control_samples

    # Decision metrics
    prob_treatment_better = np.mean(diff_samples > 0)
    expected_lift = np.mean(diff_samples)
    ci_low, ci_high = np.percentile(diff_samples, [2.5, 97.5])

    # Expected loss: conversion rate you expect to give up if the decision is wrong
    loss_ship = np.mean(np.maximum(-diff_samples, 0))
    loss_hold = np.mean(np.maximum(diff_samples, 0))

    return {
        'prob_treatment_better': prob_treatment_better,
        'expected_lift_absolute': expected_lift,
        'expected_lift_relative': np.mean(relative_diff),
        'ci_95': (ci_low, ci_high),
        'loss_if_ship': loss_ship,
        'loss_if_hold': loss_hold,
        'decision': 'Ship' if loss_ship < loss_hold else 'Hold',
        'control_mean': control_post.mean(),
        'treatment_mean': treatment_post.mean()
    }

# Example: checkout flow experiment
result = bayesian_ab_test_binary(
    control_successes=1150, control_trials=10000,
    treatment_successes=1230, treatment_trials=10000
)

print("Bayesian A/B Test Results")
print("=" * 50)
print(f"Control conversion rate: {result['control_mean']:.2%}")
print(f"Treatment conversion rate: {result['treatment_mean']:.2%}")
print(f"\nP(Treatment > Control): {result['prob_treatment_better']:.1%}")
print(f"Expected lift (absolute): {result['expected_lift_absolute']:.4f}")
print(f"Expected lift (relative): {result['expected_lift_relative']:.1%}")
print(f"95% credible interval: [{result['ci_95'][0]:.4f}, {result['ci_95'][1]:.4f}]")
print(f"\nExpected loss if ship: {result['loss_if_ship']:.6f}")
print(f"Expected loss if hold: {result['loss_if_hold']:.6f}")
print(f"Decision: {result['decision']}")
Normal Model for Continuous Metrics
Revenue, Time on Page, Session Length
For continuous outcomes, use a Normal-Normal model:
def bayesian_ab_test_continuous(control_data, treatment_data,
                                prior_mean=0, prior_var=1e6,
                                n_samples=100000):
    """
    Bayesian A/B test for continuous outcomes.

    Uses a Normal-Normal conjugate model with a known-variance
    approximation (reasonable for large samples).
    """
    # Summary statistics
    c_mean, c_var, c_n = np.mean(control_data), np.var(control_data, ddof=1), len(control_data)
    t_mean, t_var, t_n = np.mean(treatment_data), np.var(treatment_data, ddof=1), len(treatment_data)

    # Posterior parameters (Normal-Normal conjugate update)
    c_post_var = 1 / (1/prior_var + c_n/c_var)
    c_post_mean = c_post_var * (prior_mean/prior_var + c_n*c_mean/c_var)
    t_post_var = 1 / (1/prior_var + t_n/t_var)
    t_post_mean = t_post_var * (prior_mean/prior_var + t_n*t_mean/t_var)

    # Sample the posteriors of the two group means
    control_samples = np.random.normal(c_post_mean, np.sqrt(c_post_var), n_samples)
    treatment_samples = np.random.normal(t_post_mean, np.sqrt(t_post_var), n_samples)
    diff_samples = treatment_samples - control_samples

    prob_better = np.mean(diff_samples > 0)
    expected_diff = np.mean(diff_samples)
    ci = np.percentile(diff_samples, [2.5, 97.5])

    return {
        'prob_treatment_better': prob_better,
        'expected_difference': expected_diff,
        'ci_95': tuple(ci),
        'control_posterior_mean': c_post_mean,
        'treatment_posterior_mean': t_post_mean
    }

# Example: revenue per user (simulated data)
np.random.seed(42)
control_revenue = np.random.lognormal(2.5, 0.8, 5000)
treatment_revenue = np.random.lognormal(2.55, 0.8, 5000)

result = bayesian_ab_test_continuous(control_revenue, treatment_revenue)
print(f"P(Treatment > Control): {result['prob_treatment_better']:.1%}")
print(f"Expected revenue difference: ${result['expected_difference']:.2f}")
print(f"95% CI: [${result['ci_95'][0]:.2f}, ${result['ci_95'][1]:.2f}]")
Decision Rules
Threshold-Based
The simplest approach: ship if P(treatment better) exceeds a threshold.
| Risk Level | Threshold | Use Case |
|---|---|---|
| Low risk | P > 90% | UI tweaks, copy changes |
| Medium risk | P > 95% | Feature changes, pricing |
| High risk | P > 99% | Infrastructure, irreversible changes |
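Implementing the rule is a few lines on top of the posterior difference samples. The helper name below is ours, not from a library, and the default threshold simply mirrors the medium-risk row in the table:

def decision_by_threshold(diff_samples, prob_threshold=0.95):
    """Ship when P(treatment better) exceeds the chosen threshold."""
    prob_better = np.mean(diff_samples > 0)
    return {
        'ship': prob_better > prob_threshold,
        'prob_treatment_better': prob_better,
        'threshold': prob_threshold
    }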
Expected Loss
A more sophisticated approach that accounts for the magnitude of potential mistakes:
def decision_by_loss(diff_samples, threshold=0.001):
    """
    Ship when the expected loss from shipping is below the threshold.

    threshold: maximum acceptable loss in conversion-rate units
    """
    loss_if_ship = np.mean(np.maximum(-diff_samples, 0))
    loss_if_hold = np.mean(np.maximum(diff_samples, 0))
    return {
        'ship': loss_if_ship < threshold,
        'loss_if_ship': loss_if_ship,
        'loss_if_hold': loss_if_hold,
        'reason': (f"Expected loss from shipping ({loss_if_ship:.6f}) "
                   f"{'<' if loss_if_ship < threshold else '>='} threshold ({threshold})")
    }
Expected loss is often the better rule because it weighs how costly a wrong decision would be, not just how likely it is. A variant with P(better) = 70% but a tiny expected downside might be worth shipping, while one with P(better) = 92% but a large potential downside might not be, as the sketch below illustrates.
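A small, entirely synthetic illustration of that point: the two "posteriors" below are made-up Normal distributions chosen to have roughly the P(better) profiles described above, not results from a real experiment.

# Synthetic illustration (made-up posterior difference samples):
# A: P(better) around 70%, but the downside when wrong is tiny.
# B: P(better) around 92%, but the downside when wrong is much larger.
rng = np.random.default_rng(0)
diff_a = rng.normal(0.0005, 0.001, 100000)   # small, uncertain lift
diff_b = rng.normal(0.02, 0.014, 100000)     # larger lift, fatter downside tail

for name, diff in [('A', diff_a), ('B', diff_b)]:
    print(f"{name}: P(better)={np.mean(diff > 0):.0%}, "
          f"expected loss if ship={np.mean(np.maximum(-diff, 0)):.5f}")

With a loss threshold between the two printed values, A would ship despite its lower P(better) and B would not.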
Handling Multiple Variants
Bayesian methods extend naturally to more than two variants:
def bayesian_multivariant_test(variants, n_samples=100000):
    """
    Compare multiple variants simultaneously.

    variants: list of (successes, trials) tuples
    """
    # Posterior samples for each variant under a uniform Beta(1, 1) prior
    posteriors = np.array([
        stats.beta(1 + s, 1 + t - s).rvs(n_samples)
        for s, t in variants
    ])

    # Probability each variant has the highest conversion rate
    best = np.argmax(posteriors, axis=0)
    prob_best = [np.mean(best == i) for i in range(len(variants))]

    return {
        f'variant_{i}': {
            'mean': np.mean(posteriors[i]),
            'prob_best': prob_best[i]
        }
        for i in range(len(variants))
    }

result = bayesian_multivariant_test([
    (1150, 10000),  # Control
    (1230, 10000),  # Variant A
    (1200, 10000),  # Variant B
    (1180, 10000),  # Variant C
])

for name, data in result.items():
    print(f"{name}: mean={data['mean']:.2%}, P(best)={data['prob_best']:.1%}")
No multiple comparison correction is needed. The joint posterior handles all comparisons simultaneously.
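Expected loss extends to this setting too: the loss of choosing variant i is the conversion rate you expect to give up relative to whichever variant is actually best. A sketch built on the same posterior-sample matrix as the function above (the helper name is ours, not part of that function):

def expected_loss_per_variant(posteriors):
    """posteriors: array of shape (n_variants, n_samples) of conversion-rate draws."""
    best_rate = posteriors.max(axis=0)            # best variant in each draw
    return (best_rate - posteriors).mean(axis=1)  # average shortfall per variant

# Rebuild the posterior samples for the four-variant example above
variants = [(1150, 10000), (1230, 10000), (1200, 10000), (1180, 10000)]
posteriors = np.array([stats.beta(1 + s, 1 + t - s).rvs(100000) for s, t in variants])
for i, loss in enumerate(expected_loss_per_variant(posteriors)):
    print(f"variant_{i}: expected loss if chosen = {loss:.5f}")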
When to Stop the Experiment
Unlike a fixed-horizon frequentist test, you do not need to commit to a sample size in advance. But you still need guardrails:
- Minimum sample size: Run for at least enough observations that the posterior is not dominated by the prior (e.g., 100 conversions per variant).
- Expected loss threshold: Stop when expected loss from your decision drops below a business-meaningful level.
- Maximum duration: Set a calendar deadline to avoid experiments that run forever.
def should_stop(diff_samples, min_samples_met=True,
                loss_threshold=0.0005, prob_threshold=0.95):
    """Check stopping criteria for a Bayesian experiment."""
    if not min_samples_met:
        return False, "Minimum sample size not yet reached"

    loss_ship = np.mean(np.maximum(-diff_samples, 0))
    prob_better = np.mean(diff_samples > 0)

    if loss_ship < loss_threshold:
        return True, f"Expected loss ({loss_ship:.6f}) below threshold -- ship"
    if prob_better >= prob_threshold:
        return True, f"P(better) = {prob_better:.1%} -- ship"
    if prob_better <= (1 - prob_threshold):
        return True, f"P(better) = {prob_better:.1%} -- hold, variant is worse"
    return False, f"Not yet confident: P(better) = {prob_better:.1%} -- continue collecting data"
Related Methods
- Bayesian Methods Overview (Pillar) - When and why to go Bayesian
- Bayesian vs. Frequentist - Comparing the two paradigms
- Credible Intervals - Interpreting Bayesian intervals
- Bayesian Sample Size - Planning experiments
- Practical Bayes Tools - PyMC, Stan, and brms
Key Takeaway
Bayesian A/B testing replaces p-values with direct probability statements: "There is a 94% chance Variant B has a higher conversion rate." This makes communication with stakeholders straightforward and decisions more transparent. Use the Beta-Binomial model for conversion rates, Normal models for continuous metrics, and expected loss for decision rules that map to business impact.
Frequently Asked Questions
Can I peek at Bayesian A/B test results early?
Yes. Posterior probabilities are valid whenever you compute them, so checking mid-experiment does not inflate false positive rates the way repeated frequentist peeking does. You still want the guardrails above: a minimum sample size, an expected-loss threshold, and a maximum duration.
How do I set a decision threshold?
Match it to your risk tolerance: roughly P(better) > 90% for low-risk UI or copy changes, > 95% for feature or pricing changes, and > 99% for infrastructure or irreversible changes -- or use an expected-loss threshold expressed in business units.
Does Bayesian testing need a different sample size?
You still need enough data that the posterior is not dominated by the prior (for example, on the order of 100 conversions per variant) and enough to resolve the effect sizes you care about; see the Bayesian Sample Size article for planning.