Bayesian Methods

Bayesian A/B Testing: Posterior Probabilities for Ship Decisions

How to run Bayesian A/B tests that give you the probability a variant wins. Practical guide with Python code for conversion rates and revenue metrics.


Quick Hits

  • Bayesian A/B tests output the probability a variant is best -- not a p-value
  • You can check results at any time without inflating false positive rates
  • The Beta-Binomial model handles conversion rates with just four numbers: successes and trials per variant
  • Decision rules like 'ship if P(better) > 95%' are intuitive and customizable to your risk tolerance
  • Expected loss quantifies the cost of a wrong decision in business terms

TL;DR

Bayesian A/B testing gives you the direct probability that a variant is better than control. Instead of "p < 0.05," you say "93% chance Variant B lifts conversion." This guide covers the Beta-Binomial model for conversion rates, Normal models for revenue, and decision rules based on expected loss.


Why Bayesian A/B Testing

The Frequentist Limitation

A standard A/B test tells you: "If there were no difference, the probability of seeing data this extreme is 3%." That is the p-value. It does not tell you:

  • The probability that B is better than A
  • How much better B likely is
  • Whether the improvement exceeds your minimum business threshold

The Bayesian Alternative

A Bayesian A/B test tells you directly:

  • P(B > A) = 94% -- the probability B has a higher conversion rate
  • Expected lift = 2.1% -- the most likely improvement
  • P(lift > 1%) = 88% -- the chance the lift exceeds your minimum threshold

These are the numbers product teams actually need to make ship decisions.


Beta-Binomial Model for Conversion Rates

How It Works

For binary outcomes (converted vs. did not convert), the Beta-Binomial model is the standard Bayesian approach:

  1. Prior: Beta(alpha, beta) distribution for each variant's conversion rate
  2. Data: Observe successes and trials
  3. Posterior: Beta(alpha + successes, beta + failures) -- conjugate, no MCMC needed

The function below implements this model for two variants:
import numpy as np
from scipy import stats

def bayesian_ab_test_binary(control_successes, control_trials,
                            treatment_successes, treatment_trials,
                            prior_alpha=1, prior_beta=1,
                            n_samples=100000):
    """
    Bayesian A/B test for binary outcomes (conversion rates).

    Parameters:
    -----------
    control_successes, control_trials : int
        Control group results
    treatment_successes, treatment_trials : int
        Treatment group results
    prior_alpha, prior_beta : float
        Beta prior parameters (1,1 = uniform)

    Returns:
    --------
    dict with posterior summaries and decision metrics
    """
    # Posterior distributions
    control_post = stats.beta(
        prior_alpha + control_successes,
        prior_beta + control_trials - control_successes
    )
    treatment_post = stats.beta(
        prior_alpha + treatment_successes,
        prior_beta + treatment_trials - treatment_successes
    )

    # Monte Carlo samples
    control_samples = control_post.rvs(n_samples)
    treatment_samples = treatment_post.rvs(n_samples)
    diff_samples = treatment_samples - control_samples
    relative_diff = diff_samples / control_samples

    # Decision metrics
    prob_treatment_better = np.mean(diff_samples > 0)
    expected_lift = np.mean(diff_samples)
    ci_low, ci_high = np.percentile(diff_samples, [2.5, 97.5])

    # Expected loss
    loss_ship = np.mean(np.maximum(-diff_samples, 0))
    loss_hold = np.mean(np.maximum(diff_samples, 0))

    return {
        'prob_treatment_better': prob_treatment_better,
        'expected_lift_absolute': expected_lift,
        'expected_lift_relative': np.mean(relative_diff),
        'ci_95': (ci_low, ci_high),
        'loss_if_ship': loss_ship,
        'loss_if_hold': loss_hold,
        'decision': 'Ship' if loss_ship < loss_hold else 'Hold',
        'control_mean': control_post.mean(),
        'treatment_mean': treatment_post.mean()
    }


# Example: checkout flow experiment
result = bayesian_ab_test_binary(
    control_successes=1150, control_trials=10000,
    treatment_successes=1230, treatment_trials=10000
)

print("Bayesian A/B Test Results")
print("=" * 50)
print(f"Control conversion rate:   {result['control_mean']:.2%}")
print(f"Treatment conversion rate: {result['treatment_mean']:.2%}")
print(f"\nP(Treatment > Control):    {result['prob_treatment_better']:.1%}")
print(f"Expected lift (absolute):  {result['expected_lift_absolute']:.4f}")
print(f"Expected lift (relative):  {result['expected_lift_relative']:.1%}")
print(f"95% credible interval:     [{result['ci_95'][0]:.4f}, {result['ci_95'][1]:.4f}]")
print(f"\nExpected loss if ship:     {result['loss_if_ship']:.6f}")
print(f"Expected loss if hold:     {result['loss_if_hold']:.6f}")
print(f"Decision:                  {result['decision']}")

Normal Model for Continuous Metrics

Revenue, Time on Page, Session Length

For continuous outcomes, use a Normal-Normal model:

def bayesian_ab_test_continuous(control_data, treatment_data,
                                prior_mean=0, prior_var=1e6,
                                n_samples=100000):
    """
    Bayesian A/B test for continuous outcomes.

    Uses Normal-Normal conjugate model with known variance
    approximation (large sample).
    """
    # Summary statistics
    c_mean, c_var, c_n = np.mean(control_data), np.var(control_data, ddof=1), len(control_data)
    t_mean, t_var, t_n = np.mean(treatment_data), np.var(treatment_data, ddof=1), len(treatment_data)

    # Posterior parameters (Normal-Normal conjugate)
    c_post_var = 1 / (1/prior_var + c_n/c_var)
    c_post_mean = c_post_var * (prior_mean/prior_var + c_n*c_mean/c_var)

    t_post_var = 1 / (1/prior_var + t_n/t_var)
    t_post_mean = t_post_var * (prior_mean/prior_var + t_n*t_mean/t_var)

    # Sample posteriors
    control_samples = np.random.normal(c_post_mean, np.sqrt(c_post_var), n_samples)
    treatment_samples = np.random.normal(t_post_mean, np.sqrt(t_post_var), n_samples)
    diff_samples = treatment_samples - control_samples

    prob_better = np.mean(diff_samples > 0)
    expected_diff = np.mean(diff_samples)
    ci = np.percentile(diff_samples, [2.5, 97.5])

    return {
        'prob_treatment_better': prob_better,
        'expected_difference': expected_diff,
        'ci_95': tuple(ci),
        'control_posterior_mean': c_post_mean,
        'treatment_posterior_mean': t_post_mean
    }


# Example: revenue per user
np.random.seed(42)
control_revenue = np.random.lognormal(2.5, 0.8, 5000)
treatment_revenue = np.random.lognormal(2.55, 0.8, 5000)

result = bayesian_ab_test_continuous(control_revenue, treatment_revenue)
print(f"P(Treatment > Control): {result['prob_treatment_better']:.1%}")
print(f"Expected revenue difference: ${result['expected_difference']:.2f}")
print(f"95% CI: [${result['ci_95'][0]:.2f}, ${result['ci_95'][1]:.2f}]")

Decision Rules

Threshold-Based

The simplest approach: ship if P(treatment better) exceeds a threshold; a minimal sketch follows the table below.

  Risk Level     Threshold   Use Case
  Low risk       P > 90%     UI tweaks, copy changes
  Medium risk    P > 95%     Feature changes, pricing
  High risk      P > 99%     Infrastructure, irreversible changes
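
Here is that rule applied to posterior difference samples (decision_by_threshold is a hypothetical helper, not part of any library; pick the threshold from the table above):

def decision_by_threshold(diff_samples, prob_threshold=0.95):
    """Ship when P(treatment better) exceeds the chosen threshold."""
    prob_better = np.mean(diff_samples > 0)
    return {
        'ship': prob_better > prob_threshold,
        'prob_treatment_better': prob_better
    }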

Expected Loss

A more sophisticated approach that accounts for the magnitude of potential mistakes:

def decision_by_loss(diff_samples, threshold=0.001):
    """
    Ship when expected loss from shipping is below threshold.

    threshold: maximum acceptable loss in conversion rate units
    """
    loss_if_ship = np.mean(np.maximum(-diff_samples, 0))
    loss_if_hold = np.mean(np.maximum(diff_samples, 0))

    return {
        'ship': loss_if_ship < threshold,
        'loss_if_ship': loss_if_ship,
        'loss_if_hold': loss_if_hold,
        'reason': f"Expected loss from shipping ({loss_if_ship:.6f}) {'<' if loss_if_ship < threshold else '>='} threshold ({threshold})"
    }

Expected loss is often preferred because it weights mistakes by their magnitude, not just their probability. A variant with P(better) = 70% but a tiny expected downside might be worth shipping, while one with P(better) = 92% but a large potential downside might not be.
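
To make that concrete, here is an illustrative sketch with two made-up posteriors for the conversion-rate difference: one with roughly a 70% chance of being better and almost no downside, and one with roughly a 92% chance of being better but a larger expected loss if it turns out to be worse.

# Illustrative posteriors for the difference, summarized as Monte Carlo samples.
# The parameters are invented purely to show the contrast.
rng = np.random.default_rng(0)

diff_modest = rng.normal(0.0005, 0.001, 100000)  # P(better) ~ 69%, tiny downside
diff_risky = rng.normal(0.014, 0.010, 100000)    # P(better) ~ 92%, bigger downside

for name, d in [('modest-but-safe', diff_modest), ('likely-but-risky', diff_risky)]:
    print(f"{name}: P(better) = {np.mean(d > 0):.1%}, "
          f"expected loss if ship = {np.mean(np.maximum(-d, 0)):.6f}")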


Handling Multiple Variants

Bayesian methods extend naturally to more than two variants:

def bayesian_multivariant_test(variants, n_samples=100000):
    """
    Compare multiple variants simultaneously.

    variants: list of (successes, trials) tuples
    """
    posteriors = [
        stats.beta(1 + s, 1 + t - s).rvs(n_samples)
        for s, t in variants
    ]
    posteriors = np.array(posteriors)

    # Probability each variant is best
    best = np.argmax(posteriors, axis=0)
    prob_best = [np.mean(best == i) for i in range(len(variants))]

    return {
        f'variant_{i}': {
            'mean': np.mean(posteriors[i]),
            'prob_best': prob_best[i]
        }
        for i in range(len(variants))
    }


result = bayesian_multivariant_test([
    (1150, 10000),  # Control
    (1230, 10000),  # Variant A
    (1200, 10000),  # Variant B
    (1180, 10000),  # Variant C
])

for name, data in result.items():
    print(f"{name}: mean={data['mean']:.2%}, P(best)={data['prob_best']:.1%}")

No multiple comparison correction is needed. The joint posterior handles all comparisons simultaneously.


When to Stop the Experiment

Unlike frequentist tests, you do not need a fixed stopping time. But you still need guardrails:

  1. Minimum sample size: Run for at least enough observations that the posterior is not dominated by the prior (e.g., 100 conversions per variant).
  2. Expected loss threshold: Stop when expected loss from your decision drops below a business-meaningful level.
  3. Maximum duration: Set a calendar deadline to avoid experiments that run forever.

The helper below combines these checks into a single stopping rule:
def should_stop(diff_samples, min_samples_met=True,
                loss_threshold=0.0005, prob_threshold=0.95):
    """Check stopping criteria for Bayesian experiment."""
    if not min_samples_met:
        return False, "Minimum sample size not yet reached"

    loss_ship = np.mean(np.maximum(-diff_samples, 0))
    prob_better = np.mean(diff_samples > 0)

    if loss_ship < loss_threshold:
        return True, f"Expected loss ({loss_ship:.6f}) below threshold -- ship"
    if prob_better >= prob_threshold:
        return True, f"P(better) = {prob_better:.1%} -- ship"
    if prob_better <= (1 - prob_threshold):
        return True, f"P(better) = {prob_better:.1%} -- hold, variant is worse"

    return False, f"Not yet confident: P(better) = {prob_better:.1%} -- continue collecting data"


Key Takeaway

Bayesian A/B testing replaces p-values with direct probability statements: "There is a 94% chance Variant B has a higher conversion rate." This makes communication with stakeholders straightforward and decisions more transparent. Use the Beta-Binomial model for conversion rates, Normal models for continuous metrics, and expected loss for decision rules that map to business impact.


Frequently Asked Questions

Can I peek at Bayesian A/B test results early?
Yes. Unlike frequentist tests, Bayesian inference does not require a fixed sample size to control error rates. You can check the posterior at any time. However, stopping the moment P(better) crosses your threshold can still lead to overconfident decisions with very small samples. Use a minimum sample size or expected loss threshold to guard against this.
How do I set a decision threshold?
Common choices are P(better) > 95% for low-risk changes and P(better) > 99% for high-risk ones. Alternatively, use expected loss: ship when the expected loss from shipping is below a business-meaningful threshold (e.g., less than 0.1% conversion rate loss). The right threshold depends on the cost of being wrong.
Does Bayesian testing need a different sample size?
Bayesian tests do not require a fixed sample size in the same way, but you still need enough data for the posterior to be precise. With very small samples the posterior is wide and dominated by the prior, making conclusions unreliable. Simulation-based power analysis can help you plan Bayesian experiments.
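
A rough sketch of that kind of simulation, assuming a hypothetical 11.5% baseline conversion rate, a true 7% relative lift, and a ship rule of P(better) > 95% (all numbers are assumptions to adapt to your own experiment):

def simulate_bayesian_power(n_per_arm, base_rate=0.115, relative_lift=0.07,
                            prob_threshold=0.95, n_sims=500, n_draws=20000):
    """Estimate how often a true lift of this size would clear the ship rule."""
    rng = np.random.default_rng(1)
    treatment_rate = base_rate * (1 + relative_lift)
    ships = 0
    for _ in range(n_sims):
        # Simulate one experiment with the assumed true rates
        c_succ = rng.binomial(n_per_arm, base_rate)
        t_succ = rng.binomial(n_per_arm, treatment_rate)
        c_post = stats.beta(1 + c_succ, 1 + n_per_arm - c_succ).rvs(n_draws, random_state=rng)
        t_post = stats.beta(1 + t_succ, 1 + n_per_arm - t_succ).rvs(n_draws, random_state=rng)
        ships += np.mean(t_post - c_post > 0) > prob_threshold
    return ships / n_sims

print(f"P(ship) with 10,000 users per arm: {simulate_bayesian_power(10000):.0%}")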
