Bayesian Sample Size: Why It's Different and When It Helps

How to plan sample sizes for Bayesian experiments. Learn why Bayesian sample sizing differs from frequentist, and when it gives you smaller or more flexible experiments.

Quick Hits

  • Bayesian experiments do not require a fixed sample size -- but you still need enough data for useful conclusions
  • Instead of power, think posterior precision: how narrow does your credible interval need to be?
  • Informative priors effectively add 'free' observations, reducing the data you need to collect
  • Simulation-based planning lets you estimate how many observations you need for a given posterior width
  • Bayesian stopping rules do not inflate error rates, unlike peeking in frequentist tests

TL;DR

Bayesian experiments do not require a fixed sample size determined by power analysis. Instead, you plan based on posterior precision: how many observations do you need for a credible interval narrow enough to make a decision? This guide covers simulation-based planning, the role of priors in sample size, and when Bayesian flexibility actually helps.


Why Bayesian Sample Size Is Different

Frequentist: Fixed Sample, Binary Outcome

Frequentist power analysis asks: "How many observations do I need to have an 80% chance of detecting a 2% effect at alpha = 0.05?"

This gives you a fixed number. You commit to collecting exactly that many observations: stopping early because the results look promising inflates your false positive rate, and running past the target wastes traffic.
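
For concreteness, here is a minimal sketch of that frequentist calculation using the standard two-proportion sample size formula (the 12% baseline and 2-point lift are chosen to match the examples below):

from scipy import stats

# 12% baseline, 2-point absolute lift, alpha = 0.05, power = 0.80
p1, p2, alpha, power = 0.12, 0.14, 0.05, 0.80
z_alpha = stats.norm.ppf(1 - alpha / 2)
z_beta = stats.norm.ppf(power)
p_bar = (p1 + p2) / 2

n_per_arm = ((z_alpha * (2 * p_bar * (1 - p_bar)) ** 0.5
              + z_beta * (p1 * (1 - p1) + p2 * (1 - p2)) ** 0.5) ** 2
             / (p2 - p1) ** 2)
print(f"Frequentist n per arm: {n_per_arm:.0f}")  # roughly 4,400 with these inputs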

Bayesian: Flexible Sample, Continuous Monitoring

Bayesian planning asks: "How many observations do I need for my posterior to be precise enough to make a confident decision?"

There is no single magic number. You can check the posterior at any time. The posterior is always valid -- it just gets more precise as you collect more data.
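
The simulation below illustrates this: it draws Beta posteriors for a 12% baseline with a 2-point lift and tracks how the 95% credible interval for the difference narrows as the per-arm sample size grows.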

import numpy as np
from scipy import stats

def posterior_width_by_sample_size(true_rate=0.12, effect=0.02,
                                    sample_sizes=[100, 250, 500, 1000, 2500, 5000],
                                    prior_a=1, prior_b=1, n_simulations=500):
    """
    Simulate posterior width as a function of sample size.
    """
    print(f"True control rate: {true_rate:.1%}")
    print(f"True effect: {effect:.1%}")
    print(f"Prior: Beta({prior_a}, {prior_b})")
    print(f"\n{'n per arm':<12} {'Median CI Width':<18} {'P(detect effect)':<20} {'Median P(B>A)'}")
    print("-" * 70)

    for n in sample_sizes:
        ci_widths = []
        detected = 0
        p_better_list = []

        for _ in range(n_simulations):
            # Simulate data
            control_conv = np.random.binomial(n, true_rate)
            treatment_conv = np.random.binomial(n, true_rate + effect)

            # Posterior samples
            c_post = stats.beta(prior_a + control_conv, prior_b + n - control_conv)
            t_post = stats.beta(prior_a + treatment_conv, prior_b + n - treatment_conv)

            c_samples = c_post.rvs(10000)
            t_samples = t_post.rvs(10000)
            diff = t_samples - c_samples

            # Metrics
            ci = np.percentile(diff, [2.5, 97.5])
            ci_widths.append(ci[1] - ci[0])
            p_better = np.mean(diff > 0)
            p_better_list.append(p_better)
            if p_better > 0.95:
                detected += 1

        median_width = np.median(ci_widths)
        detection_rate = detected / n_simulations
        median_p = np.median(p_better_list)

        print(f"{n:<12} {median_width:<18.4f} {detection_rate:<20.1%} {median_p:.1%}")


posterior_width_by_sample_size()

Simulation-Based Sample Size Planning

The Core Approach

  1. Define the minimum effect size you care about
  2. Choose your prior
  3. Set your decision criterion (e.g., 95% credible interval excludes zero, or P(better) > 95%)
  4. Simulate experiments at various sample sizes
  5. Find the smallest n where your criterion is met most of the time

def bayesian_sample_size(target_effect=0.02, base_rate=0.12,
                          desired_precision=0.02, confidence=0.80,
                          prior_a=1, prior_b=1, n_simulations=1000):
    """
    Find sample size for desired posterior precision.

    desired_precision: maximum 95% CI width you'll accept
    confidence: proportion of simulations that should achieve the precision
    """
    for n in [50, 100, 200, 500, 1000, 2000, 5000, 10000]:
        precise_count = 0

        for _ in range(n_simulations):
            control_conv = np.random.binomial(n, base_rate)
            treatment_conv = np.random.binomial(n, base_rate + target_effect)

            c_samples = stats.beta(prior_a + control_conv, prior_b + n - control_conv).rvs(5000)
            t_samples = stats.beta(prior_a + treatment_conv, prior_b + n - treatment_conv).rvs(5000)
            diff = t_samples - c_samples

            ci_width = np.percentile(diff, 97.5) - np.percentile(diff, 2.5)
            if ci_width <= desired_precision:
                precise_count += 1

        prop_precise = precise_count / n_simulations
        if prop_precise >= confidence:
            print(f"Recommended: n = {n} per arm")
            print(f"  {prop_precise:.0%} of simulations achieve CI width <= {desired_precision}")
            return n

    print("Need more than 10,000 per arm")
    return None


recommended_n = bayesian_sample_size(
    target_effect=0.02,
    base_rate=0.12,
    desired_precision=0.03,
    confidence=0.80
)

How Priors Affect Sample Size

Informative Priors as "Free Data"

An informative prior is mathematically equivalent to having already observed data. A Beta(12, 88) prior on a conversion rate acts like 100 prior observations with a 12% rate.
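
A minimal sketch of that equivalence (the conversion counts here are hypothetical, chosen only for illustration):

from scipy import stats

# A Beta(12, 88) prior carries the weight of 100 pseudo-observations
# (12 successes, 88 failures). New data simply adds to those counts.
prior_a, prior_b = 12, 88
conversions, visitors = 60, 500     # hypothetical observed data

posterior = stats.beta(prior_a + conversions, prior_b + visitors - conversions)
print(f"Posterior mean: {posterior.mean():.3f}")   # (12 + 60) / (100 + 500) = 0.120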

def prior_impact_on_sample_size(true_rate=0.12, effect=0.02,
                                 priors={'Flat Beta(1,1)': (1, 1),
                                         'Weak Beta(2,18)': (2, 18),
                                         'Informative Beta(12,88)': (12, 88)},
                                 threshold_prob=0.95,
                                 n_simulations=500):
    """
    Show how different priors affect required sample size.
    """
    print(f"Effect to detect: {effect:.1%}")
    print(f"Decision criterion: P(B > A) > {threshold_prob:.0%}")
    print(f"\n{'Prior':<30} {'n=200':<12} {'n=500':<12} {'n=1000':<12} {'n=2000'}")
    print("-" * 70)

    for prior_name, (a, b) in priors.items():
        results = []
        for n in [200, 500, 1000, 2000]:
            detected = 0
            for _ in range(n_simulations):
                cc = np.random.binomial(n, true_rate)
                tc = np.random.binomial(n, true_rate + effect)

                c_s = stats.beta(a + cc, b + n - cc).rvs(5000)
                t_s = stats.beta(a + tc, b + n - tc).rvs(5000)
                p_better = np.mean(t_s > c_s)
                if p_better > threshold_prob:
                    detected += 1
            results.append(f"{detected/n_simulations:.0%}")

        print(f"{prior_name:<30} {' '.join(f'{r:<12}' for r in results)}")


prior_impact_on_sample_size()

Informative priors reduce the sample needed -- but only if the prior accurately reflects reality. A wrong informative prior can slow convergence to the truth.
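
A quick sketch of that failure mode, using expected counts rather than simulation (the Beta(30, 70) prior is a hypothetical mis-specification):

# A prior centered at 30% with the weight of 100 pseudo-observations,
# while the true conversion rate is 12%. The posterior mean stays pulled
# toward the prior until the data overwhelm it.
true_rate = 0.12
prior_a, prior_b = 30, 70

for n in [100, 1000, 5000]:
    expected_conversions = true_rate * n
    post_mean = (prior_a + expected_conversions) / (prior_a + prior_b + n)
    print(f"n={n:>5}: posterior mean ~ {post_mean:.3f}  (true rate: {true_rate:.2f})")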


Bayesian Stopping Rules

Why You Can Peek

In frequentist testing, peeking inflates the false positive rate because each look is an opportunity to stop when noise looks like signal.

Bayesian inference does not have this problem. The posterior at any point is a valid summary of what you know given the data so far. There is no "multiple testing" penalty.
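
A small sketch of why the interim look is harmless for the Beta-Binomial model used throughout this guide: updating the prior with all of the data at once yields exactly the same posterior as updating at a peek and then continuing from the interim posterior.

import numpy as np

np.random.seed(0)
data = np.random.binomial(1, 0.12, 400)
successes = int(data.sum())

# Batch: Beta(1, 1) prior updated with all 400 observations at once
batch = (1 + successes, 1 + len(data) - successes)

# Sequential: peek after 150 observations, then continue from the interim posterior
first, rest = data[:150], data[150:]
interim = (1 + int(first.sum()), 1 + len(first) - int(first.sum()))
sequential = (interim[0] + int(rest.sum()), interim[1] + len(rest) - int(rest.sum()))

print(f"Batch posterior:      Beta{batch}")
print(f"Sequential posterior: Beta{sequential}")   # identical parameters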

Practical Stopping Criteria
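
A common criterion is expected loss: the conversion rate you would give up, on average, if your decision turned out to be wrong. The sketch below checks it at regular intervals once a minimum sample size is reached and stops as soon as the expected loss of shipping (or of holding) falls below a small threshold.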

def monitor_experiment(control_stream, treatment_stream,
                       min_per_arm=100, max_per_arm=10000,
                       loss_threshold=0.0005, check_every=50):
    """
    Monitor a Bayesian experiment with stopping rules.
    """
    c_successes, c_total = 0, 0
    t_successes, t_total = 0, 0

    for i in range(max_per_arm):
        # Accumulate data
        c_successes += control_stream[i]
        c_total += 1
        t_successes += treatment_stream[i]
        t_total += 1

        # Check at intervals
        if c_total >= min_per_arm and c_total % check_every == 0:
            c_samples = stats.beta(1 + c_successes, 1 + c_total - c_successes).rvs(10000)
            t_samples = stats.beta(1 + t_successes, 1 + t_total - t_successes).rvs(10000)
            diff = t_samples - c_samples

            loss_ship = np.mean(np.maximum(-diff, 0))
            loss_hold = np.mean(np.maximum(diff, 0))
            p_better = np.mean(diff > 0)

            if loss_ship < loss_threshold:
                return {
                    'decision': 'Ship',
                    'n_per_arm': c_total,
                    'p_better': p_better,
                    'expected_loss': loss_ship
                }
            if loss_hold < loss_threshold:
                return {
                    'decision': 'Hold (control is better)',
                    'n_per_arm': c_total,
                    'p_better': p_better,
                    'expected_loss': loss_hold
                }

    return {'decision': 'Inconclusive', 'n_per_arm': max_per_arm}


# Simulate data stream
np.random.seed(42)
control = np.random.binomial(1, 0.12, 10000)
treatment = np.random.binomial(1, 0.14, 10000)

result = monitor_experiment(control, treatment)
print(f"Decision: {result['decision']} at n={result['n_per_arm']} per arm")
if 'p_better' in result:
    print(f"P(Treatment better): {result['p_better']:.1%}")

Comparison with Frequentist Planning

Aspect       | Frequentist Power Analysis            | Bayesian Planning
Goal         | Detect effect at given power/alpha    | Achieve desired posterior precision
Output       | Fixed sample size                     | Range or minimum sample size
Stopping     | Fixed (or sequential with correction) | Flexible (check anytime)
Priors       | Not used                              | Reduce required sample size
Computation  | Closed-form formulas                  | Simulation-based
Peeking      | Inflates false positive rate          | No inflation

Practical Recommendations

  1. Always set a minimum sample size: Even with Bayesian methods, tiny samples produce posteriors dominated by the prior. Aim for at least 100 events (conversions, clicks) per variant -- at a 12% conversion rate, that is roughly 830 visitors per variant.

  2. Use simulation for planning: There is no simple formula. Simulate data under your expected effect, compute posteriors, and find the sample size that gives you useful precision.

  3. Include priors in planning: If you have informative priors, factor them in. They reduce the data you need.

  4. Set a maximum duration: Experiments that run forever waste traffic. Set a calendar deadline even if using flexible stopping.



Key Takeaway

Bayesian sample size planning focuses on posterior precision rather than frequentist power. You ask "How many observations until my credible interval is narrow enough to make a decision?" instead of "How many observations to achieve 80% power?" This approach is more flexible, allows early stopping without error inflation, and naturally incorporates prior information. Use simulation to plan: generate datasets under your expected effect, compute posteriors, and find the sample size where the posterior is precise enough for your decision.


Frequently Asked Questions

Does Bayesian testing require fewer samples?
Not necessarily. With uninformative priors, Bayesian and frequentist sample sizes are similar. Informative priors can reduce sample sizes by incorporating prior knowledge, but only if the priors are accurate. The main advantage is flexibility: you can stop early when the posterior is precise enough, without inflating error rates.
How do I do a power analysis for a Bayesian test?
Instead of power (probability of rejecting H0), simulate experiments to find the sample size that gives you a posterior with the desired precision. For example: 'How many observations do I need so that the 95% credible interval for the difference is narrower than 2 percentage points?' Simulate data under the expected effect size and compute posterior widths.
Can I just run the experiment and stop whenever the posterior looks good?
Technically yes, Bayesian inference is valid at any stopping point. But with very small samples the posterior is dominated by the prior and may not reflect the data well. Set a minimum sample size (e.g., 100 conversions per arm) and then use posterior precision or expected loss as stopping criteria.
