A/B Testing

Sequential Testing: How to Peek at P-Values Without Inflating False Positives

Learn how sequential testing methods let you monitor A/B test results as data accumulates while maintaining valid statistical guarantees. Covers group sequential designs, always-valid inference, and practical implementation.

Quick Hits

  • Peeking at results without sequential methods inflates the false positive rate well beyond the nominal 5% (roughly triple with just five looks)
  • Group sequential designs pre-specify analysis times and adjust significance thresholds
  • Always-valid inference lets you peek anytime while maintaining valid confidence intervals
  • The cost of peeking correctly is wider intervals or a longer maximum runtime

TL;DR

Classical hypothesis tests assume you analyze data once at a pre-specified time. If you peek at results repeatedly—which every product team does—your false positive rate inflates dramatically. Sequential testing methods let you monitor experiments as data arrives while maintaining valid statistical guarantees. The cost: slightly wider confidence intervals or longer maximum runtime.


The Peeking Problem

You launch an A/B test planned for two weeks. After three days, stakeholders ask "how's it looking?" You run the analysis. Treatment is up 8% with p = 0.03. Ship it?

If this was your planned analysis time, maybe. But if you planned to run two weeks and just happened to check early, that p = 0.03 doesn't mean what you think.

Why Peeking Inflates False Positives

Classical p-values control the false positive rate at a single, pre-specified analysis time. Each additional peek is another opportunity to observe a spuriously significant result.

Think of it like coin flipping. A single flip comes up heads 50% of the time, but flip five times and the chance of seeing at least one head jumps to about 97%.

Similarly, check your experiment once at p < 0.05 and the false positive rate is 5%. Check repeatedly and every look is another chance for noise to cross the threshold, so the rate compounds.

Simulation Evidence

import numpy as np
from scipy import stats

def simulate_peeking_inflation(n_simulations=10000, sample_size=5000, n_peeks=5):
    """
    Simulate A/A tests with peeking to measure false positive inflation.

    Returns:
        False positive rate across simulations
    """
    peek_points = np.linspace(sample_size // n_peeks, sample_size, n_peeks, dtype=int)
    false_positives = 0

    for _ in range(n_simulations):
        # Generate data from identical distributions (true null)
        control = np.random.normal(0, 1, sample_size)
        treatment = np.random.normal(0, 1, sample_size)  # Same distribution!

        for n in peek_points:
            _, p_value = stats.ttest_ind(control[:n], treatment[:n])
            if p_value < 0.05:
                false_positives += 1
                break  # Count as FP if ever significant

    return false_positives / n_simulations


# Run simulation
fp_rate = simulate_peeking_inflation(n_peeks=5)
print(f"False positive rate with 5 peeks: {fp_rate:.1%}")
# Typically ~14%, nearly triple the nominal 5%!

With 5 equally spaced peeks, the false positive rate roughly triples to about 14%. Peek more often and it climbs further; with continuous monitoring, it approaches 100%—you'll eventually see "significance" by chance.
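
To see how quickly the inflation grows, you can sweep the number of looks with the same simulator (a rough sketch reusing simulate_peeking_inflation; n_simulations is lowered here to keep runtime short):

for k in [1, 2, 5, 10]:
    fp = simulate_peeking_inflation(n_simulations=2000, n_peeks=k)
    print(f"{k:>2} peeks: false positive rate = {fp:.1%}")
# Starts near 5% with a single look and climbs steadily as looks are added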


Solution 1: Group Sequential Designs

Group sequential designs pre-specify analysis times (called "looks") and adjust the significance threshold at each look to control overall false positive rate.

O'Brien-Fleming Boundaries

The O'Brien-Fleming approach uses very conservative thresholds early, becoming more liberal as the experiment progresses:

Look | Cumulative Sample | Significance Threshold
1    | 25%               | 0.00005
2    | 50%               | 0.0013
3    | 75%               | 0.0085
4    | 100%              | 0.041

Early significance requires overwhelming evidence. By the final look, the threshold is close to standard α = 0.05.

from scipy import stats
import numpy as np

def obrien_fleming_boundaries(n_looks, alpha=0.05):
    """
    Calculate O'Brien-Fleming spending function boundaries.

    Approximate boundaries using the spending function approach.
    """
    boundaries = []
    info_fractions = np.linspace(1/n_looks, 1, n_looks)

    for t in info_fractions:
        # O'Brien-Fleming spending function
        spent = 2 * (1 - stats.norm.cdf(stats.norm.ppf(1 - alpha/2) / np.sqrt(t)))
        z_boundary = stats.norm.ppf(1 - spent/2)
        boundaries.append({
            'info_fraction': t,
            'z_boundary': z_boundary,
            'p_threshold': 2 * (1 - stats.norm.cdf(z_boundary))
        })

    return boundaries


boundaries = obrien_fleming_boundaries(n_looks=4)
for b in boundaries:
    print(f"At {b['info_fraction']:.0%} info: z={b['z_boundary']:.2f}, "
          f"p-threshold={b['p_threshold']:.5f}")

Pocock Boundaries

Pocock boundaries use equal significance thresholds at each look:

Look | Significance Threshold
1-4  | 0.0182

Simpler to explain but less efficient—you "spend" more alpha early when data is sparse, and the final look must clear a threshold well below 0.05. A quick way to see where the 0.0182 figure comes from is to solve for it numerically, as in the sketch below.
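
Here is a minimal Monte Carlo sketch (pocock_threshold is a made-up helper, not a library routine): it finds the constant per-look threshold that keeps the overall false positive rate at alpha across equally spaced looks.

import numpy as np
from scipy import stats

def pocock_threshold(n_looks=4, alpha=0.05, n_sims=200_000, seed=0):
    """Numerically approximate Pocock's constant nominal p-value threshold."""
    rng = np.random.default_rng(seed)
    # Under the null, z-statistics at equally spaced looks are cumulative
    # sums of standard normal increments, standardized by sqrt(look index)
    increments = rng.standard_normal((n_sims, n_looks))
    z_paths = np.cumsum(increments, axis=1) / np.sqrt(np.arange(1, n_looks + 1))
    max_abs_z = np.abs(z_paths).max(axis=1)
    # The constant z-boundary is the (1 - alpha) quantile of max |z|
    z_boundary = np.quantile(max_abs_z, 1 - alpha)
    return 2 * (1 - stats.norm.cdf(z_boundary)), z_boundary


p_thresh, z_bound = pocock_threshold()
print(f"Pocock nominal p-threshold: {p_thresh:.4f} (z = {z_bound:.3f})")
# Should land close to the 0.0182 figure above for 4 looks at alpha = 0.05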

Python Implementation

Applying the boundaries computed above at each pre-specified look:

import numpy as np
from scipy import stats

def sequential_test(control, treatment, n_looks=4):
    """
    Run a group sequential test with pre-specified looks.
    """
    n = len(control)
    look_sizes = [n * (i+1) // n_looks for i in range(n_looks)]

    boundaries = obrien_fleming_boundaries(n_looks)

    for i, look_n in enumerate(look_sizes):
        c_sample = control[:look_n]
        t_sample = treatment[:look_n]

        stat, p_value = stats.ttest_ind(c_sample, t_sample)
        # Recover |z| from the two-sided p-value (the t statistic is effectively a z at these sample sizes)
        z = stats.norm.ppf(1 - p_value/2) if p_value < 1 else 0

        threshold = boundaries[i]['z_boundary']

        print(f"Look {i+1} (n={look_n}): z={abs(z):.3f}, threshold={threshold:.3f}")

        if abs(z) > threshold:
            return {
                'stopped_early': True,
                'look': i + 1,
                'p_value': p_value,
                'effect': np.mean(t_sample) - np.mean(c_sample)
            }

    return {
        'stopped_early': False,
        'look': n_looks,
        'p_value': p_value,
        'effect': np.mean(treatment) - np.mean(control)
    }
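
A quick usage sketch (the 0.05 lift and sample size are made up for illustration):

np.random.seed(42)
control = np.random.normal(0.00, 1, 20000)
treatment = np.random.normal(0.05, 1, 20000)

result = sequential_test(control, treatment, n_looks=4)
print(result)
# With a real effect of this size, the test often crosses a boundary before the final look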

R Implementation

library(gsDesign)

# Create a group sequential design
design <- gsDesign(
  k = 4,              # 4 looks
  test.type = 2,      # Two-sided, symmetric
  alpha = 0.025,      # gsDesign's alpha is one-sided; 0.025 per side = 5% two-sided
  beta = 0.2,         # 80% power
  sfu = "OF"          # O'Brien-Fleming boundaries
)

# View boundaries
print(design)

# Check at each look
# If z-statistic exceeds boundary, stop and reject

Solution 2: Always-Valid Inference

Group sequential designs require pre-specifying look times. What if you want to check anytime without a schedule?

Always-valid inference provides confidence sequences that remain valid no matter when or how often you look.

Confidence Sequences

A confidence sequence is a sequence of confidence intervals that, with probability 1-α, simultaneously contain the true parameter at all times.

import numpy as np

def confidence_sequence_mean(data, alpha=0.05):
    """
    Approximate always-valid confidence sequence for the mean.

    Width follows a law-of-iterated-logarithm-style boundary, so it
    shrinks like sqrt(log(log(n)) / n) rather than sqrt(1 / n).
    """
    n = len(data)
    if n < 2:
        return (-np.inf, np.inf)

    mean = np.mean(data)
    var = np.var(data, ddof=1)

    # Asymptotic confidence sequence radius (iterated-logarithm boundary);
    # wider than a fixed-n interval at every sample size
    width = np.sqrt(2 * var * (1 + 1/n) * (np.log(np.log(max(n, np.e))) + 0.72 + np.log(2/alpha)) / n)

    return (mean - width, mean + width)


def sequential_comparison(control, treatment, alpha=0.05):
    """
    Compare two groups using confidence sequences.

    Returns a conservative confidence sequence for the difference in
    means (treatment minus control).
    """
    # Build a separate sequence for each arm at level alpha/2, then
    # combine the radii (Bonferroni-style, so the result is conservative)
    cs_control = confidence_sequence_mean(control, alpha/2)
    cs_treatment = confidence_sequence_mean(treatment, alpha/2)

    diff = np.mean(treatment) - np.mean(control)
    radius = ((cs_treatment[1] - np.mean(treatment)) +
              (np.mean(control) - cs_control[0]))

    return diff - radius, diff + radius

Key Property

At any time t, the confidence sequence either:

  • Excludes zero → stop and declare significance
  • Includes zero → continue (or stop for futility)

You never "spend" significance level by peeking—the intervals are calibrated to be valid at all times simultaneously.

The Cost

Always-valid intervals are wider than fixed-time intervals at any given sample size. You pay for the flexibility to stop anytime with less precision at each moment.
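
To make that cost concrete, here is a rough width comparison at a single sample size, reusing confidence_sequence_mean from above against a standard fixed-n normal interval (illustrative numbers only):

from scipy import stats

np.random.seed(3)
data = np.random.normal(0, 1, 5000)

n = len(data)
sem = np.std(data, ddof=1) / np.sqrt(n)
fixed_halfwidth = stats.norm.ppf(0.975) * sem

cs_lo, cs_hi = confidence_sequence_mean(data)
cs_halfwidth = (cs_hi - cs_lo) / 2

print(f"Fixed-n 95% half-width:  {fixed_halfwidth:.4f}")
print(f"Always-valid half-width: {cs_halfwidth:.4f}")
# The always-valid interval is noticeably wider at any single n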


Solution 3: Bayesian Monitoring

Bayesian approaches sidestep the frequentist peeking problem because they make direct statements about parameter probabilities, not long-run error rates.

Bayesian A/B Testing

import numpy as np
from scipy import stats

def bayesian_ab_test(control_conversions, control_total,
                     treatment_conversions, treatment_total,
                     prior_alpha=1, prior_beta=1):
    """
    Bayesian comparison of two conversion rates.

    Uses Beta-Binomial conjugate model.
    Returns probability that treatment > control.
    """
    # Posterior distributions (Beta)
    control_alpha = prior_alpha + control_conversions
    control_beta = prior_beta + control_total - control_conversions

    treatment_alpha = prior_alpha + treatment_conversions
    treatment_beta = prior_beta + treatment_total - treatment_conversions

    # Monte Carlo estimate of P(treatment > control)
    n_samples = 100000
    control_samples = np.random.beta(control_alpha, control_beta, n_samples)
    treatment_samples = np.random.beta(treatment_alpha, treatment_beta, n_samples)

    prob_treatment_better = np.mean(treatment_samples > control_samples)

    return {
        'prob_treatment_better': prob_treatment_better,
        'prob_control_better': 1 - prob_treatment_better,
        'control_mean': control_alpha / (control_alpha + control_beta),
        'treatment_mean': treatment_alpha / (treatment_alpha + treatment_beta)
    }


# Example: Check at any time
result = bayesian_ab_test(
    control_conversions=500, control_total=10000,
    treatment_conversions=550, treatment_total=10000
)
print(f"P(treatment > control): {result['prob_treatment_better']:.1%}")

Stopping Rules

You can stop when:

  • P(treatment better) > 0.95 → ship treatment
  • P(control better) > 0.95 → keep control
  • Neither after maximum sample → declare inconclusive

The Catch

While Bayesian credible intervals don't suffer from the peeking problem in the same way, decision rules based on posterior probabilities can still have frequentist operating characteristics you care about (false positive rate, power). Design your stopping rules with these in mind.
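
One way to check those operating characteristics is a rough A/A simulation of the stopping rule above (aa_false_positive_rate is a made-up helper; the threshold, check points, and 5% base rate are illustrative, and n_sims is kept small for runtime):

import numpy as np

def aa_false_positive_rate(n_sims=200, n_max=20000, base_rate=0.05,
                           check_every=2000, threshold=0.95, seed=1):
    """Estimate how often the Bayesian rule 'ships' when there is no true effect."""
    rng = np.random.default_rng(seed)
    false_positives = 0
    for _ in range(n_sims):
        c = rng.random(n_max) < base_rate   # control conversions (A/A: same rate)
        t = rng.random(n_max) < base_rate   # treatment conversions
        for n in range(check_every, n_max + 1, check_every):
            res = bayesian_ab_test(c[:n].sum(), n, t[:n].sum(), n)
            if max(res['prob_treatment_better'],
                   res['prob_control_better']) > threshold:
                false_positives += 1
                break
    return false_positives / n_sims


print(f"A/A 'ship' rate with repeated checks: {aa_false_positive_rate():.1%}")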


Choosing a Method

Method           | Pros                                       | Cons
Group Sequential | Well-established theory, easy to explain   | Must pre-specify look times
Always-Valid     | Peek anytime, no schedule needed           | Wider intervals, newer theory
Bayesian         | Intuitive interpretation, natural stopping | Requires prior specification; frequentist properties need verification

Practical Recommendation

For most product teams:

  1. If you'll definitely peek on a schedule: Use group sequential design with O'Brien-Fleming boundaries
  2. If you need flexibility: Use always-valid confidence sequences
  3. If your org is Bayesian-friendly: Use Bayesian monitoring with calibrated stopping rules

Any of these is dramatically better than pretending you won't peek.


Implementation Checklist

Before the Test

  • Decide whether you'll use sequential testing
  • Choose your method (group sequential, always-valid, Bayesian)
  • Pre-specify analysis times (if group sequential)
  • Set stopping rules for both efficacy and futility
  • Document everything

During the Test

  • Only analyze at pre-specified times (if group sequential)
  • Use proper adjusted thresholds for significance
  • Record all looks and decisions

After the Test

  • Report using the sequential framework
  • Don't back-calculate "what the p-value would have been" with fixed-horizon methods

Common Mistakes

Using Fixed-Horizon P-Values After Peeking

If you peeked, your p-value from a standard test is not valid. Report results using the sequential method you actually used.

Informal "Just Checking"

There's no such thing as just checking without consequences. Either commit to not looking, or use sequential methods.

Stopping for Futility Without Planning

Stopping because results "look flat" is itself a form of peeking. Pre-specify futility stopping rules if you want that option.

Ignoring the Confidence Interval

Even with sequential testing, focus on the confidence interval, not just significance. A significant result with wide intervals may not be actionable.



Frequently Asked Questions

Q: How much extra sample does sequential testing require? A: Maximum sample is typically 20-30% larger than fixed-horizon. But average sample is often lower because you stop early for large effects.

Q: Can I convert an existing test to sequential testing mid-experiment? A: No. The validity guarantees require planning before data collection. You can't retroactively apply sequential methods.

Q: What if I need to peek but my org doesn't support sequential testing? A: Document every look you take. Report your results with appropriate caveats. Advocate for proper sequential methods in future experiments.

Q: Is Bayesian testing really immune to peeking? A: Bayesian posterior probabilities don't have the same peeking problem. However, if you care about frequentist error rates (and you should for decision-making), your stopping rules still need calibration.

Q: Can I use sequential testing for any metric? A: Yes, but it's most straightforward for means and proportions. Complex metrics may require bootstrap-based sequential methods.

Q: Should I always use sequential testing? A: If you'll ever look at results before the planned end date—and most teams do—then yes. The overhead is worth the validity guarantee.


Key Takeaway

If you're going to peek at your A/B test results (and you probably are), use sequential testing methods that are designed for it. The alternative—pretending you didn't peek—leads to false positives that erode trust in your experimentation program. The cost of doing it right (slightly wider intervals or longer runtime) is far less than the cost of systematic false discoveries.


References

  1. https://arxiv.org/abs/1512.04922
  2. https://www.tandfonline.com/doi/abs/10.1080/01621459.1977.10479947
  3. Johari, R., Koomen, P., Pekelis, L., & Walsh, D. (2017). Peeking at A/B Tests: Why It Matters, and What to Do About It. *KDD '17*.
  4. Jennison, C., & Turnbull, B. W. (2000). *Group Sequential Methods with Applications to Clinical Trials*. Chapman & Hall/CRC.
  5. Howard, S. R., Ramdas, A., McAuliffe, J., & Sekhon, J. (2021). Time-uniform, nonparametric, nonasymptotic confidence sequences. *The Annals of Statistics*, 49(2), 1055-1080.


Send to a friend

Share this with someone who loves clean statistical work.