
Sample Ratio Mismatch: Detection, Root Causes, and Solutions

How to detect sample ratio mismatch (SRM) in A/B tests, understand its common causes, and what to do when your experiment groups have unexpected sizes.


Quick Hits

  • SRM is the canary in the coal mine—it tells you something is wrong before you misinterpret results
  • Even a 1-2% ratio mismatch can invalidate your experiment
  • The most common causes are bugs in randomization, differential logging, and bot traffic
  • Never try to statistically 'correct' for SRM—find and fix the root cause instead

TL;DR

Sample Ratio Mismatch (SRM) occurs when the observed split between experiment groups differs significantly from the expected split. A 50/50 experiment showing 52/48 at any meaningful sample size is a problem. SRM means something is systematically biased in your randomization, logging, or user experience—and any "results" from that experiment are unreliable. Always check for SRM before interpreting results.


What Is Sample Ratio Mismatch?

In a properly randomized experiment, if you assign 50% of users to control and 50% to treatment, you should observe approximately 50% in each group. Random variation means you won't hit exactly 50/50, but the deviation should be explainable by chance.

SRM occurs when the observed ratio is statistically incompatible with the expected ratio. If you expected 50/50 but observed 52/48 with 100,000 users, that deviation is far too large to be random chance.
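
To see just how incompatible, here is a quick back-of-the-envelope check (a sketch using scipy, with the 52/48-of-100,000 numbers from the example above):

from scipy.stats import binomtest

# 52,000 of 100,000 users landed in control when 50% was expected.
result = binomtest(52000, n=100000, p=0.5)
print(f"p-value: {result.pvalue:.2e}")  # vanishingly small -- not chance

# Intuition: under a fair 50/50 split, the standard deviation of the control
# count is sqrt(100000 * 0.5 * 0.5) ~ 158 users, so a 2,000-user deviation
# sits roughly 12-13 standard deviations away from the expectation.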

Why It Matters

SRM indicates a systematic problem. Users in control and treatment are no longer comparable—they differ in some way beyond just the treatment. Any observed effect might be entirely due to this selection bias, not your change.

Examples:

  • Treatment has more bots → treatment's conversion rate is diluted (bots don't convert), so treatment looks worse than it is
  • Control has slower page loads → users drop off before being logged → control appears smaller but higher converting
  • Randomization bug assigns power users to treatment → treatment looks better regardless of change

Detecting SRM

The Chi-Square Test

Use a chi-square goodness-of-fit test to compare observed counts to expected counts:

from scipy.stats import chisquare
import numpy as np

def check_srm(observed_control, observed_treatment, expected_ratio=0.5):
    """
    Test for Sample Ratio Mismatch.

    Args:
        observed_control: Number of users in control
        observed_treatment: Number of users in treatment
        expected_ratio: Expected proportion in control (default 0.5)

    Returns:
        dict with SRM status, p-value, and observed ratio
    """
    total = observed_control + observed_treatment
    expected_control = total * expected_ratio
    expected_treatment = total * (1 - expected_ratio)

    observed = [observed_control, observed_treatment]
    expected = [expected_control, expected_treatment]

    stat, p_value = chisquare(observed, expected)

    observed_ratio = observed_control / total

    return {
        'has_srm': p_value < 0.001,  # Use stricter threshold
        'p_value': p_value,
        'observed_ratio': observed_ratio,
        'expected_ratio': expected_ratio,
        'chi_square': stat
    }


# Example: Expected 50/50, observed 51,500 control / 48,500 treatment
result = check_srm(51500, 48500, expected_ratio=0.5)
print(f"SRM detected: {result['has_srm']}")
print(f"P-value: {result['p_value']:.2e}")
print(f"Observed ratio: {result['observed_ratio']:.2%}")

R Implementation

check_srm <- function(obs_control, obs_treatment, expected_ratio = 0.5) {
  total <- obs_control + obs_treatment
  observed <- c(obs_control, obs_treatment)

  test <- chisq.test(observed, p = c(expected_ratio, 1 - expected_ratio))

  list(
    has_srm = test$p.value < 0.001,
    p_value = test$p.value,
    observed_ratio = obs_control / total
  )
}

# Example
result <- check_srm(51500, 48500)
print(result)

Threshold for Concern

Use p < 0.001 rather than p < 0.05. Because you're checking SRM routinely (every experiment), a 5% threshold would flag too many false alarms. The stricter threshold reduces false positives while still catching meaningful problems.
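
A small simulation makes the point concrete (a sketch: it repeatedly checks perfectly healthy 50/50 assignments and counts how often each threshold raises a false alarm):

import numpy as np
from scipy.stats import chisquare

rng = np.random.default_rng(42)
n_experiments, n_users = 1000, 100000
false_alarms = {0.05: 0, 0.001: 0}

for _ in range(n_experiments):
    # A healthy experiment: each user is a fair coin flip into control.
    control = rng.binomial(n_users, 0.5)
    _, p = chisquare([control, n_users - control], [n_users / 2, n_users / 2])
    for alpha in false_alarms:
        false_alarms[alpha] += p < alpha

print(false_alarms)  # roughly {0.05: ~50, 0.001: ~1} spurious SRM flags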


Common Causes of SRM

1. Randomization Bugs

Cause: The randomization code itself has a bug—perhaps a hash function that doesn't distribute evenly, or conditional logic that biases assignment.

Symptoms: Consistent SRM across all experiments using the same randomization system.

Fix: Audit randomization code. Run A/A tests to verify distribution.

# Example A/A test to verify randomization
def verify_randomization(user_ids, hash_function, n_simulations=100):
    """Run A/A simulation to check for randomization bias."""
    control_counts = []

    for seed in range(n_simulations):
        control = sum(1 for uid in user_ids
                     if hash_function(uid, seed) % 2 == 0)
        control_counts.append(control / len(user_ids))

    mean_ratio = np.mean(control_counts)
    std_ratio = np.std(control_counts)

    print(f"Mean control ratio: {mean_ratio:.4f} (should be ~0.5)")
    print(f"Std of ratio: {std_ratio:.4f}")

2. Redirect Latency / Differential Loading

Cause: Treatment requires a redirect or additional loading time. Some users navigate away before the experience loads.

Symptoms: Treatment group is smaller. Higher-intent or faster-connection users are overrepresented in treatment.

Fix: Ensure assignment happens as early as possible, ideally server-side before page render. Measure time-to-assignment.

3. Bot Traffic

Cause: Bots trigger randomization differently across variants—perhaps treatment has a new element that bots don't interact with, or bot detection fires at different rates.

Symptoms: SRM appears when including all traffic, resolves when excluding bots.

Fix: Improve bot detection. Run experiments on bot-filtered traffic. Check if bot rates differ by variant.
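
One way to check the last point is a simple 2x2 contingency test on human vs. bot counts per variant (a sketch; the counts are placeholders for what your own traffic logs would show):

from scipy.stats import chi2_contingency

# Rows: control, treatment. Columns: human sessions, bot sessions (placeholder counts).
counts = [[48000, 3500],
          [47000, 1500]]

chi2, p_value, dof, expected = chi2_contingency(counts)
print(f"p-value for 'bot rate differs by variant': {p_value:.2e}")
# A tiny p-value means bot volume (or bot detection) differs across variants,
# which can by itself produce SRM in the human-facing analysis.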

4. Browser Caching

Cause: Cached content serves stale experiences. Users in one variant might get cached pages from another variant's assignment.

Symptoms: SRM that varies by browser type or returning vs. new users.

Fix: Ensure experiment assignment is not cached, or cache is properly invalidated on assignment change.

5. Instrumentation / Logging Bugs

Cause: The assignment happens correctly, but logging fails differentially. Treatment might have new code that delays or blocks event logging.

Symptoms: SRM in logged data but not in actual assignment counts (if you can access those separately).

Fix: Audit logging pipelines. Ensure logging happens at assignment time, not later in the experience.

6. Trigger Timing Issues

Cause: Experiment triggers on some user action, but treatment affects whether/when that action occurs.

Symptoms: SRM with specific trigger conditions.

Fix: Use intent-to-treat assignment. Randomize all potentially eligible users, not just those who complete the triggering action.

7. Partial Exposure / Gradual Rollout

Cause: Treatment was ramped from 0% to 50% over time, but analysis treats all time periods equally.

Symptoms: SRM that reflects the weighted average of different allocation percentages.

Fix: Either analyze only the stable period, or weight observations by their allocation period.
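
If you do analyze across the ramp, the expected ratio is no longer 50/50 but the traffic-weighted average of each period's allocation, and the SRM check should use that instead. A minimal sketch reusing check_srm from above (the period totals and allocation fractions are hypothetical):

# (users arriving in the period, fraction allocated to treatment) -- hypothetical ramp
ramp_schedule = [
    (20000, 0.05),   # day 1: 5% treatment
    (40000, 0.20),   # days 2-3: 20% treatment
    (140000, 0.50),  # days 4-10: 50% treatment
]

total_users = sum(n for n, _ in ramp_schedule)
expected_treatment_share = sum(n * frac for n, frac in ramp_schedule) / total_users
expected_control_share = 1 - expected_treatment_share
print(f"Expected control share over the window: {expected_control_share:.2%}")

# Compare observed counts against this weighted expectation, not against 50/50.
result = check_srm(observed_control=121400, observed_treatment=78600,
                   expected_ratio=expected_control_share)
print(result)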


Diagnosing SRM

When you detect SRM, systematic diagnosis is essential:

Step 1: Confirm the Expected Ratio

Are you sure the expected split is what you think? Check:

  • Experiment configuration
  • Ramp-up history
  • Eligibility criteria that might differ

Step 2: Segment Analysis

Break down the ratio by dimensions to find where SRM concentrates:

def diagnose_srm_by_segment(df, segments=['browser', 'device', 'country']):
    """Check SRM within segments to diagnose root cause."""
    for segment in segments:
        print(f"\n--- SRM by {segment} ---")
        for value, group in df.groupby(segment):
            control = (group['variant'] == 'control').sum()
            treatment = (group['variant'] == 'treatment').sum()
            result = check_srm(control, treatment)

            if result['has_srm']:
                print(f"{value}: PROBLEM - ratio={result['observed_ratio']:.2%}, "
                      f"p={result['p_value']:.2e}")
            else:
                print(f"{value}: OK - ratio={result['observed_ratio']:.2%}")

Common segments to check:

  • Browser type
  • Device type (mobile/desktop)
  • New vs. returning users
  • Geographic region
  • Date of assignment
  • Bot vs. human

Step 3: Time Series Analysis

Plot ratio over time to identify when SRM started:

import pandas as pd
import matplotlib.pyplot as plt

def plot_srm_over_time(df, date_col='assignment_date'):
    """Plot control ratio over time to identify when SRM began."""
    daily = df.groupby(date_col).apply(
        lambda x: (x['variant'] == 'control').mean()
    )

    plt.figure(figsize=(12, 4))
    plt.plot(daily.index, daily.values)
    plt.axhline(y=0.5, color='r', linestyle='--', label='Expected')
    plt.xlabel('Date')
    plt.ylabel('Control Ratio')
    plt.title('Sample Ratio Over Time')
    plt.legend()
    plt.show()

Step 4: Check Logging vs. Assignment

If possible, compare counts at different pipeline stages:

  • Assignment system logs
  • Client-side event logs
  • Analytics system counts

Discrepancies reveal where data is being lost.
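
A simple way to localize the loss is to run the same ratio check at each stage and note where the mismatch first appears (a sketch reusing check_srm; the stage names and counts are placeholders):

# (pipeline stage, control count, treatment count) -- placeholder values
pipeline_counts = [
    ("assignment service", 50100, 49900),
    ("client-side events", 49800, 47600),
    ("analytics warehouse", 49750, 47550),
]

for stage, control, treatment in pipeline_counts:
    result = check_srm(control, treatment)
    status = "SRM" if result['has_srm'] else "ok"
    print(f"{stage:<22} ratio={result['observed_ratio']:.2%}  "
          f"p={result['p_value']:.2e}  [{status}]")
# If assignment looks healthy but client-side events do not, the problem is
# differential logging rather than the randomizer itself.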


What NOT to Do

Don't Ignore It

"It's only 2% off" is not acceptable. Even small SRM indicates a systematic problem that might be larger in specific segments or future experiments.

Don't Statistically Correct

You cannot re-weight or adjust for SRM. The bias is unknown and potentially correlated with your outcome. Any "correction" is speculation.

Don't Cherry-Pick Segments

"SRM goes away if we exclude mobile Safari users" might be diagnosing the cause, or it might be p-hacking. Only exclude segments if you have a clear causal explanation for why that segment's data is corrupted.

Don't Interpret Results

An experiment with SRM has no valid results. The treatment effect estimate is biased by an unknown amount in an unknown direction.


What TO Do

1. Pause Interpretation

Don't ship based on an experiment with SRM. You don't know what the true effect is.

2. Investigate Root Cause

Use the diagnostic steps above. Find the actual reason for the mismatch.

3. Fix the Problem

Address the root cause in your experiment infrastructure.

4. Run a Clean Experiment

Start fresh after fixing. Don't try to salvage data from the compromised experiment.

5. Prevent Future SRM

  • Add automated SRM checks to your experiment analysis pipeline
  • Run A/A tests regularly to validate randomization
  • Monitor SRM in real-time during experiment runtime

Automating SRM Detection

Build SRM checking into your experiment workflow:

class ExperimentAnalysis:
    def __init__(self, control_data, treatment_data, expected_ratio=0.5):
        self.control = control_data
        self.treatment = treatment_data
        self.expected_ratio = expected_ratio

    def check_srm(self):
        """Check for SRM before any other analysis."""
        result = check_srm(
            len(self.control),
            len(self.treatment),
            self.expected_ratio
        )

        if result['has_srm']:
            raise ValueError(
                f"SRM detected! Observed ratio {result['observed_ratio']:.2%} "
                f"vs expected {result['expected_ratio']:.2%}. "
                f"p-value: {result['p_value']:.2e}. "
                f"Analysis cannot proceed until SRM is resolved."
            )

        return True

    def analyze(self):
        """Run full analysis only if SRM check passes."""
        self.check_srm()
        # ... proceed with analysis ...
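
A minimal usage sketch (the user lists below are placeholders; in practice they would come from your assignment logs):

# Placeholder user records sized to reproduce the 51,500 / 48,500 example above.
control_users = list(range(51500))
treatment_users = list(range(48500))

analysis = ExperimentAnalysis(control_users, treatment_users, expected_ratio=0.5)
analysis.analyze()  # raises ValueError here, because the split shows SRM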


Frequently Asked Questions

Q: My SRM is significant but only 0.5% off from expected. Is that really a problem? A: With large samples, even tiny deviations can be significant. Focus on practical significance: is the deviation large enough to materially bias your results? For mission-critical decisions, investigate any significant SRM. For lower-stakes tests, document it and proceed with caution.

Q: Can SRM happen by chance in a properly run experiment? A: At p < 0.001, a properly run experiment will be flagged by chance only about 0.1% of the time. If you run 1,000 experiments per year, expect roughly one spurious SRM alarm. True SRM (indicating real problems) is far more common.

Q: We found the bug and fixed it. Can we use data from after the fix? A: Possibly, if the fix was clean and you have enough data from the post-fix period. But be careful: users who entered during the buggy period might still be in the post-fix data if they returned. Starting a fresh experiment is safer.

Q: Does SRM affect all metrics equally? A: No. If SRM causes treatment to have more high-intent users, conversion rates will be biased more than metrics that don't correlate with intent. But you can't know which metrics are more or less biased without knowing the exact cause.

Q: How much mismatch is acceptable? A: Technically, any statistically significant mismatch indicates a problem. In practice, focus on mismatches greater than 1% that are statistically significant (p < 0.001). Smaller mismatches may be noise, but significant ones need investigation.

Q: Can I still trust my results if SRM is small? A: Small SRM might not materially affect results, but it indicates a systematic issue. Investigate the cause—the same bug causing minor SRM might cause larger bias in specific segments.

Q: My SRM disappeared after excluding bots. Is the experiment valid now? A: Maybe. If bots were the only cause and your bot detection is sound, the experiment may be fine. But document this carefully and verify bot detection isn't correlated with treatment.


Key Takeaway

Sample ratio mismatch is your experiment's check engine light. It tells you something is wrong with the randomization or measurement process. Never ignore it, never try to correct for it statistically—find and fix the root cause, then run a clean experiment. Building automated SRM checks into your analysis pipeline is one of the highest-value investments an experimentation program can make.
