
Experiment Guardrails: Stopping Rules, Ramp Criteria, and Managing Risk

Protect your experiments and users with proper guardrails. Learn when to stop an experiment, how to safely ramp exposure, and what metrics should trigger automatic rollback.

Quick Hits

  • Guardrails = metrics that trigger alerts or automatic rollback when violated
  • Define stopping rules before launch: 'Stop if error rate increases >20%'
  • Ramp gradually: 1% → 5% → 25% → 50% → 100%, with checks at each stage
  • Distinguish between 'ship' metrics (must improve) and 'guardrail' metrics (must not regress)
  • Automated monitoring catches problems faster than manual review

TL;DR

Guardrails protect experiments from causing harm: metrics that must not regress (error rates, latency), thresholds that trigger alerts (>10% increase), and automatic rollback rules. Define them before launch. Ramp exposure gradually (1% → 5% → 25% → 50% → 100%) with checks at each stage. When a guardrail fires, investigate immediately—the cost of a false alarm is low; the cost of shipping a broken feature is high.


What Are Guardrails?

Definition

Guardrails = Metrics that must not significantly regress, regardless of primary metric performance.

Primary Metric: "What we want to improve"
  Example: Conversion rate

Guardrail Metrics: "What we must not break"
  Examples: Error rate, latency, crashes, revenue
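
To make the distinction concrete, here is a minimal sketch of an experiment plan that declares the primary metric alongside its guardrails. The structure and names (experiment_plan, the threshold strings) are illustrative assumptions, not any particular platform's schema.

# Minimal sketch: one primary metric we want to move, several guardrails we must not break.
# Names and thresholds are illustrative, not a real experimentation platform's API.
experiment_plan = {
    'name': 'new_checkout_button',
    'primary_metric': {'name': 'conversion_rate', 'goal': 'increase'},
    'guardrails': [
        {'name': 'error_rate', 'rule': 'no increase > 10%'},
        {'name': 'latency_p95', 'rule': 'no increase > 15%'},
        {'name': 'crash_rate', 'rule': 'no increase > 5%'},
        {'name': 'revenue_per_user', 'rule': 'no decrease > 2%'},
    ],
}

for g in experiment_plan['guardrails']:
    print(f"Guardrail: {g['name']} -> {g['rule']}")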

The Guardrail Mentality

| Without Guardrails | With Guardrails |
| --- | --- |
| "Conversion up 5%! Ship it!" | "Conversion up 5%, but error rate up 30%. Investigate." |
| Problems discovered post-launch | Problems caught during experiment |
| Reactive incident response | Proactive risk management |
| "Why didn't we catch this?" | "Guardrail triggered—good thing we checked" |

Types of Guardrails

1. Primary Guardrails (Must Monitor)

These should be in every experiment:

def define_primary_guardrails():
    """
    Core guardrails that apply to almost every experiment.
    """
    guardrails = {
        'error_rate': {
            'description': 'Server errors (5xx), client errors',
            'threshold': 'No increase > 10%',
            'rationale': 'User experience, data integrity'
        },
        'latency_p95': {
            'description': '95th percentile page load time',
            'threshold': 'No increase > 15%',
            'rationale': 'User experience, SEO impact'
        },
        'crash_rate': {
            'description': 'App crashes per session',
            'threshold': 'No increase > 5%',
            'rationale': 'Severe user impact'
        },
        'sample_ratio': {
            'description': 'Ratio of users in each variant',
            'threshold': 'SRM test p > 0.001',
            'rationale': 'Data quality, randomization bugs'
        }
    }

    print("PRIMARY GUARDRAILS")
    print("=" * 60)
    for name, config in guardrails.items():
        print(f"\n{name.upper()}")
        print(f"  Description: {config['description']}")
        print(f"  Threshold: {config['threshold']}")
        print(f"  Why: {config['rationale']}")

    return guardrails


define_primary_guardrails()

2. Secondary Guardrails (Context-Dependent)

Add based on your feature:

def define_secondary_guardrails(feature_type):
    """
    Context-dependent guardrails based on feature type.
    """
    guardrail_sets = {
        'checkout_flow': [
            {'metric': 'payment_errors', 'threshold': 'No increase > 5%'},
            {'metric': 'abandoned_carts', 'threshold': 'No increase > 10%'},
            {'metric': 'support_tickets', 'threshold': 'No increase > 20%'},
            {'metric': 'refund_rate', 'threshold': 'No increase > 5%'},
        ],
        'search_ranking': [
            {'metric': 'zero_result_rate', 'threshold': 'No increase > 5%'},
            {'metric': 'time_to_first_click', 'threshold': 'No increase > 15%'},
            {'metric': 'query_abandonment', 'threshold': 'No increase > 10%'},
        ],
        'notification': [
            {'metric': 'unsubscribe_rate', 'threshold': 'No increase > 10%'},
            {'metric': 'spam_reports', 'threshold': 'No increase > 5%'},
            {'metric': 'app_uninstalls', 'threshold': 'No increase > 5%'},
        ],
        'performance': [
            {'metric': 'memory_usage', 'threshold': 'No increase > 10%'},
            {'metric': 'battery_drain', 'threshold': 'No increase > 10%'},
            {'metric': 'data_usage', 'threshold': 'No increase > 20%'},
        ]
    }

    guardrails = guardrail_sets.get(feature_type, [])

    print(f"GUARDRAILS FOR: {feature_type.upper()}")
    print("=" * 50)
    for g in guardrails:
        print(f"  • {g['metric']}: {g['threshold']}")

    return guardrails


define_secondary_guardrails('checkout_flow')

3. Business Guardrails

Metrics where regression has business consequences:

## Business Guardrails

### Revenue
- **Threshold**: No decrease > 2%
- **When to use**: Features not intended to affect revenue
- **Why**: Even unrelated features can accidentally hurt revenue

### Engagement (DAU/MAU)
- **Threshold**: No decrease > 1%
- **When to use**: Any major user-facing change
- **Why**: Proxy for long-term health

### Retention (Day 7, Day 30)
- **Threshold**: No decrease (within measurement precision)
- **When to use**: Onboarding, core experience changes
- **Why**: Leading indicator of churn
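
If you want these in the same shape as the primary guardrails above, a sketch like the following works; the metric keys and thresholds simply restate the list above and should be tuned to your own measurement precision.

def define_business_guardrails():
    """
    Business guardrails in the same dict format as the primary guardrails above.
    Thresholds restate the list above; adjust to your own measurement precision.
    """
    return {
        'revenue_per_user': {
            'description': 'Revenue per exposed user',
            'threshold': 'No decrease > 2%',
            'rationale': 'Even unrelated features can accidentally hurt revenue'
        },
        'dau_mau_ratio': {
            'description': 'Engagement (DAU/MAU)',
            'threshold': 'No decrease > 1%',
            'rationale': 'Proxy for long-term health'
        },
        'retention_d7': {
            'description': 'Day 7 retention',
            'threshold': 'No decrease beyond measurement precision',
            'rationale': 'Leading indicator of churn'
        }
    }


for name, config in define_business_guardrails().items():
    print(f"  • {name}: {config['threshold']}")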

Setting Thresholds

Framework for Threshold Selection

import numpy as np


def set_guardrail_threshold(historical_data, sensitivity='medium'):
    """
    Set guardrail threshold based on historical variability.

    Parameters:
    -----------
    historical_data : array
        Historical values of the metric (e.g., daily error rates)
    sensitivity : str
        'high' (tight), 'medium', or 'low' (loose)
    """
    mean = np.mean(historical_data)
    std = np.std(historical_data)

    # Sensitivity multipliers
    multipliers = {
        'high': 1.5,    # Tight: catch small changes, more false alarms
        'medium': 2.5,  # Balanced
        'low': 4.0      # Loose: only catch big changes, fewer false alarms
    }

    multiplier = multipliers.get(sensitivity, 2.5)

    # Threshold as relative increase from mean
    threshold_absolute = mean + multiplier * std
    threshold_relative = (threshold_absolute - mean) / mean * 100

    return {
        'baseline_mean': mean,
        'baseline_std': std,
        'threshold_absolute': threshold_absolute,
        'threshold_relative_pct': threshold_relative,
        'sensitivity': sensitivity
    }


# Example: Error rate over 30 days
historical_error_rate = np.array([
    0.012, 0.011, 0.013, 0.012, 0.014, 0.011, 0.010, 0.013,
    0.012, 0.015, 0.011, 0.012, 0.013, 0.011, 0.012, 0.014,
    0.010, 0.011, 0.012, 0.013, 0.012, 0.011, 0.012, 0.013,
    0.012, 0.011, 0.013, 0.012, 0.011, 0.012
])

for sens in ['high', 'medium', 'low']:
    result = set_guardrail_threshold(historical_error_rate, sens)
    print(f"\n{sens.upper()} Sensitivity:")
    print(f"  Baseline: {result['baseline_mean']:.2%} ± {result['baseline_std']:.2%}")
    print(f"  Threshold: {result['threshold_absolute']:.2%}")
    print(f"  Alert if: >{result['threshold_relative_pct']:.0f}% increase")

Threshold by Metric Type

| Metric Type | Typical Threshold | Rationale |
| --- | --- | --- |
| Error rate | 10-20% increase | Balance sensitivity/noise |
| Latency (P95) | 15-25% increase | Noticeable to users above this |
| Crash rate | 5-10% increase | High impact, tight threshold |
| Revenue | 2-5% decrease | Business critical |
| Conversion | Context-dependent | May be primary metric |

Stopping Rules

When to Stop Immediately

def should_stop_experiment(guardrail_results):
    """
    Determine if experiment should be stopped based on guardrail status.
    """
    stop_reasons = []

    for metric, result in guardrail_results.items():
        # Immediate stop conditions
        if result['status'] == 'violated':
            if result['severity'] == 'critical':
                stop_reasons.append({
                    'action': 'STOP IMMEDIATELY',
                    'metric': metric,
                    'reason': result['reason'],
                    'urgency': 'NOW'
                })
            elif result['severity'] == 'warning':
                stop_reasons.append({
                    'action': 'PAUSE AND INVESTIGATE',
                    'metric': metric,
                    'reason': result['reason'],
                    'urgency': '24 hours'
                })

    if stop_reasons:
        print("⚠️  STOPPING RULES TRIGGERED")
        print("=" * 50)
        for reason in stop_reasons:
            print(f"\n{reason['action']} ({reason['urgency']})")
            print(f"  Metric: {reason['metric']}")
            print(f"  Reason: {reason['reason']}")
    else:
        print("✓ All guardrails passing")

    return stop_reasons


# Example
guardrail_results = {
    'error_rate': {
        'status': 'violated',
        'severity': 'critical',
        'reason': 'Error rate increased 45% (threshold: 20%)'
    },
    'latency_p95': {
        'status': 'ok',
        'severity': None,
        'reason': 'Within threshold'
    },
    'srm': {
        'status': 'violated',
        'severity': 'critical',
        'reason': 'Sample ratio mismatch detected (p < 0.001)'
    }
}

should_stop_experiment(guardrail_results)

Stopping Rule Framework

## Pre-Defined Stopping Rules

### STOP IMMEDIATELY
- SRM detected (p < 0.001)
- Error rate increase > 50%
- Crash rate increase > 25%
- Revenue decrease > 10%
- Any safety-related regression

### PAUSE AND INVESTIGATE (24 hours)
- Error rate increase > 20%
- Latency increase > 30%
- Significant support ticket increase
- Unexplained metric anomalies

### CONTINUE WITH MONITORING
- Error rate increase 10-20% (close to threshold)
- Minor latency increase
- Primary metric not yet significant (expected)

### DO NOT STOP FOR
- Primary metric not significant yet (patience)
- Stakeholder wants early results
- Competitor pressure
- Point estimate is negative but CI includes positive
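
As a sketch, these rules can be encoded so a monitoring job classifies each observed change automatically; the function below uses the thresholds listed above, and its name and signature are assumptions rather than a standard API.

def classify_stopping_action(metric, relative_change_pct, srm_p_value=1.0):
    """
    Map an observed treatment-vs-control change (in percent, positive = increase)
    to the pre-defined stopping rules above. Thresholds mirror that list.
    """
    if srm_p_value < 0.001:
        return 'STOP IMMEDIATELY'          # SRM is always an immediate stop

    immediate = {'error_rate': 50, 'crash_rate': 25}   # % increase
    pause = {'error_rate': 20, 'latency_p95': 30}      # % increase

    if metric == 'revenue' and relative_change_pct < -10:
        return 'STOP IMMEDIATELY'
    if metric in immediate and relative_change_pct > immediate[metric]:
        return 'STOP IMMEDIATELY'
    if metric in pause and relative_change_pct > pause[metric]:
        return 'PAUSE AND INVESTIGATE'
    return 'CONTINUE WITH MONITORING'


print(classify_stopping_action('error_rate', 45))   # PAUSE AND INVESTIGATE
print(classify_stopping_action('crash_rate', 30))   # STOP IMMEDIATELY
print(classify_stopping_action('latency_p95', 12))  # CONTINUE WITH MONITORING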

Ramp Criteria

The Ramp Schedule

def create_ramp_schedule(feature_risk='medium'):
    """
    Create exposure ramp schedule based on feature risk.
    """
    schedules = {
        'low': [
            {'pct': 5, 'duration': '1 day', 'checks': ['basic']},
            {'pct': 25, 'duration': '2 days', 'checks': ['basic', 'metrics']},
            {'pct': 100, 'duration': 'ongoing', 'checks': ['full']},
        ],
        'medium': [
            {'pct': 1, 'duration': '1 day', 'checks': ['basic']},
            {'pct': 5, 'duration': '2 days', 'checks': ['basic', 'metrics']},
            {'pct': 25, 'duration': '3 days', 'checks': ['full']},
            {'pct': 50, 'duration': '3 days', 'checks': ['full']},
            {'pct': 100, 'duration': 'ongoing', 'checks': ['full']},
        ],
        'high': [
            {'pct': 0.1, 'duration': '1 day', 'checks': ['manual']},
            {'pct': 1, 'duration': '2 days', 'checks': ['basic', 'manual']},
            {'pct': 5, 'duration': '3 days', 'checks': ['full']},
            {'pct': 10, 'duration': '3 days', 'checks': ['full']},
            {'pct': 25, 'duration': '5 days', 'checks': ['full']},
            {'pct': 50, 'duration': '5 days', 'checks': ['full']},
            {'pct': 100, 'duration': 'ongoing', 'checks': ['full']},
        ]
    }

    schedule = schedules.get(feature_risk, schedules['medium'])

    print(f"RAMP SCHEDULE ({feature_risk.upper()} risk)")
    print("=" * 60)
    for stage in schedule:
        print(f"\n{stage['pct']:>5}% exposure for {stage['duration']}")
        print(f"       Checks: {', '.join(stage['checks'])}")

    return schedule


create_ramp_schedule('high')

What to Check at Each Stage

## Ramp Check Types

### 'basic' (1% exposure)
- Feature loads without errors
- No obvious bugs in logs
- SRM check passing

### 'metrics' (5-10% exposure)
- Guardrail metrics within bounds
- No anomalous patterns
- SRM still passing

### 'full' (25%+ exposure)
- All guardrail metrics checked with statistical tests
- Primary metric trending as expected (or at least not negative)
- User feedback (if available)
- Support ticket volume

### 'manual' (high-risk features)
- Manual QA of user sessions
- Review of edge cases
- Stakeholder sign-off before next stage
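
One lightweight way to wire these into the ramp schedule above is a lookup that expands each stage's 'checks' field into a concrete checklist; the check wording restates the list above, and the helper names are illustrative.

RAMP_CHECKS = {
    'basic': [
        'Feature loads without errors',
        'No obvious bugs in logs',
        'SRM check passing',
    ],
    'metrics': [
        'Guardrail metrics within bounds',
        'No anomalous patterns',
        'SRM still passing',
    ],
    'full': [
        'All guardrail metrics checked with statistical tests',
        'Primary metric trending as expected (or at least not negative)',
        'User feedback reviewed (if available)',
        'Support ticket volume reviewed',
    ],
    'manual': [
        'Manual QA of user sessions',
        'Edge cases reviewed',
        'Stakeholder sign-off before next stage',
    ],
}


def expand_checks(stage):
    """Expand a ramp stage's check types into a concrete checklist."""
    return [item for check in stage['checks'] for item in RAMP_CHECKS[check]]


stage = {'pct': 5, 'duration': '2 days', 'checks': ['basic', 'metrics']}
for item in expand_checks(stage):
    print(f"- [ ] {item}")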

Automated Ramp Decision

def ramp_decision(current_pct, guardrail_results, days_at_current):
    """
    Decide whether to ramp up, hold, or roll back.
    """
    # Count guardrail statuses
    violations = sum(1 for g in guardrail_results.values()
                    if g['status'] == 'violated')
    warnings = sum(1 for g in guardrail_results.values()
                  if g['status'] == 'warning')

    # Decision logic
    if violations > 0:
        return {
            'decision': 'ROLL BACK',
            'reason': f'{violations} guardrail violation(s)',
            'action': 'Reduce to previous stage or 0%'
        }
    elif warnings >= 2:
        return {
            'decision': 'HOLD',
            'reason': f'{warnings} guardrail warnings',
            'action': 'Investigate before proceeding'
        }
    elif days_at_current < 2 and current_pct < 100:
        return {
            'decision': 'HOLD',
            'reason': 'Minimum duration not met',
            'action': f'Wait {2 - days_at_current} more day(s)'
        }
    else:
        return {
            'decision': 'RAMP UP',
            'reason': 'All checks passing',
            'action': 'Proceed to next exposure level'
        }


# Example
decision = ramp_decision(
    current_pct=5,
    guardrail_results={
        'error_rate': {'status': 'ok'},
        'latency': {'status': 'warning'},
        'srm': {'status': 'ok'}
    },
    days_at_current=3
)

print(f"Decision: {decision['decision']}")
print(f"Reason: {decision['reason']}")
print(f"Action: {decision['action']}")

Automated Monitoring

Monitoring System Requirements

def design_monitoring_system():
    """
    Key components of an experiment monitoring system.
    """
    components = {
        'real_time_metrics': {
            'latency': '< 5 min delay',
            'errors': '< 5 min delay',
            'crashes': '< 15 min delay',
            'update_frequency': 'Every 5 minutes'
        },
        'batch_metrics': {
            'conversion': 'Daily',
            'revenue': 'Daily',
            'retention': 'Weekly',
            'update_frequency': 'Once per day'
        },
        'alerts': {
            'channels': ['Slack', 'PagerDuty', 'Email'],
            'escalation': 'Auto-page on-call for critical',
            'deduplication': 'Alert once per threshold cross'
        },
        'dashboards': {
            'real_time': 'Current guardrail status',
            'trends': 'Metric trends over experiment',
            'comparison': 'Control vs Treatment plots'
        }
    }

    print("MONITORING SYSTEM DESIGN")
    print("=" * 60)
    for component, config in components.items():
        print(f"\n{component.upper()}")
        for key, value in config.items():
            print(f"  {key}: {value}")


design_monitoring_system()

Alert Configuration

# Example alert configuration

alerts:
  - name: "Error Rate Critical"
    metric: error_rate
    condition: "> baseline * 1.5"  # 50% increase
    severity: critical
    action: "page_oncall"

  - name: "Error Rate Warning"
    metric: error_rate
    condition: "> baseline * 1.2"  # 20% increase
    severity: warning
    action: "slack_channel"

  - name: "SRM Detected"
    metric: sample_ratio_mismatch
    condition: "p_value < 0.001"
    severity: critical
    action: "pause_experiment"

  - name: "Latency Degradation"
    metric: latency_p95
    condition: "> baseline * 1.25"
    severity: warning
    action: "slack_channel"

  - name: "Revenue Drop"
    metric: revenue_per_user
    condition: "< baseline * 0.95"  # 5% decrease
    severity: critical
    action: "page_oncall"

Responding to Guardrail Violations

The Response Playbook

## Guardrail Violation Response Playbook

### Step 1: Verify (5 minutes)
- [ ] Is the alert real or a data glitch?
- [ ] Check metric source for data quality
- [ ] Verify experiment is still running correctly

### Step 2: Assess Severity (5 minutes)
- [ ] How far above threshold?
- [ ] How many users affected?
- [ ] Is it getting worse?

### Step 3: Decide (5 minutes)
- [ ] CRITICAL: Pause experiment immediately
- [ ] WARNING: Continue monitoring, prepare to pause
- [ ] FALSE ALARM: Document and close

### Step 4: Investigate (if paused)
- [ ] Check logs for errors
- [ ] Compare control vs treatment behavior
- [ ] Identify root cause
- [ ] Determine if fixable

### Step 5: Resolve
- [ ] Fix issue and resume, OR
- [ ] Roll back permanently, OR
- [ ] Adjust guardrail if false positive

### Step 6: Document
- [ ] Record what happened
- [ ] Note what you learned
- [ ] Update guardrail thresholds if needed

Common False Positives

| False Positive | How to Identify | Prevention |
| --- | --- | --- |
| Seasonal spike | Affects control too | Compare vs control, not just baseline |
| Data delay | Resolves in minutes | Wait for data freshness |
| Infrastructure issue | Affects all experiments | Cross-experiment correlation |
| Metric calculation bug | Requires investigation | Validate metric pipeline |

R Implementation

# Function to check guardrails
check_guardrails <- function(
  treatment_metrics,
  control_metrics,
  thresholds
) {
  results <- list()

  for (metric_name in names(thresholds)) {
    treatment_val <- treatment_metrics[[metric_name]]
    control_val <- control_metrics[[metric_name]]
    threshold <- thresholds[[metric_name]]

    # Calculate relative difference
    rel_diff <- (treatment_val - control_val) / control_val * 100

    # Check against threshold
    if (rel_diff > threshold) {
      status <- "VIOLATED"
    } else if (rel_diff > threshold * 0.8) {
      status <- "WARNING"
    } else {
      status <- "OK"
    }

    results[[metric_name]] <- list(
      treatment = treatment_val,
      control = control_val,
      difference_pct = rel_diff,
      threshold_pct = threshold,
      status = status
    )
  }

  return(results)
}

# Example
treatment <- list(error_rate = 0.025, latency_p95 = 220)
control <- list(error_rate = 0.020, latency_p95 = 200)
thresholds <- list(error_rate = 20, latency_p95 = 15)  # % increase

results <- check_guardrails(treatment, control, thresholds)

for (metric in names(results)) {
  r <- results[[metric]]
  cat(sprintf("\n%s: %s\n", toupper(metric), r$status))
  cat(sprintf("  Treatment: %.3f, Control: %.3f\n", r$treatment, r$control))
  cat(sprintf("  Difference: %.1f%% (threshold: %.1f%%)\n",
              r$difference_pct, r$threshold_pct))
}

Pre-Launch Checklist

## Guardrail Pre-Launch Checklist

### Guardrail Definition
- [ ] Primary guardrails identified (error rate, latency, crashes)
- [ ] Secondary guardrails based on feature type
- [ ] Thresholds set based on historical variance
- [ ] Thresholds documented in experiment plan

### Monitoring Setup
- [ ] Real-time dashboards configured
- [ ] Alerts configured with correct thresholds
- [ ] Alert routing to correct channels/people
- [ ] Tested alert system with fake data (see the sketch after this checklist)

### Ramp Plan
- [ ] Ramp schedule defined
- [ ] Criteria for advancing to next stage
- [ ] Rollback plan documented
- [ ] Owner for each ramp decision

### Response Plan
- [ ] On-call rotation for experiment period
- [ ] Escalation path documented
- [ ] Rollback procedure tested
- [ ] Post-mortem template ready
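
For the "tested alert system with fake data" item, a minimal sketch of such a test might look like the following; the threshold multiplier, noise level, and injected spike size are arbitrary assumptions for illustration.

import numpy as np


def test_alert_with_fake_data(baseline=0.012, threshold_multiplier=1.5, seed=0):
    """
    Sanity-check alert wiring before launch with synthetic data:
    the rule should stay quiet on baseline-like noise and fire on an injected spike.
    """
    rng = np.random.default_rng(seed)
    normal_day = baseline * (1 + rng.normal(0, 0.05))   # ordinary day-to-day noise
    bad_day = baseline * 1.8                            # injected 80% incident

    def fires(value):
        return value > baseline * threshold_multiplier

    assert not fires(normal_day), "Alert fired on ordinary noise (too sensitive)"
    assert fires(bad_day), "Alert missed an injected 80% spike (too loose)"
    print("Alert test passed: quiet on noise, fires on injected spike")


test_alert_with_fake_data()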


Key Takeaway

Guardrails protect your experiments, your users, and your credibility. Define them before launch: metrics that must not regress (error rates, latency, crashes, revenue), thresholds that trigger alerts (typically 10-20% increase), and clear actions for violations (pause, investigate, rollback). Ramp exposure gradually—1% catches bugs, 10% validates at scale, 50% confirms consistency—with automated monitoring at each stage. When a guardrail fires, take it seriously: investigate before dismissing as noise. The cost of a false alarm is inconvenience; the cost of shipping a broken feature is user harm and lost trust. Err on the side of caution.



Frequently Asked Questions

What metrics should be guardrails?

Metrics where regression would be unacceptable regardless of primary metric gains: error rates, latency, crash rates, support tickets, revenue (for features not expected to affect it). Also include metrics that indicate data quality issues like sample ratio mismatch.

How sensitive should guardrail thresholds be?

Balance false alarms vs missed problems. Too sensitive (1% error increase) = constant alerts on noise. Too loose (50% increase) = real problems slip through. Start with meaningful thresholds (10-20% for most metrics) and calibrate based on experience.

When should I stop an experiment early?

Stop immediately for: guardrail violations, data quality issues (SRM), safety concerns. Consider stopping for: clear negative primary result, external events invalidating the experiment. Don't stop for: primary result not yet significant (be patient), stakeholder impatience.

