Experiment Guardrails: Stopping Rules, Ramp Criteria, and Managing Risk
Protect your experiments and users with proper guardrails. Learn when to stop an experiment, how to safely ramp exposure, and what metrics should trigger automatic rollback.
Quick Hits
- Guardrails = metrics that trigger alerts or automatic rollback when violated
- Define stopping rules before launch: 'Stop if error rate increases >20%'
- Ramp gradually: 1% → 5% → 25% → 50% → 100%, with checks at each stage
- Distinguish between 'ship' metrics (must improve) and 'guardrail' metrics (must not regress)
- Automated monitoring catches problems faster than manual review
TL;DR
Guardrails protect experiments from causing harm: metrics that must not regress (error rates, latency), thresholds that trigger alerts (>10% increase), and automatic rollback rules. Define them before launch. Ramp exposure gradually (1% → 5% → 25% → 50% → 100%) with checks at each stage. When a guardrail fires, investigate immediately—the cost of a false alarm is low; the cost of shipping a broken feature is high.
What Are Guardrails?
Definition
Guardrails = Metrics that must not significantly regress, regardless of primary metric performance.
Primary Metric: "What we want to improve"
Example: Conversion rate
Guardrail Metrics: "What we must not break"
Examples: Error rate, latency, crashes, revenue
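To make the split concrete, it helps to declare both kinds of metrics up front in the experiment plan. Below is a minimal sketch of such a config; the field names (`primary_metric`, `guardrail_metrics`, `bad_direction`, `threshold_pct`) are illustrative, not any particular platform's schema.

```python
# Minimal sketch: separate the "ship" metric from the "must not break" metrics.
# Field names are illustrative, not a specific experimentation platform's API.
experiment_config = {
    'name': 'checkout_button_redesign',
    'primary_metric': {
        'name': 'conversion_rate',      # what we want to improve
        'goal': 'increase'
    },
    'guardrail_metrics': [
        {'name': 'error_rate',       'bad_direction': 'up',   'threshold_pct': 10},
        {'name': 'latency_p95',      'bad_direction': 'up',   'threshold_pct': 15},
        {'name': 'crash_rate',       'bad_direction': 'up',   'threshold_pct': 5},
        {'name': 'revenue_per_user', 'bad_direction': 'down', 'threshold_pct': 2},
    ]
}
```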
The Guardrail Mentality
| Without Guardrails | With Guardrails |
|---|---|
| "Conversion up 5%! Ship it!" | "Conversion up 5%, but error rate up 30%. Investigate." |
| Problems discovered post-launch | Problems caught during experiment |
| Reactive incident response | Proactive risk management |
| "Why didn't we catch this?" | "Guardrail triggered—good thing we checked" |
Types of Guardrails
1. Primary Guardrails (Must Monitor)
These should be in every experiment:
def define_primary_guardrails():
    """
    Core guardrails that apply to almost every experiment.
    """
    guardrails = {
        'error_rate': {
            'description': 'Server errors (5xx), client errors',
            'threshold': 'No increase > 10%',
            'rationale': 'User experience, data integrity'
        },
        'latency_p95': {
            'description': '95th percentile page load time',
            'threshold': 'No increase > 15%',
            'rationale': 'User experience, SEO impact'
        },
        'crash_rate': {
            'description': 'App crashes per session',
            'threshold': 'No increase > 5%',
            'rationale': 'Severe user impact'
        },
        'sample_ratio': {
            'description': 'Ratio of users in each variant',
            'threshold': 'SRM test p > 0.001',
            'rationale': 'Data quality, randomization bugs'
        }
    }

    print("PRIMARY GUARDRAILS")
    print("=" * 60)
    for name, config in guardrails.items():
        print(f"\n{name.upper()}")
        print(f" Description: {config['description']}")
        print(f" Threshold: {config['threshold']}")
        print(f" Why: {config['rationale']}")

    return guardrails

define_primary_guardrails()
2. Secondary Guardrails (Context-Dependent)
Add based on your feature:
def define_secondary_guardrails(feature_type):
    """
    Context-dependent guardrails based on feature type.
    """
    guardrail_sets = {
        'checkout_flow': [
            {'metric': 'payment_errors', 'threshold': 'No increase > 5%'},
            {'metric': 'abandoned_carts', 'threshold': 'No increase > 10%'},
            {'metric': 'support_tickets', 'threshold': 'No increase > 20%'},
            {'metric': 'refund_rate', 'threshold': 'No increase > 5%'},
        ],
        'search_ranking': [
            {'metric': 'zero_result_rate', 'threshold': 'No increase > 5%'},
            {'metric': 'time_to_first_click', 'threshold': 'No increase > 15%'},
            {'metric': 'query_abandonment', 'threshold': 'No increase > 10%'},
        ],
        'notification': [
            {'metric': 'unsubscribe_rate', 'threshold': 'No increase > 10%'},
            {'metric': 'spam_reports', 'threshold': 'No increase > 5%'},
            {'metric': 'app_uninstalls', 'threshold': 'No increase > 5%'},
        ],
        'performance': [
            {'metric': 'memory_usage', 'threshold': 'No increase > 10%'},
            {'metric': 'battery_drain', 'threshold': 'No increase > 10%'},
            {'metric': 'data_usage', 'threshold': 'No increase > 20%'},
        ]
    }

    guardrails = guardrail_sets.get(feature_type, [])

    print(f"GUARDRAILS FOR: {feature_type.upper()}")
    print("=" * 50)
    for g in guardrails:
        print(f" • {g['metric']}: {g['threshold']}")

    return guardrails

define_secondary_guardrails('checkout_flow')
3. Business Guardrails
Metrics where regression has business consequences; a minimal check for the decrease direction is sketched after this list:
## Business Guardrails
### Revenue
- **Threshold**: No decrease > 2%
- **When to use**: Features not intended to affect revenue
- **Why**: Even unrelated features can accidentally hurt revenue
### Engagement (DAU/MAU)
- **Threshold**: No decrease > 1%
- **When to use**: Any major user-facing change
- **Why**: Proxy for long-term health
### Retention (Day 7, Day 30)
- **Threshold**: No decrease (within measurement precision)
- **When to use**: Onboarding, core experience changes
- **Why**: Leading indicator of churn
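Because business guardrails fire on decreases rather than increases, the comparison flips sign. A minimal sketch, with an illustrative function name and made-up numbers:

```python
def check_business_guardrails(treatment, control, max_decrease_pct):
    """Flag business metrics that dropped more than their allowed decrease.

    treatment / control: dicts of metric -> observed value.
    max_decrease_pct: dict of metric -> allowed decrease in percent.
    """
    flags = {}
    for metric, allowed in max_decrease_pct.items():
        change_pct = (treatment[metric] - control[metric]) / control[metric] * 100
        flags[metric] = {
            'change_pct': round(change_pct, 2),
            'status': 'violated' if change_pct < -allowed else 'ok'
        }
    return flags

# Illustrative numbers only: revenue down 3% (violates the 2% bound), DAU down 0.5% (ok)
print(check_business_guardrails(
    treatment={'revenue_per_user': 4.85, 'dau': 98_500},
    control={'revenue_per_user': 5.00, 'dau': 99_000},
    max_decrease_pct={'revenue_per_user': 2, 'dau': 1}
))
```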
Setting Thresholds
Framework for Threshold Selection
import numpy as np
from scipy import stats

def set_guardrail_threshold(historical_data, sensitivity='medium'):
    """
    Set guardrail threshold based on historical variability.

    Parameters:
    -----------
    historical_data : array
        Historical values of the metric (e.g., daily error rates)
    sensitivity : str
        'high' (tight), 'medium', or 'low' (loose)
    """
    mean = np.mean(historical_data)
    std = np.std(historical_data)

    # Sensitivity multipliers
    multipliers = {
        'high': 1.5,    # Tight: catch small changes, more false alarms
        'medium': 2.5,  # Balanced
        'low': 4.0      # Loose: only catch big changes, fewer false alarms
    }
    multiplier = multipliers.get(sensitivity, 2.5)

    # Threshold as relative increase from mean
    threshold_absolute = mean + multiplier * std
    threshold_relative = (threshold_absolute - mean) / mean * 100

    return {
        'baseline_mean': mean,
        'baseline_std': std,
        'threshold_absolute': threshold_absolute,
        'threshold_relative_pct': threshold_relative,
        'sensitivity': sensitivity
    }

# Example: Error rate over 30 days
historical_error_rate = np.array([
    0.012, 0.011, 0.013, 0.012, 0.014, 0.011, 0.010, 0.013,
    0.012, 0.015, 0.011, 0.012, 0.013, 0.011, 0.012, 0.014,
    0.010, 0.011, 0.012, 0.013, 0.012, 0.011, 0.012, 0.013,
    0.012, 0.011, 0.013, 0.012, 0.011, 0.012
])

for sens in ['high', 'medium', 'low']:
    result = set_guardrail_threshold(historical_error_rate, sens)
    print(f"\n{sens.upper()} Sensitivity:")
    print(f" Baseline: {result['baseline_mean']:.2%} ± {result['baseline_std']:.2%}")
    print(f" Threshold: {result['threshold_absolute']:.2%}")
    print(f" Alert if: >{result['threshold_relative_pct']:.0f}% increase")
Threshold by Metric Type
| Metric Type | Typical Threshold | Rationale |
|---|---|---|
| Error rate | 10-20% increase | Balance sensitivity/noise |
| Latency (P95) | 15-25% increase | Noticeable to users above this |
| Crash rate | 5-10% increase | High impact, tight threshold |
| Revenue | 2-5% decrease | Business critical |
| Conversion | Context-dependent | May be primary metric |
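At larger exposures, guardrails are usually checked with a statistical test rather than a raw point estimate (see the 'full' ramp checks later in this article). One possible sketch for an error-rate guardrail combines the relative-lift threshold from the table with a one-sided two-proportion z-test; the counts below are made up and the choice of test is an assumption, not a prescribed method.

```python
import numpy as np
from scipy.stats import norm

def error_rate_guardrail(errors_t, n_t, errors_c, n_c,
                         max_increase_pct=10, alpha=0.05):
    """Sketch of a statistical error-rate guardrail: flag a violation only when
    the observed lift exceeds the allowed increase AND the increase is unlikely
    to be noise (one-sided two-proportion z-test)."""
    p_t, p_c = errors_t / n_t, errors_c / n_c
    lift_pct = (p_t - p_c) / p_c * 100

    # Pooled standard error for the difference in proportions
    p_pool = (errors_t + errors_c) / (n_t + n_c)
    se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_t + 1 / n_c))
    z = (p_t - p_c) / se
    p_value = norm.sf(z)  # P(an increase this large arises by chance)

    violated = lift_pct > max_increase_pct and p_value < alpha
    return {'lift_pct': round(lift_pct, 1), 'p_value': round(p_value, 4),
            'violated': violated}

# Illustrative counts only: ~21% lift, p ≈ 0.003 -> violated
print(error_rate_guardrail(errors_t=460, n_t=30_000, errors_c=380, n_c=30_000))
```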
Stopping Rules
When to Stop Immediately
def should_stop_experiment(guardrail_results):
    """
    Determine if experiment should be stopped based on guardrail status.
    """
    stop_reasons = []

    for metric, result in guardrail_results.items():
        # Immediate stop conditions
        if result['status'] == 'violated':
            if result['severity'] == 'critical':
                stop_reasons.append({
                    'action': 'STOP IMMEDIATELY',
                    'metric': metric,
                    'reason': result['reason'],
                    'urgency': 'NOW'
                })
            elif result['severity'] == 'warning':
                stop_reasons.append({
                    'action': 'PAUSE AND INVESTIGATE',
                    'metric': metric,
                    'reason': result['reason'],
                    'urgency': '24 hours'
                })

    if stop_reasons:
        print("⚠️ STOPPING RULES TRIGGERED")
        print("=" * 50)
        for reason in stop_reasons:
            print(f"\n{reason['action']} ({reason['urgency']})")
            print(f" Metric: {reason['metric']}")
            print(f" Reason: {reason['reason']}")
    else:
        print("✓ All guardrails passing")

    return stop_reasons

# Example
guardrail_results = {
    'error_rate': {
        'status': 'violated',
        'severity': 'critical',
        'reason': 'Error rate increased 45% (threshold: 20%)'
    },
    'latency_p95': {
        'status': 'ok',
        'severity': None,
        'reason': 'Within threshold'
    },
    'srm': {
        'status': 'violated',
        'severity': 'critical',
        'reason': 'Sample ratio mismatch detected (p < 0.001)'
    }
}

should_stop_experiment(guardrail_results)
Stopping Rule Framework
## Pre-Defined Stopping Rules
### STOP IMMEDIATELY
- SRM detected (p < 0.001)
- Error rate increase > 50%
- Crash rate increase > 25%
- Revenue decrease > 10%
- Any safety-related regression
### PAUSE AND INVESTIGATE (24 hours)
- Error rate increase > 20%
- Latency increase > 30%
- Significant support ticket increase
- Unexplained metric anomalies
### CONTINUE WITH MONITORING
- Error rate increase 10-20% (close to threshold)
- Minor latency increase
- Primary metric not yet significant (expected)
### DO NOT STOP FOR
- Primary metric not significant yet (patience)
- Stakeholder wants early results
- Competitor pressure
- Point estimate is negative but CI includes positive
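These tiers are easiest to enforce when written down as data that both humans and monitoring code read. A minimal sketch, with thresholds copied from the list above and an illustrative structure:

```python
# Stopping-rule tiers from the list above, expressed as data so that monitoring
# code and the experiment plan use the same numbers. Structure is illustrative.
STOPPING_RULES = {
    'stop_immediately': {
        'srm_p_value_below': 0.001,
        'error_rate_increase_pct': 50,
        'crash_rate_increase_pct': 25,
        'revenue_decrease_pct': 10,
    },
    'pause_and_investigate': {
        'error_rate_increase_pct': 20,
        'latency_increase_pct': 30,
    },
}

def classify_error_rate(increase_pct):
    """Map an observed error-rate increase onto the tiers above."""
    if increase_pct > STOPPING_RULES['stop_immediately']['error_rate_increase_pct']:
        return 'STOP IMMEDIATELY'
    if increase_pct > STOPPING_RULES['pause_and_investigate']['error_rate_increase_pct']:
        return 'PAUSE AND INVESTIGATE'
    return 'CONTINUE WITH MONITORING'

print(classify_error_rate(60))  # above 50% -> STOP IMMEDIATELY
print(classify_error_rate(25))  # above 20% -> PAUSE AND INVESTIGATE
```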
Ramp Criteria
The Ramp Schedule
def create_ramp_schedule(feature_risk='medium'):
    """
    Create exposure ramp schedule based on feature risk.
    """
    schedules = {
        'low': [
            {'pct': 5, 'duration': '1 day', 'checks': ['basic']},
            {'pct': 25, 'duration': '2 days', 'checks': ['basic', 'metrics']},
            {'pct': 100, 'duration': 'ongoing', 'checks': ['full']},
        ],
        'medium': [
            {'pct': 1, 'duration': '1 day', 'checks': ['basic']},
            {'pct': 5, 'duration': '2 days', 'checks': ['basic', 'metrics']},
            {'pct': 25, 'duration': '3 days', 'checks': ['full']},
            {'pct': 50, 'duration': '3 days', 'checks': ['full']},
            {'pct': 100, 'duration': 'ongoing', 'checks': ['full']},
        ],
        'high': [
            {'pct': 0.1, 'duration': '1 day', 'checks': ['manual']},
            {'pct': 1, 'duration': '2 days', 'checks': ['basic', 'manual']},
            {'pct': 5, 'duration': '3 days', 'checks': ['full']},
            {'pct': 10, 'duration': '3 days', 'checks': ['full']},
            {'pct': 25, 'duration': '5 days', 'checks': ['full']},
            {'pct': 50, 'duration': '5 days', 'checks': ['full']},
            {'pct': 100, 'duration': 'ongoing', 'checks': ['full']},
        ]
    }

    schedule = schedules.get(feature_risk, schedules['medium'])

    print(f"RAMP SCHEDULE ({feature_risk.upper()} risk)")
    print("=" * 60)
    for stage in schedule:
        print(f"\n{stage['pct']:>5}% exposure for {stage['duration']}")
        print(f" Checks: {', '.join(stage['checks'])}")

    return schedule

create_ramp_schedule('high')
What to Check at Each Stage
## Ramp Check Types
### 'basic' (1% exposure)
- Feature loads without errors
- No obvious bugs in logs
- SRM check passing
### 'metrics' (5-10% exposure)
- Guardrail metrics within bounds
- No anomalous patterns
- SRM still passing
### 'full' (25%+ exposure)
- All guardrail metrics checked with statistical tests
- Primary metric trending as expected (or at least not negative)
- User feedback (if available)
- Support ticket volume
### 'manual' (high-risk features)
- Manual QA of user sessions
- Review of edge cases
- Stakeholder sign-off before next stage
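One way to wire these check types into the ramp schedule is a registry mapping each check name to a function, so each stage's `checks` list simply indexes into it. The function names and experiment-state fields below are placeholders, not a real monitoring API:

```python
# Hypothetical check registry: each ramp stage from create_ramp_schedule()
# lists keys into this dict; the runner calls whatever the stage requires.
def check_basic(exp):    # feature loads, no obvious log errors, SRM passing
    return all([exp['loads_ok'], not exp['log_errors'], exp['srm_ok']])

def check_metrics(exp):  # guardrail metrics within bounds, SRM still passing
    return exp['guardrails_ok'] and exp['srm_ok']

def check_full(exp):     # statistical guardrail tests + primary metric direction
    return exp['guardrails_ok'] and exp['primary_not_negative']

RAMP_CHECKS = {'basic': check_basic, 'metrics': check_metrics, 'full': check_full}

def run_stage_checks(stage, exp_state):
    """Run every automated check listed for a ramp stage; all must pass to advance.
    'manual' checks stay human-driven and are simply skipped here."""
    return all(RAMP_CHECKS[name](exp_state) for name in stage['checks']
               if name in RAMP_CHECKS)

# Illustrative usage
stage = {'pct': 5, 'duration': '2 days', 'checks': ['basic', 'metrics']}
state = {'loads_ok': True, 'log_errors': False, 'srm_ok': True,
         'guardrails_ok': True, 'primary_not_negative': True}
print(run_stage_checks(stage, state))  # True -> OK to advance
```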
Automated Ramp Decision
def ramp_decision(current_pct, guardrail_results, days_at_current):
    """
    Decide whether to ramp up, hold, or roll back.
    """
    # Count guardrail statuses
    violations = sum(1 for g in guardrail_results.values()
                     if g['status'] == 'violated')
    warnings = sum(1 for g in guardrail_results.values()
                   if g['status'] == 'warning')

    # Decision logic
    if violations > 0:
        return {
            'decision': 'ROLL BACK',
            'reason': f'{violations} guardrail violation(s)',
            'action': 'Reduce to previous stage or 0%'
        }
    elif warnings >= 2:
        return {
            'decision': 'HOLD',
            'reason': f'{warnings} guardrail warnings',
            'action': 'Investigate before proceeding'
        }
    elif days_at_current < 2 and current_pct < 100:
        return {
            'decision': 'HOLD',
            'reason': 'Minimum duration not met',
            'action': f'Wait {2 - days_at_current} more day(s)'
        }
    else:
        return {
            'decision': 'RAMP UP',
            'reason': 'All checks passing',
            'action': 'Proceed to next exposure level'
        }

# Example
decision = ramp_decision(
    current_pct=5,
    guardrail_results={
        'error_rate': {'status': 'ok'},
        'latency': {'status': 'warning'},
        'srm': {'status': 'ok'}
    },
    days_at_current=3
)

print(f"Decision: {decision['decision']}")
print(f"Reason: {decision['reason']}")
print(f"Action: {decision['action']}")
Automated Monitoring
Monitoring System Requirements
def design_monitoring_system():
    """
    Key components of an experiment monitoring system.
    """
    components = {
        'real_time_metrics': {
            'latency': '< 5 min delay',
            'errors': '< 5 min delay',
            'crashes': '< 15 min delay',
            'update_frequency': 'Every 5 minutes'
        },
        'batch_metrics': {
            'conversion': 'Daily',
            'revenue': 'Daily',
            'retention': 'Weekly',
            'update_frequency': 'Once per day'
        },
        'alerts': {
            'channels': ['Slack', 'PagerDuty', 'Email'],
            'escalation': 'Auto-page on-call for critical',
            'deduplication': 'Alert once per threshold cross'
        },
        'dashboards': {
            'real_time': 'Current guardrail status',
            'trends': 'Metric trends over experiment',
            'comparison': 'Control vs Treatment plots'
        }
    }

    print("MONITORING SYSTEM DESIGN")
    print("=" * 60)
    for component, config in components.items():
        print(f"\n{component.upper()}")
        for key, value in config.items():
            print(f" {key}: {value}")

design_monitoring_system()
Alert Configuration
# Example alert configuration
alerts:
  - name: "Error Rate Critical"
    metric: error_rate
    condition: "> baseline * 1.5" # 50% increase
    severity: critical
    action: "page_oncall"

  - name: "Error Rate Warning"
    metric: error_rate
    condition: "> baseline * 1.2" # 20% increase
    severity: warning
    action: "slack_channel"

  - name: "SRM Detected"
    metric: sample_ratio_mismatch
    condition: "p_value < 0.001"
    severity: critical
    action: "pause_experiment"

  - name: "Latency Degradation"
    metric: latency_p95
    condition: "> baseline * 1.25"
    severity: warning
    action: "slack_channel"

  - name: "Revenue Drop"
    metric: revenue_per_user
    condition: "< baseline * 0.95" # 5% decrease
    severity: critical
    action: "page_oncall"
Responding to Guardrail Violations
The Response Playbook
## Guardrail Violation Response Playbook
### Step 1: Verify (5 minutes)
- [ ] Is the alert real or a data glitch?
- [ ] Check metric source for data quality
- [ ] Verify experiment is still running correctly
### Step 2: Assess Severity (5 minutes)
- [ ] How far above threshold?
- [ ] How many users affected?
- [ ] Is it getting worse?
### Step 3: Decide (5 minutes)
- [ ] CRITICAL: Pause experiment immediately
- [ ] WARNING: Continue monitoring, prepare to pause
- [ ] FALSE ALARM: Document and close
### Step 4: Investigate (if paused)
- [ ] Check logs for errors
- [ ] Compare control vs treatment behavior
- [ ] Identify root cause
- [ ] Determine if fixable
### Step 5: Resolve
- [ ] Fix issue and resume, OR
- [ ] Roll back permanently, OR
- [ ] Adjust guardrail if false positive
### Step 6: Document
- [ ] Record what happened
- [ ] Note what you learned
- [ ] Update guardrail thresholds if needed
Common False Positives
| False Positive | How to Identify | Prevention |
|---|---|---|
| Seasonal spike | Affects control too | Compare vs control, not just baseline |
| Data delay | Resolves in minutes | Wait for data freshness |
| Infrastructure issue | Affects all experiments | Cross-experiment correlation |
| Metric calculation bug | Only surfaces through investigation | Validate the metric pipeline before launch |
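The first row is the most common trap: if the treatment metric jumps but the control jumped by roughly the same amount, the cause is almost certainly external rather than the variant. A small sketch of that check, with an illustrative tolerance and made-up numbers:

```python
def looks_like_seasonal_spike(treatment_now, control_now, control_baseline,
                              tolerance_pct=5):
    """If control moved roughly as much as treatment did, the spike is probably
    external (seasonality, an incident), not caused by the variant."""
    control_shift_pct = (control_now - control_baseline) / control_baseline * 100
    treatment_vs_control_pct = (treatment_now - control_now) / control_now * 100
    return (control_shift_pct > tolerance_pct
            and abs(treatment_vs_control_pct) <= tolerance_pct)

# Both arms jumped ~50% versus the historical baseline, but they match each other:
print(looks_like_seasonal_spike(treatment_now=0.018, control_now=0.0175,
                                control_baseline=0.012))  # True -> likely seasonal
```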
R Implementation
# Function to check guardrails
check_guardrails <- function(
  treatment_metrics,
  control_metrics,
  thresholds
) {
  results <- list()

  for (metric_name in names(thresholds)) {
    treatment_val <- treatment_metrics[[metric_name]]
    control_val <- control_metrics[[metric_name]]
    threshold <- thresholds[[metric_name]]

    # Calculate relative difference
    rel_diff <- (treatment_val - control_val) / control_val * 100

    # Check against threshold
    if (rel_diff > threshold) {
      status <- "VIOLATED"
    } else if (rel_diff > threshold * 0.8) {
      status <- "WARNING"
    } else {
      status <- "OK"
    }

    results[[metric_name]] <- list(
      treatment = treatment_val,
      control = control_val,
      difference_pct = rel_diff,
      threshold_pct = threshold,
      status = status
    )
  }

  return(results)
}

# Example
treatment <- list(error_rate = 0.025, latency_p95 = 220)
control <- list(error_rate = 0.020, latency_p95 = 200)
thresholds <- list(error_rate = 20, latency_p95 = 15) # % increase

results <- check_guardrails(treatment, control, thresholds)

for (metric in names(results)) {
  r <- results[[metric]]
  cat(sprintf("\n%s: %s\n", toupper(metric), r$status))
  cat(sprintf(" Treatment: %.3f, Control: %.3f\n", r$treatment, r$control))
  cat(sprintf(" Difference: %.1f%% (threshold: %.1f%%)\n",
              r$difference_pct, r$threshold_pct))
}
Pre-Launch Checklist
## Guardrail Pre-Launch Checklist
### Guardrail Definition
- [ ] Primary guardrails identified (error rate, latency, crashes)
- [ ] Secondary guardrails based on feature type
- [ ] Thresholds set based on historical variance
- [ ] Thresholds documented in experiment plan
### Monitoring Setup
- [ ] Real-time dashboards configured
- [ ] Alerts configured with correct thresholds
- [ ] Alert routing to correct channels/people
- [ ] Tested alert system with fake data
### Ramp Plan
- [ ] Ramp schedule defined
- [ ] Criteria for advancing to next stage
- [ ] Rollback plan documented
- [ ] Owner for each ramp decision
### Response Plan
- [ ] On-call rotation for experiment period
- [ ] Escalation path documented
- [ ] Rollback procedure tested
- [ ] Post-mortem template ready
Related Articles
- Analytics Reporting (Pillar) - Complete reporting guide
- SRM Detection - Sample ratio mismatch
- Sequential Testing - Valid early stopping
- When to Say Inconclusive - Decision rules
Key Takeaway
Guardrails protect your experiments, your users, and your credibility. Define them before launch: metrics that must not regress (error rates, latency, crashes, revenue), thresholds that trigger alerts (typically 10-20% increase), and clear actions for violations (pause, investigate, rollback). Ramp exposure gradually—1% catches bugs, 10% validates at scale, 50% confirms consistency—with automated monitoring at each stage. When a guardrail fires, take it seriously: investigate before dismissing as noise. The cost of a false alarm is inconvenience; the cost of shipping a broken feature is user harm and lost trust. Err on the side of caution.
References
- https://www.microsoft.com/en-us/research/publication/top-challenges-from-the-first-practical-online-controlled-experiments-summit/
- https://arxiv.org/abs/1710.08217
- https://www.exp-platform.com/Documents/controlledExperimentDMKD.pdf
- Kohavi, R., Tang, D., & Xu, Y. (2020). *Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing*. Cambridge University Press.
- Fabijan, A., et al. (2017). The benefits of controlled experimentation at scale. *ICSE-SEIP*, 137-146.
- Deng, A., et al. (2017). Trustworthy analysis of online A/B tests: Pitfalls, challenges and solutions. *WSDM*, 641-649.
Frequently Asked Questions
What metrics should be guardrails?
Every experiment should monitor error rate, P95 latency, crash rate, and the sample ratio (SRM). Add feature-specific guardrails (payment errors for checkout changes, unsubscribe rate for notifications) and business guardrails such as revenue, engagement, and retention.
How sensitive should guardrail thresholds be?
Base thresholds on historical variability: the framework above uses 1.5 (tight) to 4 (loose) standard deviations above baseline, with about 2.5 as a balanced default. Typical values are a 10-20% increase for error rate, 15-25% for P95 latency, 5-10% for crash rate, and a 2-5% decrease for revenue.
When should I stop an experiment early?
Stop immediately for SRM, error rate increases above 50%, crash rate increases above 25%, revenue decreases above 10%, or any safety regression. Pause and investigate within 24 hours for smaller violations. Don't stop just because the primary metric isn't significant yet.