The One-Slide Experiment Readout: Five Numbers That Matter
A template for presenting experiment results in one slide. Focus on the five numbers executives actually need to make a decision.
Quick Hits
- Five numbers: effect size, CI bounds, p-value, sample size, and practical threshold
- Answer the question before explaining the methodology
- Color-code: green for ship, red for don't ship, yellow for inconclusive
- Include one sentence on the key caveat or limitation
- Put methodology details in backup slides, not the main readout
TL;DR
One slide, five numbers, one recommendation. Executives need effect size, confidence interval bounds, p-value, sample size, and how you defined success—not your methodology. Lead with the answer (ship/don't ship/inconclusive), show the key numbers, add one caveat, and save the details for backup slides.
The Five Numbers
What Executives Need
| Number | What It Answers |
|---|---|
| Effect size | "How much did it move?" |
| CI lower bound | "Could it actually be zero or negative?" |
| CI upper bound | "How good could it be?" |
| p-value | "Is this real or noise?" |
| Sample size | "Do we have enough data?" |
Plus: your pre-specified success threshold, the minimum detectable effect (MDE)
What They Don't Need (On This Slide)
- Which statistical test you used
- How you calculated standard errors
- Details of your randomization scheme
- All 15 secondary metrics
- Segment breakdowns (save for Q&A)
The Template
┌─────────────────────────────────────────────────────────────────┐
│ │
│ EXPERIMENT: [Feature Name] │
│ ───────────────────────────────────────────────────────────── │
│ │
│ RECOMMENDATION: [🟢 SHIP / 🔴 DON'T SHIP / 🟡 INCONCLUSIVE] │
│ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ │ │
│ │ PRIMARY METRIC: Conversion Rate │ │
│ │ │ │
│ │ Treatment: 4.52% Control: 4.18% │ │
│ │ │ │
│ │ LIFT: +8.1% │ │
│ │ 95% CI: +3.2% to +13.0% │ │
│ │ p-value: 0.001 │ │
│ │ │ │
│ │ Success threshold (MDE): +5% │ │
│ │ Sample: 125,000 users (104% of target) │ │
│ │ │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
│ GUARDRAILS: ✅ Revenue stable | ✅ Errors stable │
│ │
│ CAVEAT: Mobile conversion flat (+1%, not significant); │
│ gains driven by desktop users │
│ │
└─────────────────────────────────────────────────────────────────┘
Color Coding Decisions
Green: Ship
Criteria:
- CI lower bound > 0
- Effect exceeds MDE (or CI lower bound exceeds MDE)
- No guardrail violations
- Sufficient sample size
Red: Don't Ship
Criteria:
- CI upper bound < 0 (clear negative effect), OR
- Guardrail violations, OR
- Point estimate clearly negative, even if the CI's upper edge barely crosses zero
Yellow: Inconclusive
Criteria:
- CI includes both meaningful positive and meaningful negative
- Insufficient sample size
- Conflicting metrics
- Needs more data or deliberate risk decision
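The three criteria lists above can be sketched as one small decision helper. This is a hypothetical `classify` function, not part of any library; effect values are relative lifts in percent, and `sample_frac` is achieved sample divided by target:

```python
def classify(effect, ci_lower, ci_upper, mde, guardrails_ok, sample_frac):
    """Map an experiment result to a traffic-light recommendation.

    effect, ci_lower, ci_upper, mde: relative lifts in percent.
    sample_frac: achieved sample as a fraction of the target.
    """
    if not guardrails_ok or ci_upper < 0:
        return "red"      # guardrail violation or clearly negative effect
    if (ci_lower > 0 and sample_frac >= 1.0
            and (effect > mde or ci_lower > mde)):
        return "green"    # positive, big enough, and enough data
    return "yellow"       # inconclusive: CI spans zero or sample too thin

print(classify(8.1, 3.2, 13.0, 5.0, True, 125 / 120))   # green
print(classify(4.9, -2.1, 8.9, 5.0, True, 45 / 80))     # yellow
print(classify(6.2, 2.1, 10.3, 5.0, False, 1.0))        # red
```

Anything that is neither a clean green nor a clean red deliberately falls through to yellow, which matches the spirit of the criteria: ambiguity should force a conversation, not a default ship.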
Implementation
def generate_readout(experiment_name, metric_name,
                     control_value, treatment_value,
                     ci_lower, ci_upper, p_value,
                     sample_size, target_sample,
                     mde, guardrails_ok=True):
    """
    Generate one-slide readout summary.

    control_value and treatment_value are proportions (e.g. 0.0418);
    ci_lower, ci_upper, and mde are relative lifts in percent.
    """
    # Relative lift in percent
    lift = (treatment_value - control_value) / control_value * 100

    # Determine recommendation from CI bounds, MDE, and guardrails
    if ci_lower > 0 and guardrails_ok:
        if ci_lower > mde:
            recommendation = "🟢 SHIP (high confidence)"
        else:
            recommendation = "🟢 SHIP"
    elif ci_upper < 0:
        recommendation = "🔴 DON'T SHIP"
    elif not guardrails_ok:
        recommendation = "🔴 DON'T SHIP (guardrail violation)"
    elif ci_lower < 0 < ci_upper:
        recommendation = "🟡 INCONCLUSIVE"
    else:
        recommendation = "🟡 NEEDS REVIEW"

    readout = f"""
{'='*60}
EXPERIMENT: {experiment_name}
{'='*60}
RECOMMENDATION: {recommendation}
PRIMARY METRIC: {metric_name}
Treatment: {treatment_value:.2%}  Control: {control_value:.2%}
LIFT: {lift:+.1f}%
95% CI: {ci_lower:+.1f}% to {ci_upper:+.1f}%
p-value: {p_value:.4f}
Success threshold (MDE): {mde:+.1f}%
Sample: {sample_size:,} users ({sample_size/target_sample*100:.0f}% of target)
GUARDRAILS: {'✅ OK' if guardrails_ok else '❌ VIOLATION'}
{'='*60}
"""
    return readout

# Example: Successful experiment
print(generate_readout(
    experiment_name="New Checkout Flow v2",
    metric_name="Conversion Rate",
    control_value=0.0418,
    treatment_value=0.0452,
    ci_lower=3.2,
    ci_upper=13.0,
    p_value=0.001,
    sample_size=125000,
    target_sample=120000,
    mde=5.0,
    guardrails_ok=True
))

# Example: Inconclusive experiment
print(generate_readout(
    experiment_name="Homepage Redesign",
    metric_name="Click-through Rate",
    control_value=0.082,
    treatment_value=0.086,
    ci_lower=-2.1,
    ci_upper=8.9,
    p_value=0.23,
    sample_size=45000,
    target_sample=80000,
    mde=5.0,
    guardrails_ok=True
))
Handling Different Scenarios
Scenario 1: Clear Win
RECOMMENDATION: 🟢 SHIP
LIFT: +12.3% (95% CI: +7.1% to +17.5%, p < 0.001)
MDE was +5%, CI entirely above it.
GUARDRAILS: All stable
CAVEAT: None significant
Scenario 2: Clear Loss
RECOMMENDATION: 🔴 DON'T SHIP
LIFT: -8.2% (95% CI: -14.1% to -2.3%, p = 0.006)
Significant negative effect on conversion.
GUARDRAILS: N/A (primary metric failed)
CAVEAT: Error rate also increased +15%
Scenario 3: Inconclusive
RECOMMENDATION: 🟡 INCONCLUSIVE
LIFT: +4.1% (95% CI: -1.8% to +10.0%, p = 0.17)
Cannot rule out negative effect; cannot confirm positive.
OPTIONS:
A) Extend 2 weeks for conclusive result
B) Ship now, accepting roughly a 9% chance the true effect is zero or
   negative (and about a 60% chance it falls short of the +5% MDE)
C) Abandon and reallocate resources
RECOMMENDATION: Option A (extend)
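The risk quoted in option B can be derived from the reported CI under a normal approximation, by backing the standard error out of the interval width. A sketch; `prob_below` is a hypothetical helper, not a library function:

```python
from math import erf, sqrt

def prob_below(threshold, estimate, ci_lower, ci_upper, z=1.96):
    """P(true effect < threshold), treating the estimate as normal
    and backing the standard error out of the reported 95% CI."""
    se = (ci_upper - ci_lower) / (2 * z)
    score = (threshold - estimate) / se
    return 0.5 * (1 + erf(score / sqrt(2)))  # standard normal CDF

# Scenario 3: +4.1% lift, 95% CI (-1.8%, +10.0%)
risk_negative = prob_below(0.0, 4.1, -1.8, 10.0)   # chance of zero/negative effect
risk_below_mde = prob_below(5.0, 4.1, -1.8, 10.0)  # chance of missing the +5% MDE
print(f"{risk_negative:.0%} / {risk_below_mde:.0%}")
```

Note that a two-sided p of 0.17 corresponds here to roughly a 9% one-sided chance that the true effect is zero or negative; quoting the two-sided p as the "risk of shipping" overstates it by about double.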
Scenario 4: Guardrail Failure
RECOMMENDATION: 🔴 DON'T SHIP (guardrail)
PRIMARY: +6.2% (95% CI: +2.1% to +10.3%, p = 0.003)
Primary metric positive and significant.
GUARDRAIL VIOLATION:
❌ Error rate: +45% (threshold: <10% increase)
CAVEAT: Conversion gains may be driven by users who
couldn't complete purchase due to errors.
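A guardrail evaluation like the one above reduces to a threshold check. A minimal sketch, with hypothetical metric names and thresholds expressed as the maximum tolerated percent increase (for metrics where an increase is bad):

```python
def check_guardrails(observed, thresholds):
    """Return the guardrail metrics whose observed % change exceeds
    the maximum tolerated increase (an increase is bad for these)."""
    return {name: change for name, change in observed.items()
            if change > thresholds.get(name, float("inf"))}

# Scenario 4: error rate up 45% against a <10% allowance (illustrative numbers)
violations = check_guardrails(
    {"error_rate": 45.0, "latency_p95": 3.2},
    {"error_rate": 10.0, "latency_p95": 5.0},
)
print(violations)  # {'error_rate': 45.0}
```

Any non-empty result should force the recommendation to red regardless of the primary metric, exactly as in Scenario 4.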
Backup Slides
Slide 2: Secondary Metrics
┌─────────────────────────────────────────────────────────────┐
│ SECONDARY METRICS │
├────────────────┬──────────┬────────────────────┬───────────┤
│ Metric │ Lift │ 95% CI │ p-value │
├────────────────┼──────────┼────────────────────┼───────────┤
│ Revenue/User │ +5.2% │ (-1.8%, +12.2%) │ 0.14 │
│ Sessions/User │ +2.1% │ (-0.5%, +4.7%) │ 0.11 │
│ Time on Site │ -3.4% │ (-7.2%, +0.4%) │ 0.08 │
│ Bounce Rate │ -1.2% │ (-4.1%, +1.7%) │ 0.42 │
├────────────────┴──────────┴────────────────────┴───────────┤
│ │
│ NOTE: No secondary metrics significant. Directionally │
│ consistent with primary metric. │
└─────────────────────────────────────────────────────────────┘
Slide 3: Segment Analysis
┌─────────────────────────────────────────────────────────────┐
│ SEGMENT ANALYSIS (Pre-specified) │
├────────────────┬──────────┬────────────────────┬───────────┤
│ Segment │ Lift │ 95% CI │ p-value │
├────────────────┼──────────┼────────────────────┼───────────┤
│ Desktop │ +11.2% │ (+5.4%, +17.0%) │ <0.001 │
│ Mobile │ +1.3% │ (-5.2%, +7.8%) │ 0.69 │
│ New Users │ +9.8% │ (+3.1%, +16.5%) │ 0.004 │
│ Returning │ +6.4% │ (+0.8%, +12.0%) │ 0.03 │
├────────────────┴──────────┴────────────────────┴───────────┤
│ │
│ NOTE: Effect appears desktop-driven. Mobile showed no │
│ significant effect (test for interaction: p = 0.08). │
│ Consider mobile-specific follow-up. │
└─────────────────────────────────────────────────────────────┘
Slide 4: Methodology
┌─────────────────────────────────────────────────────────────┐
│ METHODOLOGY │
├─────────────────────────────────────────────────────────────┤
│ Design: User-level randomization, 50/50 split │
│ Period: Jan 15 - Jan 28, 2026 (14 days) │
│ Analysis: Two-proportion z-test, α = 0.05 │
│ Pre-registration: [Link to doc] │
│ │
│ Data Quality: │
│ ✅ SRM check passed (p = 0.41) │
│ ✅ No assignment bugs detected │
│ ✅ Exposure logging validated │
│ │
│ Deviations from Plan: None │
└─────────────────────────────────────────────────────────────┘
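The SRM check on the methodology slide is a one-degree-of-freedom chi-square goodness-of-fit test on the assignment counts. A sketch, with hypothetical counts chosen to roughly match the slide's p = 0.41 for a 125,000-user, 50/50 experiment:

```python
from math import erfc, sqrt

def srm_check(n_control, n_treatment, expected_ratio=0.5):
    """Chi-square goodness-of-fit test (1 df) for sample ratio mismatch."""
    total = n_control + n_treatment
    exp_c = total * expected_ratio
    exp_t = total - exp_c
    chi2 = ((n_control - exp_c) ** 2 / exp_c
            + (n_treatment - exp_t) ** 2 / exp_t)
    p = erfc(sqrt(chi2 / 2))  # survival function of the chi-square with 1 df
    return chi2, p

chi2, p = srm_check(62_646, 62_354)  # hypothetical split of 125,000 users
print(f"chi2 = {chi2:.2f}, p = {p:.2f}")  # a small imbalance like this is fine
```

Teams commonly alarm only when the SRM p-value is very small (around 0.001 or below); a low p means assignment itself is skewed and the readout should not be trusted.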
Verbal Presentation Tips
The 30-Second Version
"We tested the new checkout flow and recommend shipping. Conversion increased 8%, well above our 5% threshold, and we're confident the true effect is at least 3%. The main caveat: gains are desktop-driven; mobile was flat. Any questions?"
Anticipate These Questions
| Question | Be Ready With |
|---|---|
| "Is this real?" | p-value, sample size, SRM check |
| "How much could it be?" | Full confidence interval |
| "What about [other metric]?" | Secondary metrics table |
| "What about mobile?" | Segment breakdown |
| "What could go wrong?" | Caveats, guardrails |
Related Methods
- Analytics Reporting (Pillar) - Full reporting guide
- Communicating Uncertainty - Stakeholder strategies
- When to Say Inconclusive - Decision rules
- Writing Methods Sections - Documentation
Key Takeaway
The one-slide readout is about decisions, not statistics. Lead with your recommendation (ship/don't ship/inconclusive), show five key numbers (effect, CI bounds, p-value, sample), note one caveat, and stop. Executives will ask questions if they want more—that's what backup slides are for. If you can't fit your conclusion on one slide, you're either overcomplicating the presentation or the results genuinely need more nuance (which itself is a finding worth stating clearly).
Frequently Asked Questions
What if I have multiple metrics?
Lead with the single pre-specified primary metric on the main slide. Secondary metrics belong in a backup table; surface them only if they conflict with the primary result.
How do I present inconclusive results?
Lead with the yellow recommendation, show the CI spanning zero, and lay out explicit options (extend, ship with stated risk, abandon) along with the one you recommend.
Should I include statistical details?
Not on the main slide. The test used, standard errors, and randomization details go in a methodology backup slide for Q&A.