Analytics Reporting That Doesn't Get You Killed in Review
How to communicate statistical results to stakeholders without getting destroyed in review. Templates, common mistakes, and strategies for building trust through transparency.
Quick Hits
- Lead with the answer, not the methodology
- Report confidence intervals, not just point estimates
- Pre-register your analysis plan to avoid accusations of p-hacking
- Know when to say 'inconclusive' instead of forcing a narrative
- Document everything; future you will thank present you
TL;DR
Analytics reporting is about building trust, not winning arguments. Lead with findings (not methods), quantify uncertainty (confidence intervals, not just point estimates), pre-register your analysis plan (avoid p-hacking accusations), and have the courage to report inconclusive results honestly. This guide covers templates, common mistakes, and strategies for creating reports that survive scrutiny and build your credibility over time.
The Stakes
What Happens in Review
Your analysis will be questioned on:
- Methodology: Did you analyze this correctly?
- Integrity: Did you p-hack or cherry-pick?
- Completeness: Did you look at what matters?
- Interpretation: Are your conclusions justified?
What Builds Trust
| Destroys Trust | Builds Trust |
|---|---|
| Hiding uncertainty | Quantifying uncertainty |
| Post-hoc rationalization | Pre-registered plans |
| Overstating conclusions | Measured, qualified claims |
| Missing obvious questions | Anticipating objections |
| Defensive reactions | Welcoming scrutiny |
Part 1: Structure Your Reports
The Inverted Pyramid
┌─────────────────────────────────────────┐
│ EXECUTIVE SUMMARY │ ← Answer first
│ (1-2 sentences, key finding) │
├─────────────────────────────────────────┤
│ KEY METRICS │ ← The numbers
│ (with confidence intervals) │
├─────────────────────────────────────────┤
│ SUPPORTING ANALYSIS │ ← Context and nuance
│ (segments, sensitivity, guardrails) │
├─────────────────────────────────────────┤
│ METHODOLOGY │ ← For the skeptics
│ (how you did it, assumptions) │
├─────────────────────────────────────────┤
│ APPENDIX │ ← Deep details
│ (data checks, full tables, code) │
└─────────────────────────────────────────┘
Executive Summary Template
## Executive Summary
**Result**: [Treatment/Feature/Change] [increased/decreased/had no significant effect on]
[metric] by [X]% (95% CI: [Y]% to [Z]%).
**Recommendation**: [Ship/Don't ship/Collect more data] because [one sentence justification].
**Key caveat**: [One sentence on the most important limitation or uncertainty].
Example:
Result: The new checkout flow increased conversion rate by 3.2% (95% CI: 1.1% to 5.3%, p < 0.01).
Recommendation: Ship to all users based on significant lift exceeding our 2% MDE threshold.
Key caveat: Revenue per user showed no significant change; monitor post-launch.
Part 2: Quantify Uncertainty
Always Report Confidence Intervals
def format_result(estimate, ci_lower, ci_upper, p_value, metric_name):
    """
    Format a result with proper uncertainty reporting.
    """
    # Significant if the CI excludes zero
    significant = ci_lower > 0 or ci_upper < 0

    result = f"{metric_name}: {estimate:+.1%} (95% CI: {ci_lower:+.1%} to {ci_upper:+.1%})"
    if p_value is not None:
        result += f", p = {p_value:.3f}"
    if significant:
        result += " *"
    return result

# Example
results = [
    ("Conversion Rate", 0.032, 0.011, 0.053, 0.003),
    ("Revenue per User", 0.018, -0.012, 0.048, 0.24),
    ("Page Load Time", -0.08, -0.15, -0.01, 0.02),
]

print("Experiment Results")
print("=" * 60)
for metric, est, ci_l, ci_u, p in results:
    print(format_result(est, ci_l, ci_u, p, metric))
print("\n* Statistically significant at α = 0.05")
Avoid These Patterns
❌ Bad: "Conversion increased by 3.2%" ✅ Good: "Conversion increased by 3.2% (95% CI: 1.1% to 5.3%)"
❌ Bad: "p < 0.05, so it's significant" ✅ Good: "p = 0.003, CI excludes zero, effect size exceeds our MDE"
❌ Bad: "The experiment was successful" ✅ Good: "The experiment detected a significant positive effect on conversion; revenue was inconclusive"
Part 3: Pre-Registration
Why Pre-Register
Pre-registration documents your analysis plan before seeing results:
- Prevents accusations of p-hacking
- Forces clear thinking about metrics and methods
- Creates accountability
- Builds credibility over time
Lightweight Pre-Registration Template
# Experiment Pre-Analysis Plan
## Experiment Info
- **Name**: New Checkout Flow v2
- **Start Date**: 2026-01-15
- **Planned End Date**: 2026-01-29 (or when powered)
- **Author**: [Your name]
- **Date Written**: 2026-01-14
## Hypothesis
The simplified checkout flow will increase conversion rate by reducing friction.
## Primary Metric
- **Metric**: Purchase conversion rate (purchases / sessions with cart)
- **MDE**: 2% relative lift
- **Analysis**: Two-proportion z-test, α = 0.05
## Secondary Metrics
1. Revenue per user
2. Cart abandonment rate
3. Time to purchase
## Guardrail Metrics
1. Page errors (should not increase)
2. Support tickets (should not increase)
## Sample Size
- Required: 50,000 per group (calculated at 80% power; see the sample-size sketch after this plan)
- Decision rule: Run until powered or 14 days, whichever comes first
## Analysis Plan
1. Check for a sample ratio mismatch (SRM) before analyzing results
2. Report ITT (intent-to-treat) as primary analysis
3. Segment analysis: new vs. returning users (pre-specified)
## What Constitutes Success
- Primary metric CI lower bound > 0
- No guardrail metric degradation
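The 50,000-per-group figure depends on a baseline conversion rate the plan does not state. A minimal sketch of the standard two-proportion sample-size approximation, assuming a hypothetical 45% cart-to-purchase baseline, lands in the same ballpark:

from scipy import stats

def required_n_per_group(p_base, rel_mde, alpha=0.05, power=0.80):
    """Approximate n per group for a two-proportion z-test."""
    p_treat = p_base * (1 + rel_mde)
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    z_beta = stats.norm.ppf(power)
    variance = p_base * (1 - p_base) + p_treat * (1 - p_treat)
    return (z_alpha + z_beta) ** 2 * variance / (p_treat - p_base) ** 2

# Assumed 45% baseline conversion (not stated in the plan)
print(round(required_n_per_group(p_base=0.45, rel_mde=0.02)))  # ≈ 48,000 per group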
What to Lock Down
| Always Pre-Specify | Can Be Exploratory |
|---|---|
| Primary metric | Additional segments |
| Sample size / stopping rule | Secondary correlations |
| Success criteria | Interaction effects |
| Guardrails | Post-hoc investigations |
| Key segments | Subgroup deep-dives |
Part 4: Handle Inconclusive Results
When Results Are Inconclusive
An experiment is inconclusive when:
- CI includes both positive and negative effects of interest
- Sample size was insufficient
- External factors corrupted the experiment
How to Report Inconclusive
## Result: Inconclusive
The experiment did not reach a conclusive result.
### What We Observed
- Conversion: +2.1% (95% CI: -1.8% to +6.0%)
- The point estimate is positive, but the CI includes both zero and our 2% MDE
### Why Inconclusive
- Achieved 65% of planned sample size due to early feature freeze
- Would need ~3 more weeks to achieve 80% power
### Recommendation
**Option A**: Extend experiment for 3 weeks
- Pro: Conclusive result
- Con: Delays roadmap
**Option B**: Ship based on directionally positive results
- Pro: Ship now
- Con: roughly a 15% chance the true effect is negative, given the reported CI (see the sketch below)
**Option C**: Abandon and move on
- Pro: Free up resources
- Con: Miss potential 2-6% lift
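The ~15% figure in Option B can be read off the reported interval by treating it as an approximate normal distribution for the true effect (a flat-prior heuristic, not a full Bayesian analysis). A sketch:

from scipy import stats

# Treat the reported CI as an approximate normal distribution for the true effect
estimate, ci_lower, ci_upper = 0.021, -0.018, 0.060
se = (ci_upper - ci_lower) / (2 * 1.96)
p_negative = stats.norm.cdf(0, loc=estimate, scale=se)
print(f"P(true effect < 0) ≈ {p_negative:.0%}")  # ≈ 15%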
The Courage to Say "I Don't Know"
| Temptation | Better Response |
|---|---|
| "Results were positive" (but not significant) | "Results were directionally positive but inconclusive" |
| "No significant difference" (framing as no effect) | "We could not detect a difference with available data" |
| "The treatment works" (based on p = 0.06) | "Suggestive evidence but below significance threshold" |
Part 5: Common Mistakes and How to Avoid Them
Mistake 1: Multiple Testing Without Correction
def demonstrate_multiple_testing(n_metrics=10, alpha=0.05):
    """
    Show how uncorrected multiple testing inflates false positives.
    """
    expected_fp = n_metrics * alpha
    p_any_fp = 1 - (1 - alpha) ** n_metrics
    print("Multiple Testing Problem")
    print("=" * 50)
    print(f"\nYou test {n_metrics} metrics. None have real effects.")
    print(f"At α = {alpha}, expected false positives: {expected_fp:.1f}")
    print(f"Probability of at least one false positive: {p_any_fp:.0%}")
    print("\nSolution: Pre-specify primary metric, correct for others")
    print("          Or clearly label as 'exploratory'")

demonstrate_multiple_testing()
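For the secondary metrics, one common remedy is to adjust their p-values while leaving the single pre-specified primary metric at α = 0.05. A sketch using statsmodels; the p-values below are placeholders, not real results:

from statsmodels.stats.multitest import multipletests

# Hypothetical p-values for secondary metrics only; the primary metric
# keeps its pre-specified alpha = 0.05 and is not corrected here
secondary_p = {"revenue_per_user": 0.24, "cart_abandonment": 0.04, "time_to_purchase": 0.01}

reject, p_adj, _, _ = multipletests(list(secondary_p.values()), alpha=0.05, method="holm")
for name, p_raw, p_corr, sig in zip(secondary_p, secondary_p.values(), p_adj, reject):
    print(f"{name}: p = {p_raw:.3f}, Holm-adjusted p = {p_corr:.3f}, significant: {sig}")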
Mistake 2: Cherry-Picking Segments
The Pattern:
- Overall result is flat
- Find a segment where treatment wins
- Report segment result as main finding
The Fix:
- Pre-specify segments of interest
- Report all pre-specified segments, not just winners
- Label post-hoc findings as "exploratory"
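A quick null simulation shows why segment hunting finds "winners": with enough segments and no true effect anywhere, some segment will clear p < 0.05 by chance. The segment count and sample sizes below are arbitrary:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_segments, n_per_arm, base_rate = 20, 2000, 0.10  # no true treatment effect anywhere

false_positives = 0
for _ in range(n_segments):
    conv_c = rng.binomial(n_per_arm, base_rate)
    conv_t = rng.binomial(n_per_arm, base_rate)
    # Two-proportion z-test for this segment
    p_pool = (conv_c + conv_t) / (2 * n_per_arm)
    se = (2 * p_pool * (1 - p_pool) / n_per_arm) ** 0.5
    z = (conv_t - conv_c) / n_per_arm / se
    false_positives += 2 * stats.norm.sf(abs(z)) < 0.05

print(f"{false_positives} of {n_segments} null segments reach p < 0.05 by chance")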
Mistake 3: Ignoring Guardrails
Bad: "Conversion up 5%! Ship it!" Good: "Conversion up 5%, but revenue flat and support tickets up 20%. Investigate before shipping."
Mistake 4: Over-Interpreting Segment Differences
def segment_comparison_warning():
    """
    Comparing segments requires testing the interaction.
    """
    print("Segment Comparison Warning")
    print("=" * 50)
    print("\nYou observe:")
    print("  Mobile:  +5% (significant)")
    print("  Desktop: +2% (not significant)")
    print("\n❌ Wrong conclusion: 'Treatment works better on mobile'")
    print("✅ Right approach: Test the interaction (is the 5% vs 2% difference itself significant?)")
    print("   Often, both are consistent with a ~3.5% overall effect")

segment_comparison_warning()
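A sketch of the interaction check the warning calls for: compare the two segment effects directly, pooling their standard errors. The standard errors below are hypothetical, since the example above only gives point estimates:

from scipy import stats

def interaction_z_test(effect_a, se_a, effect_b, se_b):
    """Test whether two independently estimated segment effects differ."""
    z = (effect_a - effect_b) / (se_a**2 + se_b**2) ** 0.5
    return z, 2 * stats.norm.sf(abs(z))

# Hypothetical standard errors for the mobile and desktop estimates
z, p = interaction_z_test(effect_a=0.05, se_a=0.02, effect_b=0.02, se_b=0.018)
print(f"Mobile vs. desktop difference: z = {z:.2f}, p = {p:.2f}")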
Part 6: Documentation and Audit Trails
What to Document
# Analysis Log: Checkout Experiment
## Data Pull
- **Date**: 2026-01-30
- **Query**: `checkout_experiment_v2.sql`
- **Rows**: 145,892 sessions
- **Filters applied**: Excluded bot traffic (user_agent filter)
## Data Quality Checks
- [x] SRM check: Control 50.2%, Treatment 49.8% (p = 0.34, OK)
- [x] No duplicate user assignments
- [x] Exposure dates match experiment config
## Analysis Decisions
1. Excluded users with 0 sessions (n=234) — no exposure
2. Used ITT (intent-to-treat) — assigned, not exposed
3. Winsorized revenue at 99th percentile — 3 outliers
## Deviations from Pre-Analysis Plan
- Originally planned 14 days, extended to 16 due to holiday traffic
- Added post-hoc analysis of mobile vs desktop (labeled exploratory)
## Reviewer Notes
- [2026-01-31] @reviewer: Confirmed SRM check independently
- [2026-02-01] @analyst: Added sensitivity without Winsorization
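The SRM line in the checklist can be reproduced mechanically with a chi-square goodness-of-fit test against the intended 50/50 split. A sketch with made-up assignment counts (not the counts behind the log entry above); the α = 0.001 threshold is a common conservative convention for SRM checks, not something taken from this log:

from scipy import stats

def srm_check(n_control, n_treatment, expected_ratio=0.5, alpha=0.001):
    """Chi-square test of observed assignment counts vs. the intended split."""
    total = n_control + n_treatment
    expected = [total * expected_ratio, total * (1 - expected_ratio)]
    chi2, p = stats.chisquare([n_control, n_treatment], f_exp=expected)
    return p, p < alpha  # a very small p suggests a sample ratio mismatch

# Hypothetical counts for illustration
p, srm_detected = srm_check(n_control=50300, n_treatment=49700)
print(f"SRM check: p = {p:.3f}, mismatch flagged: {srm_detected}")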
Version Control for Analyses
# Good practices
git commit -m "Initial analysis: checkout experiment results"
git commit -m "Add sensitivity analysis per reviewer request"
git commit -m "Fix: correct date filter (was off by 1 day)"
# Tag final versions
git tag -a "checkout-exp-final" -m "Final analysis shared with leadership"
Part 7: Presentation Strategies
Know Your Audience
| Audience | They Care About | Give Them |
|---|---|---|
| Executives | Decision, impact | 1-slide summary, recommendation |
| PMs | Details, tradeoffs | Full report, segment breakdowns |
| Data scientists | Methods, rigor | Appendix, code, methodology |
| Engineers | Implementation | What to build, edge cases |
One-Slide Summary Template
┌─────────────────────────────────────────────────────────────┐
│ EXPERIMENT: [Name] [Date] │
├─────────────────────────────────────────────────────────────┤
│ RESULT: [Win / Loss / Inconclusive] │
│ │
│ PRIMARY METRIC: [Name] │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Treatment: [X]% Control: [Y]% Δ: [Z]% │ │
│ │ 95% CI: [A]% to [B]% p = [C] │ │
│ └─────────────────────────────────────────────────────┘ │
│ │
│ GUARDRAILS: ✓ No degradation detected │
│ │
│ RECOMMENDATION: [Ship / Don't Ship / Extend] │
│ CONFIDENCE: [High / Medium / Low] │
│ │
│ KEY CAVEAT: [One sentence] │
└─────────────────────────────────────────────────────────────┘
Part 8: Building Long-Term Credibility
The Credibility Flywheel
Pre-register analysis
↓
Run analysis as planned
↓
Report honestly (including negatives)
↓
Welcome scrutiny
↓
Iterate and improve methods
↓
Build reputation for integrity
↓
Future analyses trusted more easily
↓
(Back to pre-register)
Practices That Build Trust
- Admit mistakes quickly: "I found an error in yesterday's analysis..."
- Show your uncertainty: "I'm 60% confident in this interpretation"
- Anticipate objections: "You might ask about X—here's what I found"
- Credit others: "Building on [colleague]'s earlier analysis..."
- Follow up on predictions: "Last quarter I predicted X; actual was Y"
Related Articles
Specific Topics
- One-Slide Experiment Readout Template - Presentation format
- Common Analyst Mistakes - What to avoid
- Writing Methods Sections - Documentation guide
- Communicating Uncertainty - Stakeholder communication
Process
- Pre-Registration Lite - Practical pre-reg
- Audit Trails - Documentation practices
- When to Say Inconclusive - Decision rules
- Experiment Guardrails - Protecting quality
Key Takeaway
Analytics reporting succeeds through transparency, not persuasion. Lead with your finding and recommendation, quantify uncertainty with confidence intervals, pre-register your analysis plan, document your decisions, and have the courage to report inconclusive results honestly. The goal isn't to survive one review—it's to build a reputation where your analyses are trusted because you've consistently been honest about what the data shows and doesn't show.
Frequently Asked Questions
How much methodology should I include?
Enough for a skeptic to verify your work without burying the finding: a short methodology section after the supporting analysis, with data checks, full tables, and code in the appendix, tuned to the audience (executives get the one-slide summary, data scientists get the appendix).
What if my results are inconclusive?
Say so plainly. Report the point estimate with its confidence interval, explain why the result is inconclusive (for example, an underpowered sample), and lay out the options and their tradeoffs as in Part 4.
How do I handle pushback on methodology?
Welcome it. Point to your pre-analysis plan and analysis log, run the sensitivity analyses reviewers ask for, and record both the request and the result.