Analytics Reporting That Doesn't Get You Killed in Review
How to communicate statistical results to stakeholders without getting destroyed in review. Templates, common mistakes, and strategies for building trust through transparency.
Quick Hits
- Lead with the answer, not the methodology
- Report confidence intervals, not just point estimates
- Pre-register your analysis plan to avoid accusations of p-hacking
- Know when to say 'inconclusive' instead of forcing a narrative
- Document everything; future you will thank present you
TL;DR
Analytics reporting is about building trust, not winning arguments. Lead with findings (not methods), quantify uncertainty (confidence intervals, not just point estimates), pre-register your analysis plan (avoid p-hacking accusations), and have the courage to report inconclusive results honestly. This guide covers templates, common mistakes, and strategies for creating reports that survive scrutiny and build your credibility over time.
The Stakes
What Happens in Review
Your analysis will be questioned on:
- Methodology: Did you analyze this correctly?
- Integrity: Did you p-hack or cherry-pick?
- Completeness: Did you look at what matters?
- Interpretation: Are your conclusions justified?
What Builds Trust
| Destroys Trust | Builds Trust |
|---|---|
| Hiding uncertainty | Quantifying uncertainty |
| Post-hoc rationalization | Pre-registered plans |
| Overstating conclusions | Measured, qualified claims |
| Missing obvious questions | Anticipating objections |
| Defensive reactions | Welcoming scrutiny |
Part 1: Structure Your Reports
The Inverted Pyramid
┌─────────────────────────────────────────┐
│ EXECUTIVE SUMMARY │ ← Answer first
│ (1-2 sentences, key finding) │
├─────────────────────────────────────────┤
│ KEY METRICS │ ← The numbers
│ (with confidence intervals) │
├─────────────────────────────────────────┤
│ SUPPORTING ANALYSIS │ ← Context and nuance
│ (segments, sensitivity, guardrails) │
├─────────────────────────────────────────┤
│ METHODOLOGY │ ← For the skeptics
│ (how you did it, assumptions) │
├─────────────────────────────────────────┤
│ APPENDIX │ ← Deep details
│ (data checks, full tables, code) │
└─────────────────────────────────────────┘
Executive Summary Template
## Executive Summary
**Result**: [Treatment/Feature/Change] [increased/decreased/had no significant effect on]
[metric] by [X]% (95% CI: [Y]% to [Z]%).
**Recommendation**: [Ship/Don't ship/Collect more data] because [one sentence justification].
**Key caveat**: [One sentence on the most important limitation or uncertainty].
Example:
Result: The new checkout flow increased conversion rate by 3.2% (95% CI: 1.1% to 5.3%, p < 0.01).
Recommendation: Ship to all users based on significant lift exceeding our 2% MDE threshold.
Key caveat: Revenue per user showed no significant change; monitor post-launch.
Part 2: Quantify Uncertainty
Always Report Confidence Intervals
def format_result(estimate, ci_lower, ci_upper, p_value, metric_name):
    """
    Format a result with proper uncertainty reporting.
    """
    # Significant if the CI excludes zero
    significant = ci_lower > 0 or ci_upper < 0

    result = f"{metric_name}: {estimate:+.1%} (95% CI: {ci_lower:+.1%} to {ci_upper:+.1%})"
    if p_value is not None:
        result += f", p = {p_value:.3f}"
    if significant:
        result += " *"
    return result

# Example
results = [
    ("Conversion Rate", 0.032, 0.011, 0.053, 0.003),
    ("Revenue per User", 0.018, -0.012, 0.048, 0.24),
    ("Page Load Time", -0.08, -0.15, -0.01, 0.02),
]

print("Experiment Results")
print("=" * 60)
for metric, est, ci_l, ci_u, p in results:
    print(format_result(est, ci_l, ci_u, p, metric))
print("\n* Statistically significant at α = 0.05")
Avoid These Patterns
❌ Bad: "Conversion increased by 3.2%" ✅ Good: "Conversion increased by 3.2% (95% CI: 1.1% to 5.3%)"
❌ Bad: "p < 0.05, so it's significant" ✅ Good: "p = 0.003, CI excludes zero, effect size exceeds our MDE"
❌ Bad: "The experiment was successful" ✅ Good: "The experiment detected a significant positive effect on conversion; revenue was inconclusive"
Part 3: Pre-Registration
Why Pre-Register
Pre-registration documents your analysis plan before seeing results:
- Prevents accusations of p-hacking
- Forces clear thinking about metrics and methods
- Creates accountability
- Builds credibility over time
Lightweight Pre-Registration Template
# Experiment Pre-Analysis Plan
## Experiment Info
- **Name**: New Checkout Flow v2
- **Start Date**: 2026-01-15
- **Planned End Date**: 2026-01-29 (or when powered)
- **Author**: [Your name]
- **Date Written**: 2026-01-14
## Hypothesis
The simplified checkout flow will increase conversion rate by reducing friction.
## Primary Metric
- **Metric**: Purchase conversion rate (purchases / sessions with cart)
- **MDE**: 2% relative lift
- **Analysis**: Two-proportion z-test, α = 0.05
## Secondary Metrics
1. Revenue per user
2. Cart abandonment rate
3. Time to purchase
## Guardrail Metrics
1. Page errors (should not increase)
2. Support tickets (should not increase)
## Sample Size
- Required: 50,000 per group (calculated at 80% power; see the sample-size sketch after this plan)
- Decision rule: Run until powered or 14 days, whichever comes first
## Analysis Plan
1. Check for a sample ratio mismatch (SRM) before analyzing results
2. Report ITT (intent-to-treat) as primary analysis
3. Segment analysis: new vs. returning users (pre-specified)
## What Constitutes Success
- Primary metric CI lower bound > 0
- No guardrail metric degradation
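The 50,000-per-group figure depends on a baseline conversion rate the plan does not state. A minimal sketch of the standard two-proportion sample-size approximation, assuming a hypothetical 45% cart-to-purchase baseline, lands in the same ballpark:

from scipy import stats

def required_n_per_group(p_base, rel_mde, alpha=0.05, power=0.80):
    """Approximate n per group for a two-proportion z-test."""
    p_treat = p_base * (1 + rel_mde)
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    z_beta = stats.norm.ppf(power)
    variance = p_base * (1 - p_base) + p_treat * (1 - p_treat)
    return (z_alpha + z_beta) ** 2 * variance / (p_treat - p_base) ** 2

# Assumed 45% baseline conversion (not stated in the plan)
print(round(required_n_per_group(p_base=0.45, rel_mde=0.02)))  # ≈ 48,000 per group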
What to Lock Down
| Always Pre-Specify | Can Be Exploratory |
|---|---|
| Primary metric | Additional segments |
| Sample size / stopping rule | Secondary correlations |
| Success criteria | Interaction effects |
| Guardrails | Post-hoc investigations |
| Key segments | Subgroup deep-dives |
Part 4: Handle Inconclusive Results
When Results Are Inconclusive
An experiment is inconclusive when:
- CI includes both positive and negative effects of interest
- Sample size was insufficient
- External factors corrupted the experiment
How to Report Inconclusive
## Result: Inconclusive
The experiment did not reach a conclusive result.
### What We Observed
- Conversion: +2.1% (95% CI: -1.8% to +6.0%)
- The point estimate is positive, but the CI includes both zero and our 2% MDE
### Why Inconclusive
- Achieved 65% of planned sample size due to early feature freeze
- Would need ~3 more weeks to achieve 80% power
### Recommendation
**Option A**: Extend experiment for 3 weeks
- Pro: Conclusive result
- Con: Delays roadmap
**Option B**: Ship based on directionally positive results
- Pro: Ship now
- Con: roughly a 15% chance the true effect is negative, given the reported CI (see the sketch below)
**Option C**: Abandon and move on
- Pro: Free up resources
- Con: Miss potential 2-6% lift
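The ~15% figure in Option B can be read off the reported interval by treating it as an approximate normal distribution for the true effect (a flat-prior heuristic, not a full Bayesian analysis). A sketch:

from scipy import stats

# Treat the reported CI as an approximate normal distribution for the true effect
estimate, ci_lower, ci_upper = 0.021, -0.018, 0.060
se = (ci_upper - ci_lower) / (2 * 1.96)
p_negative = stats.norm.cdf(0, loc=estimate, scale=se)
print(f"P(true effect < 0) ≈ {p_negative:.0%}")  # ≈ 15%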
The Courage to Say "I Don't Know"
| Temptation | Better Response |
|---|---|
| "Results were positive" (but not significant) | "Results were directionally positive but inconclusive" |
| "No significant difference" (framing as no effect) | "We could not detect a difference with available data" |
| "The treatment works" (based on p = 0.06) | "Suggestive evidence but below significance threshold" |
Part 5: Common Mistakes and How to Avoid Them
Mistake 1: Multiple Testing Without Correction
def demonstrate_multiple_testing(n_metrics=10, alpha=0.05):
    """
    Show how uncorrected multiple testing inflates false positives.
    """
    expected_fp = n_metrics * alpha
    p_any_fp = 1 - (1 - alpha) ** n_metrics
    print("Multiple Testing Problem")
    print("=" * 50)
    print(f"\nYou test {n_metrics} metrics. None have real effects.")
    print(f"At α = {alpha}, expected false positives: {expected_fp:.1f}")
    print(f"Probability of at least one false positive: {p_any_fp:.0%}")
    print("\nSolution: Pre-specify primary metric, correct for others")
    print("          Or clearly label as 'exploratory'")

demonstrate_multiple_testing()
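For the secondary metrics, one common remedy is to adjust their p-values while leaving the single pre-specified primary metric at α = 0.05. A sketch using statsmodels; the p-values below are placeholders, not real results:

from statsmodels.stats.multitest import multipletests

# Hypothetical p-values for secondary metrics only; the primary metric
# keeps its pre-specified alpha = 0.05 and is not corrected here
secondary_p = {"revenue_per_user": 0.24, "cart_abandonment": 0.04, "time_to_purchase": 0.01}

reject, p_adj, _, _ = multipletests(list(secondary_p.values()), alpha=0.05, method="holm")
for name, p_raw, p_corr, sig in zip(secondary_p, secondary_p.values(), p_adj, reject):
    print(f"{name}: p = {p_raw:.3f}, Holm-adjusted p = {p_corr:.3f}, significant: {sig}")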
Mistake 2: Cherry-Picking Segments
The Pattern:
- Overall result is flat
- Find a segment where treatment wins
- Report segment result as main finding
The Fix:
- Pre-specify segments of interest
- Report all pre-specified segments, not just winners
- Label post-hoc findings as "exploratory"
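A quick null simulation shows why segment hunting finds "winners": with enough segments and no true effect anywhere, some segment will clear p < 0.05 by chance. The segment count and sample sizes below are arbitrary:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_segments, n_per_arm, base_rate = 20, 2000, 0.10  # no true treatment effect anywhere

false_positives = 0
for _ in range(n_segments):
    conv_c = rng.binomial(n_per_arm, base_rate)
    conv_t = rng.binomial(n_per_arm, base_rate)
    # Two-proportion z-test for this segment
    p_pool = (conv_c + conv_t) / (2 * n_per_arm)
    se = (2 * p_pool * (1 - p_pool) / n_per_arm) ** 0.5
    z = (conv_t - conv_c) / n_per_arm / se
    false_positives += 2 * stats.norm.sf(abs(z)) < 0.05

print(f"{false_positives} of {n_segments} null segments reach p < 0.05 by chance")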
Mistake 3: Ignoring Guardrails
Bad: "Conversion up 5%! Ship it!" Good: "Conversion up 5%, but revenue flat and support tickets up 20%. Investigate before shipping."
Mistake 4: Over-Interpreting Segment Differences
def segment_comparison_warning():
    """
    Comparing segments requires testing the interaction.
    """
    print("Segment Comparison Warning")
    print("=" * 50)
    print("\nYou observe:")
    print("  Mobile:  +5% (significant)")
    print("  Desktop: +2% (not significant)")
    print("\n❌ Wrong conclusion: 'Treatment works better on mobile'")
    print("✅ Right approach: Test the interaction (is the 5% vs 2% difference itself significant?)")
    print("   Often, both are consistent with a ~3.5% overall effect")

segment_comparison_warning()
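A sketch of the interaction check the warning calls for: compare the two segment effects directly, pooling their standard errors. The standard errors below are hypothetical, since the example above only gives point estimates:

from scipy import stats

def interaction_z_test(effect_a, se_a, effect_b, se_b):
    """Test whether two independently estimated segment effects differ."""
    z = (effect_a - effect_b) / (se_a**2 + se_b**2) ** 0.5
    return z, 2 * stats.norm.sf(abs(z))

# Hypothetical standard errors for the mobile and desktop estimates
z, p = interaction_z_test(effect_a=0.05, se_a=0.02, effect_b=0.02, se_b=0.018)
print(f"Mobile vs. desktop difference: z = {z:.2f}, p = {p:.2f}")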
Part 6: Documentation and Audit Trails
What to Document
# Analysis Log: Checkout Experiment
## Data Pull
- **Date**: 2026-01-30
- **Query**: `checkout_experiment_v2.sql`
- **Rows**: 145,892 sessions
- **Filters applied**: Excluded bot traffic (user_agent filter)
## Data Quality Checks
- [x] SRM check: Control 50.2%, Treatment 49.8% (p = 0.34, OK)
- [x] No duplicate user assignments
- [x] Exposure dates match experiment config
## Analysis Decisions
1. Excluded users with 0 sessions (n=234) — no exposure
2. Used ITT (intent-to-treat) — assigned, not exposed
3. Winsorized revenue at 99th percentile — 3 outliers
## Deviations from Pre-Analysis Plan
- Originally planned 14 days, extended to 16 due to holiday traffic
- Added post-hoc analysis of mobile vs desktop (labeled exploratory)
## Reviewer Notes
- [2026-01-31] @reviewer: Confirmed SRM check independently
- [2026-02-01] @analyst: Added sensitivity without Winsorization
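The SRM line in the checklist can be reproduced mechanically with a chi-square goodness-of-fit test against the intended 50/50 split. A sketch with made-up assignment counts (not the counts behind the log entry above); the α = 0.001 threshold is a common conservative convention for SRM checks, not something taken from this log:

from scipy import stats

def srm_check(n_control, n_treatment, expected_ratio=0.5, alpha=0.001):
    """Chi-square test of observed assignment counts vs. the intended split."""
    total = n_control + n_treatment
    expected = [total * expected_ratio, total * (1 - expected_ratio)]
    chi2, p = stats.chisquare([n_control, n_treatment], f_exp=expected)
    return p, p < alpha  # a very small p suggests a sample ratio mismatch

# Hypothetical counts for illustration
p, srm_detected = srm_check(n_control=50300, n_treatment=49700)
print(f"SRM check: p = {p:.3f}, mismatch flagged: {srm_detected}")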
Version Control for Analyses
# Good practices
git commit -m "Initial analysis: checkout experiment results"
git commit -m "Add sensitivity analysis per reviewer request"
git commit -m "Fix: correct date filter (was off by 1 day)"
# Tag final versions
git tag -a "checkout-exp-final" -m "Final analysis shared with leadership"
Part 7: Presentation Strategies
Know Your Audience
| Audience | They Care About | Give Them |
|---|---|---|
| Executives | Decision, impact | 1-slide summary, recommendation |
| PMs | Details, tradeoffs | Full report, segment breakdowns |
| Data scientists | Methods, rigor | Appendix, code, methodology |
| Engineers | Implementation | What to build, edge cases |
One-Slide Summary Template
┌─────────────────────────────────────────────────────────────┐
│ EXPERIMENT: [Name] [Date] │
├─────────────────────────────────────────────────────────────┤
│ RESULT: [Win / Loss / Inconclusive] │
│ │
│ PRIMARY METRIC: [Name] │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Treatment: [X]% Control: [Y]% Δ: [Z]% │ │
│ │ 95% CI: [A]% to [B]% p = [C] │ │
│ └─────────────────────────────────────────────────────┘ │
│ │
│ GUARDRAILS: ✓ No degradation detected │
│ │
│ RECOMMENDATION: [Ship / Don't Ship / Extend] │
│ CONFIDENCE: [High / Medium / Low] │
│ │
│ KEY CAVEAT: [One sentence] │
└─────────────────────────────────────────────────────────────┘
Part 8: Building Long-Term Credibility
The Credibility Flywheel
Pre-register analysis
↓
Run analysis as planned
↓
Report honestly (including negatives)
↓
Welcome scrutiny
↓
Iterate and improve methods
↓
Build reputation for integrity
↓
Future analyses trusted more easily
↓
(Back to pre-register)
Practices That Build Trust
- Admit mistakes quickly: "I found an error in yesterday's analysis..."
- Show your uncertainty: "I'm 60% confident in this interpretation"
- Anticipate objections: "You might ask about X—here's what I found"
- Credit others: "Building on [colleague]'s earlier analysis..."
- Follow up on predictions: "Last quarter I predicted X; actual was Y"
Related Articles
Specific Topics
- One-Slide Experiment Readout Template - Presentation format
- Common Analyst Mistakes - What to avoid
- Writing Methods Sections - Documentation guide
- Communicating Uncertainty - Stakeholder communication
Process
- Pre-Registration Lite - Practical pre-reg
- Audit Trails - Documentation practices
- When to Say Inconclusive - Decision rules
- Experiment Guardrails - Protecting quality
Key Takeaway
Analytics reporting succeeds through transparency, not persuasion. Lead with your finding and recommendation, quantify uncertainty with confidence intervals, pre-register your analysis plan, document your decisions, and have the courage to report inconclusive results honestly. The goal isn't to survive one review—it's to build a reputation where your analyses are trusted because you've consistently been honest about what the data shows and doesn't show.
Frequently Asked Questions
How much methodology should I include?
Enough for a skeptic to verify your work without burying the finding: a short methodology section after the supporting analysis, with data checks, full tables, and code in the appendix, tuned to the audience (executives get the one-slide summary, data scientists get the appendix).
What if my results are inconclusive?
Say so plainly. Report the point estimate with its confidence interval, explain why the result is inconclusive (for example, an underpowered sample), and lay out the options and their tradeoffs as in Part 4.
How do I handle pushback on methodology?
Welcome it. Point to your pre-analysis plan and analysis log, run the sensitivity analyses reviewers ask for, and record both the request and the result.