
Analytics Reporting That Doesn't Get You Killed in Review

How to communicate statistical results to stakeholders without getting destroyed in review. Templates, common mistakes, and strategies for building trust through transparency.


Quick Hits

  • Lead with the answer, not the methodology
  • Report confidence intervals, not just point estimates
  • Pre-register your analysis plan to avoid accusations of p-hacking
  • Know when to say 'inconclusive' instead of forcing a narrative
  • Document everything—future you will thank present you

TL;DR

Analytics reporting is about building trust, not winning arguments. Lead with findings (not methods), quantify uncertainty (confidence intervals, not just point estimates), pre-register your analysis plan (avoid p-hacking accusations), and have the courage to report inconclusive results honestly. This guide covers templates, common mistakes, and strategies for creating reports that survive scrutiny and build your credibility over time.


The Stakes

What Happens in Review

Your analysis will be questioned on:

  • Methodology: Did you analyze this correctly?
  • Integrity: Did you p-hack or cherry-pick?
  • Completeness: Did you look at what matters?
  • Interpretation: Are your conclusions justified?

What Builds Trust

| Destroys Trust | Builds Trust |
| --- | --- |
| Hiding uncertainty | Quantifying uncertainty |
| Post-hoc rationalization | Pre-registered plans |
| Overstating conclusions | Measured, qualified claims |
| Missing obvious questions | Anticipating objections |
| Defensive reactions | Welcoming scrutiny |

Part 1: Structure Your Reports

The Inverted Pyramid

┌─────────────────────────────────────────┐
│           EXECUTIVE SUMMARY              │  ← Answer first
│         (1-2 sentences, key finding)     │
├─────────────────────────────────────────┤
│              KEY METRICS                 │  ← The numbers
│     (with confidence intervals)          │
├─────────────────────────────────────────┤
│         SUPPORTING ANALYSIS              │  ← Context and nuance
│   (segments, sensitivity, guardrails)    │
├─────────────────────────────────────────┤
│            METHODOLOGY                   │  ← For the skeptics
│     (how you did it, assumptions)        │
├─────────────────────────────────────────┤
│              APPENDIX                    │  ← Deep details
│    (data checks, full tables, code)      │
└─────────────────────────────────────────┘

Executive Summary Template

## Executive Summary

**Result**: [Treatment/Feature/Change] [increased/decreased/had no significant effect on]
[metric] by [X]% (95% CI: [Y]% to [Z]%).

**Recommendation**: [Ship/Don't ship/Collect more data] because [one sentence justification].

**Key caveat**: [One sentence on the most important limitation or uncertainty].

Example:

Result: The new checkout flow increased conversion rate by 3.2% (95% CI: 1.1% to 5.3%, p < 0.01).

Recommendation: Ship to all users based on significant lift exceeding our 2% MDE threshold.

Key caveat: Revenue per user showed no significant change; monitor post-launch.


Part 2: Quantify Uncertainty

Always Report Confidence Intervals

def format_result(estimate, ci_lower, ci_upper, p_value, metric_name):
    """
    Format a result with proper uncertainty reporting.
    """
    # Determine significance
    significant = ci_lower > 0 or ci_upper < 0

    # Format the result
    result = f"{metric_name}: {estimate:+.1%} (95% CI: {ci_lower:+.1%} to {ci_upper:+.1%})"

    if p_value is not None:
        result += f", p = {p_value:.3f}"

    if significant:
        result += " *"

    return result


# Example
results = [
    ("Conversion Rate", 0.032, 0.011, 0.053, 0.003),
    ("Revenue per User", 0.018, -0.012, 0.048, 0.24),
    ("Page Load Time", -0.08, -0.15, -0.01, 0.02),
]

print("Experiment Results")
print("=" * 60)
for metric, est, ci_l, ci_u, p in results:
    print(format_result(est, ci_l, ci_u, p, metric))
print("\n* Statistically significant at α = 0.05")

Avoid These Patterns

Bad: "Conversion increased by 3.2%" ✅ Good: "Conversion increased by 3.2% (95% CI: 1.1% to 5.3%)"

Bad: "p < 0.05, so it's significant" ✅ Good: "p = 0.003, CI excludes zero, effect size exceeds our MDE"

Bad: "The experiment was successful" ✅ Good: "The experiment detected a significant positive effect on conversion; revenue was inconclusive"


Part 3: Pre-Registration

Why Pre-Register

Pre-registration documents your analysis plan before seeing results:

  • Prevents accusations of p-hacking
  • Forces clear thinking about metrics and methods
  • Creates accountability
  • Builds credibility over time

Lightweight Pre-Registration Template

# Experiment Pre-Analysis Plan

## Experiment Info
- **Name**: New Checkout Flow v2
- **Start Date**: 2026-01-15
- **Planned End Date**: 2026-01-29 (or when powered)
- **Author**: [Your name]
- **Date Written**: 2026-01-14

## Hypothesis
The simplified checkout flow will increase conversion rate by reducing friction.

## Primary Metric
- **Metric**: Purchase conversion rate (purchases / sessions with cart)
- **MDE**: 2% relative lift
- **Analysis**: Two-proportion z-test, α = 0.05

## Secondary Metrics
1. Revenue per user
2. Cart abandonment rate
3. Time to purchase

## Guardrail Metrics
1. Page errors (should not increase)
2. Support tickets (should not increase)

## Sample Size
- Required: 50,000 per group (calculated at 80% power)
- Decision rule: Run until powered or 14 days, whichever comes first

## Analysis Plan
1. Check for SRM before analyzing results
2. Report ITT (intent-to-treat) as primary analysis
3. Segment analysis: new vs. returning users (pre-specified)

## What Constitutes Success
- Primary metric CI lower bound > 0
- No guardrail metric degradation
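
The "calculated at 80% power" line should be reproducible from the plan itself. A minimal sketch of one way to get the per-group sample size for a two-proportion test with statsmodels (the baseline rate below is hypothetical, so the output will not match the 50,000 figure in the template):

from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

# Hypothetical inputs: 10% baseline conversion rate, 2% relative MDE
baseline = 0.10
target = baseline * 1.02

# Cohen's h effect size for comparing two proportions
effect_size = proportion_effectsize(target, baseline)

n_per_group = NormalIndPower().solve_power(
    effect_size=effect_size,
    alpha=0.05,
    power=0.80,
    ratio=1.0,
    alternative="two-sided",
)
print(f"Required sample size per group: {n_per_group:,.0f}")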

What to Lock Down

| Always Pre-Specify | Can Be Exploratory |
| --- | --- |
| Primary metric | Additional segments |
| Sample size / stopping rule | Secondary correlations |
| Success criteria | Interaction effects |
| Guardrails | Post-hoc investigations |
| Key segments | Subgroup deep-dives |

Part 4: Handle Inconclusive Results

When Results Are Inconclusive

An experiment is inconclusive when:

  • The CI includes both effects large enough to act on (at or above the MDE) and zero or negative effects
  • Sample size was insufficient
  • External factors corrupted the experiment

How to Report Inconclusive

## Result: Inconclusive

The experiment did not reach a conclusive result.

### What We Observed
- Conversion: +2.1% (95% CI: -1.8% to +6.0%)
- The point estimate is positive but the CI includes our MDE of 2%
  and zero

### Why Inconclusive
- Achieved 65% of planned sample size due to early feature freeze
- Would need ~3 more weeks to achieve 80% power

### Recommendation
**Option A**: Extend experiment for 3 weeks
- Pro: Conclusive result
- Con: Delays roadmap

**Option B**: Ship based on directionally positive results
- Pro: Ship now
- Con: ~30% chance the true effect is negative

**Option C**: Abandon and move on
- Pro: Free up resources
- Con: Miss potential 2-6% lift

The Courage to Say "I Don't Know"

| Temptation | Better Response |
| --- | --- |
| "Results were positive" (but not significant) | "Results were directionally positive but inconclusive" |
| "No significant difference" (framing as no effect) | "We could not detect a difference with available data" |
| "The treatment works" (based on p = 0.06) | "Suggestive evidence but below significance threshold" |

Part 5: Common Mistakes and How to Avoid Them

Mistake 1: Multiple Testing Without Correction

def demonstrate_multiple_testing(n_metrics=10, alpha=0.05):
    """
    Show how uncorrected multiple testing inflates false positives.
    """
    expected_false_positives = n_metrics * alpha
    p_at_least_one = 1 - (1 - alpha) ** n_metrics

    print("Multiple Testing Problem")
    print("=" * 50)
    print(f"\nYou test {n_metrics} metrics. None have real effects.")
    print(f"At α = {alpha}, expected false positives: {expected_false_positives:.1f}")
    print(f"Probability of at least one false positive: {p_at_least_one:.0%}")
    print("\nSolution: Pre-specify primary metric, correct for others,")
    print("          or clearly label additional analyses as 'exploratory'.")


demonstrate_multiple_testing()
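
If secondary metrics are reported with p-values, one option is to adjust them for multiplicity rather than ignore the problem. A sketch using a Holm correction via statsmodels (the p-values are hypothetical):

from statsmodels.stats.multitest import multipletests

# Hypothetical raw p-values for five secondary metrics
p_values = [0.012, 0.049, 0.20, 0.003, 0.08]
reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method="holm")

for raw, adj, sig in zip(p_values, p_adjusted, reject):
    print(f"raw p = {raw:.3f} -> Holm-adjusted p = {adj:.3f}"
          + (" *" if sig else ""))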

Mistake 2: Cherry-Picking Segments

The Pattern:

  1. Overall result is flat
  2. Find a segment where treatment wins
  3. Report segment result as main finding

The Fix:

  • Pre-specify segments of interest
  • Report all pre-specified segments, not just winners
  • Label post-hoc findings as "exploratory"

Mistake 3: Ignoring Guardrails

Bad: "Conversion up 5%! Ship it!" Good: "Conversion up 5%, but revenue flat and support tickets up 20%. Investigate before shipping."

Mistake 4: Over-Interpreting Segment Differences

def segment_comparison_warning():
    """
    Comparing segments requires testing the interaction.
    """
    print("Segment Comparison Warning")
    print("=" * 50)
    print("\nYou observe:")
    print("  Mobile: +5% (significant)")
    print("  Desktop: +2% (not significant)")
    print("\n❌ Wrong conclusion: 'Treatment works better on mobile'")
    print("✅ Right approach: Test interaction (is 5% vs 2% significant?)")
    print("   Often, both are consistent with a ~3.5% overall effect")


segment_comparison_warning()
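
A minimal sketch of what "test the interaction" can mean in practice: a z-test on the difference between two independent segment estimates. The effects and standard errors below are hypothetical:

import numpy as np
from scipy import stats


def segment_interaction_test(est_a, se_a, est_b, se_b):
    """Z-test for whether two independent segment effects differ."""
    diff = est_a - est_b
    se_diff = np.sqrt(se_a**2 + se_b**2)
    z = diff / se_diff
    p = 2 * stats.norm.sf(abs(z))
    return diff, z, p


# Hypothetical: mobile +5% (SE 2%), desktop +2% (SE 1.5%)
diff, z, p = segment_interaction_test(0.05, 0.02, 0.02, 0.015)
print(f"Mobile vs. desktop difference: {diff:+.1%}, z = {z:.2f}, p = {p:.2f}")
# p ≈ 0.23 here: no evidence that the segments truly respond differently.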

Part 6: Documentation and Audit Trails

What to Document

# Analysis Log: Checkout Experiment

## Data Pull
- **Date**: 2026-01-30
- **Query**: `checkout_experiment_v2.sql`
- **Rows**: 145,892 sessions
- **Filters applied**: Excluded bot traffic (user_agent filter)

## Data Quality Checks
- [x] SRM check: Control 50.2%, Treatment 49.8% (p = 0.34, OK)
- [x] No duplicate user assignments
- [x] Exposure dates match experiment config

## Analysis Decisions
1. Excluded users with 0 sessions (n=234) — no exposure
2. Used ITT (intent-to-treat) — assigned, not exposed
3. Winsorized revenue at 99th percentile — 3 outliers

## Deviations from Pre-Analysis Plan
- Originally planned 14 days, extended to 16 due to holiday traffic
- Added post-hoc analysis of mobile vs desktop (labeled exploratory)

## Reviewer Notes
- [2026-01-31] @reviewer: Confirmed SRM check independently
- [2026-02-01] @analyst: Added sensitivity without Winsorization
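
The SRM check in the log above can be a simple chi-square goodness-of-fit test on assignment counts. A minimal sketch (the counts below are hypothetical, not the ones from the log):

from scipy import stats


def srm_check(n_control, n_treatment, expected_split=0.5):
    """Chi-square goodness-of-fit test for sample ratio mismatch."""
    total = n_control + n_treatment
    expected = [total * expected_split, total * (1 - expected_split)]
    chi2, p = stats.chisquare(f_obs=[n_control, n_treatment], f_exp=expected)
    return chi2, p


# Hypothetical assignment counts for a planned 50/50 split
chi2, p = srm_check(73_100, 72_800)
print(f"SRM check: chi2 = {chi2:.2f}, p = {p:.3f}")
# A very small p-value (e.g., p < 0.001) is a red flag: investigate the
# assignment pipeline before trusting any downstream results.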

Version Control for Analyses

# Good practices
git commit -m "Initial analysis: checkout experiment results"
git commit -m "Add sensitivity analysis per reviewer request"
git commit -m "Fix: correct date filter (was off by 1 day)"

# Tag final versions
git tag -a "checkout-exp-final" -m "Final analysis shared with leadership"

Part 7: Presentation Strategies

Know Your Audience

| Audience | They Care About | Give Them |
| --- | --- | --- |
| Executives | Decision, impact | 1-slide summary, recommendation |
| PMs | Details, tradeoffs | Full report, segment breakdowns |
| Data scientists | Methods, rigor | Appendix, code, methodology |
| Engineers | Implementation | What to build, edge cases |

One-Slide Summary Template

┌─────────────────────────────────────────────────────────────┐
│  EXPERIMENT: [Name]                          [Date]          │
├─────────────────────────────────────────────────────────────┤
│  RESULT: [Win / Loss / Inconclusive]                        │
│                                                             │
│  PRIMARY METRIC: [Name]                                     │
│  ┌─────────────────────────────────────────────────────┐   │
│  │  Treatment: [X]%    Control: [Y]%    Δ: [Z]%        │   │
│  │  95% CI: [A]% to [B]%    p = [C]                    │   │
│  └─────────────────────────────────────────────────────┘   │
│                                                             │
│  GUARDRAILS: ✓ No degradation detected                     │
│                                                             │
│  RECOMMENDATION: [Ship / Don't Ship / Extend]               │
│  CONFIDENCE: [High / Medium / Low]                          │
│                                                             │
│  KEY CAVEAT: [One sentence]                                 │
└─────────────────────────────────────────────────────────────┘

Part 8: Building Long-Term Credibility

The Credibility Flywheel

Pre-register analysis
       ↓
Run analysis as planned
       ↓
Report honestly (including negatives)
       ↓
Welcome scrutiny
       ↓
Iterate and improve methods
       ↓
Build reputation for integrity
       ↓
Future analyses trusted more easily
       ↓
(Back to pre-register)

Practices That Build Trust

  1. Admit mistakes quickly: "I found an error in yesterday's analysis..."
  2. Show your uncertainty: "I'm 60% confident in this interpretation"
  3. Anticipate objections: "You might ask about X—here's what I found"
  4. Credit others: "Building on [colleague]'s earlier analysis..."
  5. Follow up on predictions: "Last quarter I predicted X; actual was Y"




Frequently Asked Questions

How much methodology should I include?
Enough for someone to reproduce your analysis, but keep it out of the main narrative. Lead with findings, put methodology in an appendix or 'Methods' section. Executives want conclusions; technical reviewers want details. Serve both.
What if my results are inconclusive?
Say so clearly. 'We observed a 3% lift but cannot rule out no effect (95% CI: -2% to 8%). Recommend: collect more data or make a decision under uncertainty.' Inconclusive is a valid finding—forcing a false narrative destroys trust faster than admitting uncertainty.
How do I handle pushback on methodology?
Welcome it as an opportunity to build trust. Have your pre-analysis plan ready, show sensitivity analyses, explain your choices. The goal isn't to 'win'—it's to reach correct conclusions together. If pushback reveals a flaw, acknowledge it and adjust.

Key Takeaway

Good analytics reporting builds trust through transparency, not persuasion. Lead with findings, quantify uncertainty, pre-register your approach, document your assumptions, and have the courage to say 'inconclusive' when appropriate. The goal isn't to survive one review—it's to build a reputation where your analyses are trusted because you've consistently been honest about what you know and don't know.
