How to Write a Methods Section for Internal Docs That's Actually Copy-Ready
Templates and examples for writing clear, reproducible methods sections. Document your analysis so future you (and your colleagues) can understand and replicate it.
Quick Hits
- Write methods so someone can reproduce your analysis without asking you questions
- Include: data source, filters, metric definitions, statistical test, sample size
- Specify software versions and link to code when possible
- Document deviations from the pre-analysis plan and why
- Future you will thank present you: write it when it's fresh
TL;DR
A methods section should enable reproduction without clarifying questions. Include: data source and extraction date, inclusion/exclusion criteria, metric definitions, statistical test and parameters, sample sizes, and deviations from plan. Write it while details are fresh. Future you will forget which users you excluded and why—document it now.
Why Methods Documentation Matters
The Scenario You Want to Avoid
6 months later...
Stakeholder: "Can you re-run last year's checkout experiment analysis?"
You: *Opens old doc*
Old doc: "We analyzed conversion rate and found a 3.2% lift (p < 0.05)"
You: "Which users did I include? What date range? Did I filter bots?
What did I count as a conversion? Why is the sample size
different from what I remember?"
*Spends 2 days reverse-engineering own analysis*
What Good Documentation Provides
| Without Documentation | With Documentation |
|---|---|
| "We found a 3.2% lift" | Full reproducibility |
| Reverse-engineering required | Open doc, run query |
| Inconsistent re-analyses | Identical reproductions |
| Trust erosion over time | Auditable decisions |
The Methods Section Template
Complete Template
## Methods
### Data Source
- **Database/Table**: [e.g., analytics.experiment_assignments]
- **Extraction Date**: [e.g., 2026-01-28]
- **Date Range Analyzed**: [e.g., Jan 15-28, 2026]
- **Query/Code Reference**: [link to SQL/notebook]
### Sample Definition
- **Unit of Analysis**: [e.g., unique users, sessions, transactions]
- **Population**: [e.g., all users visiting the homepage]
- **Inclusion Criteria**:
- [Criterion 1]
- [Criterion 2]
- **Exclusion Criteria**:
- [Criterion 1 with rationale]
- [Criterion 2 with rationale]
- **Final Sample Size**: [n per group]
### Metric Definitions
- **Primary Metric**: [name]
- **Numerator**: [exact definition]
- **Denominator**: [exact definition]
- **Formula**: [explicit formula]
### Statistical Analysis
- **Test**: [e.g., two-sample proportion z-test]
- **Significance Level**: [e.g., α = 0.05, two-tailed]
- **Software**: [e.g., Python 3.9, scipy 1.7.3]
- **Assumptions Checked**: [list]
### Deviations from Pre-Analysis Plan
- [Any changes, with rationale]
### Limitations
- [Known issues or caveats]
Section-by-Section Guidance
Data Source
Bad Example:
Data was pulled from our analytics database.
Good Example:
### Data Source
- **Database**: analytics_prod.experiment_exposures
- **Extraction Date**: 2026-01-28 09:15 UTC
- **Date Range**: Users exposed Jan 15-28, 2026
- **Query**: experiments/checkout_v2/analysis.sql (commit abc123)
- **Data Freshness**: Production tables, 24-hour lag
Sample Definition
```python
def document_sample_definition():
    """
    Example of thorough sample documentation.
    """
    sample_docs = """
### Sample Definition
**Unit of Analysis**: User (deduplicated by user_id)
**Population**: All logged-in users who visited the checkout page
during the experiment period
**Inclusion Criteria**:
- Exposed to experiment (has assignment record in experiment_exposures)
- Visited checkout page at least once during exposure window
- Has valid user_id (not null, not test account)
**Exclusion Criteria**:
- Bot traffic: user_agent matches known bot patterns (see bot_filter.py)
  Rationale: Bots don't convert, inflate denominator artificially
  Excluded: 2,341 users (1.8%)
- Internal employees: email domain = @company.com
  Rationale: Different behavior than real users
  Excluded: 156 users (0.1%)
- Users with >100 sessions/day: Anomalous behavior
  Rationale: Likely automated or testing accounts
  Excluded: 23 users (<0.1%)
**Final Sample**:
- Control: 64,521 users
- Treatment: 64,892 users
- Total exclusions: 2,520 (1.9%)
"""
    print(sample_docs)

document_sample_definition()
```
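The exclusion counts above are easiest to keep accurate when the filtering code reports them itself. Below is a minimal pandas sketch of that step; the column names (`user_agent`, `email`, `sessions_per_day`) and the inline bot pattern standing in for bot_filter.py are assumptions, not the production schema.

```python
import pandas as pd

def apply_exclusions(users: pd.DataFrame) -> pd.DataFrame:
    """Apply exclusion criteria in order and print the counts for the methods doc.

    Assumes hypothetical columns: user_id, user_agent, email, sessions_per_day.
    """
    total = len(users)
    kept = users
    rules = [
        # Stand-in pattern for bot_filter.py's known bot signatures
        ("bot traffic", lambda df: ~df["user_agent"].str.contains(r"bot|crawler|spider", case=False, na=False)),
        ("internal employees", lambda df: ~df["email"].str.endswith("@company.com", na=False)),
        ("heavy usage (>100 sessions/day)", lambda df: df["sessions_per_day"] <= 100),
    ]
    for name, rule in rules:
        mask = rule(kept)
        excluded = int((~mask).sum())
        print(f"{name}: excluded {excluded:,} users ({excluded / total:.1%})")
        kept = kept[mask]
    print(f"Final sample: {len(kept):,} of {total:,} users")
    return kept

# Tiny illustrative frame; the real analysis runs on the extracted experiment table
demo = pd.DataFrame({
    "user_id": [1, 2, 3],
    "user_agent": ["Mozilla/5.0", "Googlebot/2.1", "Mozilla/5.0"],
    "email": ["a@gmail.com", "b@gmail.com", "c@company.com"],
    "sessions_per_day": [3, 1, 250],
})
analysis_sample = apply_exclusions(demo)
```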
Metric Definitions
Critical: Be obsessively specific.
### Primary Metric: Purchase Conversion Rate
**Definition**: The proportion of exposed users who completed at least
one purchase during their exposure window.
**Numerator**: Count of unique users with at least one transaction
where transaction_type = 'purchase' AND transaction_timestamp >
first_exposure_timestamp AND transaction_timestamp <
first_exposure_timestamp + 7 days
**Denominator**: Count of unique users in the experiment sample
**Formula**:
conversion_rate = (users_with_purchase / total_users) × 100
**Notes**:
- Attribution window: 7 days from first exposure
- A user with multiple purchases counts once in the numerator
- Refunded transactions are INCLUDED (purchase intent captured)
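Most ambiguity hides in the attribution window and the deduplication, so it can help to pair the definition with its computation. Here is a minimal pandas sketch of the metric as defined above; the `exposures` and `transactions` frames and their column names are assumptions rather than the real schema.

```python
import pandas as pd

def purchase_conversion_rate(exposures: pd.DataFrame, transactions: pd.DataFrame) -> float:
    """Conversion rate as defined above: unique users with >= 1 purchase within
    7 days of first exposure, divided by all exposed users.

    Assumed columns: exposures(user_id, first_exposure_timestamp),
    transactions(user_id, transaction_type, transaction_timestamp).
    """
    purchases = transactions[transactions["transaction_type"] == "purchase"]
    joined = purchases.merge(exposures, on="user_id", how="inner")
    in_window = joined[
        (joined["transaction_timestamp"] > joined["first_exposure_timestamp"])
        & (joined["transaction_timestamp"]
           < joined["first_exposure_timestamp"] + pd.Timedelta(days=7))
    ]
    users_with_purchase = in_window["user_id"].nunique()   # multiple purchases count once
    total_users = exposures["user_id"].nunique()
    return users_with_purchase / total_users * 100
```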
Statistical Analysis
### Statistical Analysis
**Primary Analysis**:
- Test: Two-sample proportion z-test (two-tailed)
- Null hypothesis: π_treatment = π_control
- Significance level: α = 0.05
- Minimum detectable effect: 2% relative lift (pre-specified)
**Confidence Interval**:
- Method: Wald interval for difference in proportions
- Formula: (p̂_t - p̂_c) ± z_{α/2} × SE
- SE = sqrt(p̂_t(1-p̂_t)/n_t + p̂_c(1-p̂_c)/n_c)
**Software**:
- Python 3.9.7
- scipy 1.7.3
- statsmodels 0.13.2 (statsmodels.stats.proportion.proportions_ztest)
- Analysis notebook: analysis/checkout_v2_final.ipynb
**Assumptions Checked**:
- Sample size > 30 per group: ✓ (n > 64,000)
- Expected successes/failures > 5: ✓
- Independence: Users randomized individually ✓
- SRM check: p = 0.34 (no ratio mismatch)
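The documented parameters map directly onto a few library calls. Here is a hedged sketch with illustrative conversion counts (note that `proportions_ztest` lives in statsmodels, not scipy):

```python
import numpy as np
from scipy.stats import chisquare
from statsmodels.stats.proportion import proportions_ztest

# Illustrative conversion counts; sample sizes match the documented groups
conversions = np.array([4_512, 4_870])    # control, treatment (hypothetical)
exposed = np.array([64_521, 64_892])

# Two-sample proportion z-test, two-tailed, alpha = 0.05
z_stat, p_value = proportions_ztest(count=conversions, nobs=exposed, alternative="two-sided")

# Wald interval for the difference in proportions (z_{0.025} ~= 1.96)
p_c, p_t = conversions / exposed
se = np.sqrt(p_t * (1 - p_t) / exposed[1] + p_c * (1 - p_c) / exposed[0])
ci = ((p_t - p_c) - 1.96 * se, (p_t - p_c) + 1.96 * se)

# SRM check: chi-square of assignment counts against a 50/50 split
srm_stat, srm_p = chisquare(exposed, f_exp=[exposed.sum() / 2] * 2)

print(f"z = {z_stat:.2f}, p = {p_value:.4f}")
print(f"95% CI for difference in proportions: ({ci[0]:.4f}, {ci[1]:.4f})")
print(f"SRM check p-value: {srm_p:.2f}")
```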
Common Scenarios
Scenario 1: A/B Test with Continuous Metric
### Methods: Revenue Per User Analysis
**Data Source**:
- Table: analytics.transactions joined with analytics.experiment_exposures
- Date extracted: 2026-01-29
- Experiment period: 2026-01-15 to 2026-01-28
**Sample**:
- All users assigned to checkout_flow_v2 experiment
- Exclusions: Bots (2.1%), employees (0.1%), fraud flags (0.3%)
- Final: Control n=62,445, Treatment n=62,891
**Metric**: Revenue Per User (RPU)
- Sum of transaction_amount for each user during attribution window
- Attribution window: 14 days from assignment
- Users with no purchases contribute $0 to numerator
- Outlier handling: Winsorized at 99th percentile ($847)
- Rationale: the 0.8% of users with orders above $847 were distorting the effect estimate
- Pre-specified in analysis plan
**Analysis**:
- Test: Welch's t-test (unequal variances assumed)
- Confidence interval: Bootstrap BCa, 10,000 iterations
- α = 0.05, two-tailed
- Software: Python 3.9, scipy 1.7.3, statsmodels 0.13.2
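A minimal sketch of this analysis under the stated choices, with illustrative revenue vectors standing in for the real per-user data. Note that two-sample BCa intervals in `scipy.stats.bootstrap` require a recent scipy (1.11 or later); on older versions, `method="percentile"` is the fallback.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Illustrative per-user revenue (zeros for non-purchasers); real arms had ~62k users each
rpu_control = rng.exponential(scale=40.0, size=5_000) * (rng.random(5_000) < 0.07)
rpu_treatment = rng.exponential(scale=42.0, size=5_000) * (rng.random(5_000) < 0.07)

# Winsorize at the pooled 99th percentile, as pre-specified in the analysis plan
cap = np.percentile(np.concatenate([rpu_control, rpu_treatment]), 99)
w_control = np.minimum(rpu_control, cap)
w_treatment = np.minimum(rpu_treatment, cap)

# Welch's t-test (unequal variances), two-tailed
t_stat, p_value = stats.ttest_ind(w_treatment, w_control, equal_var=False)

# Bootstrap CI for the difference in means; BCa for two-sample statistics
# needs scipy >= 1.11 (older versions: method="percentile")
res = stats.bootstrap(
    (w_treatment, w_control),
    statistic=lambda t, c: np.mean(t) - np.mean(c),
    vectorized=False,
    n_resamples=10_000,
    method="BCa",
)
print(f"Welch t = {t_stat:.2f}, p = {p_value:.4f}")
print(f"95% CI for RPU difference: {res.confidence_interval}")
```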
Scenario 2: Model Comparison
### Methods: Classification Model Comparison
**Evaluation Dataset**:
- Source: eval_set_v3, labeled by 3 raters
- Date created: 2026-01-20
- n = 2,500 examples (500 positive, 2,000 negative)
- Class balance: 20% positive (matches production distribution)
**Models Compared**:
- Baseline: logistic_v2 (production model, deployed 2025-11-01)
- Candidate: transformer_v1 (trained 2026-01-15)
**Primary Metric**: Area Under ROC Curve (AUC)
- Computed using sklearn.metrics.roc_auc_score
- Averaged across 3 labeler judgments (majority vote for ties)
**Statistical Test**:
- DeLong's test for comparing correlated AUCs
- Implementation: pROC package via rpy2
- α = 0.05, two-tailed
**Secondary Metrics**:
- Precision at 90% recall
- F1 score at optimal threshold
- Note: Secondary metrics exploratory, not corrected for multiple comparisons
**Confidence Intervals**:
- Bootstrap, 2,000 iterations, BCa method
- Paired bootstrap (same examples in each resample)
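DeLong's test is delegated to pROC via rpy2 above, but the paired bootstrap can be sketched directly in Python. The labels and model scores below are synthetic placeholders, and this sketch uses a simple percentile interval rather than BCa.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 2_500
y_true = (rng.random(n) < 0.20).astype(int)              # ~20% positive, as in eval_set_v3
scores_baseline = 0.8 * y_true + rng.normal(size=n)      # hypothetical model scores
scores_candidate = 1.2 * y_true + rng.normal(size=n)

def paired_bootstrap_auc_diff(y, s_base, s_cand, n_iter=2_000, seed=1):
    """Paired bootstrap: resample the same examples for both models each iteration."""
    rng = np.random.default_rng(seed)
    diffs = []
    for _ in range(n_iter):
        idx = rng.integers(0, len(y), size=len(y))
        if y[idx].sum() in (0, len(idx)):
            continue   # degenerate resample with a single class; AUC undefined
        diffs.append(roc_auc_score(y[idx], s_cand[idx]) - roc_auc_score(y[idx], s_base[idx]))
    return np.percentile(diffs, [2.5, 97.5])

lo, hi = paired_bootstrap_auc_diff(y_true, scores_baseline, scores_candidate)
print(f"AUC(candidate) - AUC(baseline), 95% percentile CI: [{lo:.3f}, {hi:.3f}]")
```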
Scenario 3: Survey Analysis
### Methods: User Satisfaction Survey
**Data Collection**:
- Survey deployed: 2026-01-20 to 2026-01-25
- Platform: SurveyMonkey, randomized question order
- Population: Users who completed a purchase in past 30 days
- Sampling: Random sample of 10,000 eligible users invited
**Response**:
- Invited: 10,000
- Responded: 1,847 (18.5% response rate)
- Complete responses: 1,702 (17.0% completion rate)
- Excluded: 145 partial responses (missing primary question)
**Measure**: Net Promoter Score (NPS)
- Question: "How likely are you to recommend [Product] to a friend?"
- Scale: 0-10
- NPS = % Promoters (9-10) - % Detractors (0-6)
**Analysis**:
- NPS comparison: Bootstrap confidence interval (10,000 iterations)
- Demographic comparisons: Chi-square tests with Holm correction
- α = 0.05 for primary analysis
**Limitations**:
- Self-selection bias: Respondents may differ from non-respondents
- Recency bias: Survey sent within 7 days of purchase
- No control group: Pre-post comparison only
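The NPS cut-points are easy to get wrong (passives at 7-8 count in the base but in neither bucket), so a small sketch of the calculation and a percentile-bootstrap CI follows; the ratings vector is hypothetical.

```python
import numpy as np

def nps(ratings: np.ndarray) -> float:
    """NPS = % promoters (9-10) minus % detractors (0-6); passives (7-8) count only in the base."""
    promoters = np.mean(ratings >= 9)
    detractors = np.mean(ratings <= 6)
    return 100 * (promoters - detractors)

rng = np.random.default_rng(7)
ratings = rng.integers(0, 11, size=1_702)   # hypothetical 0-10 responses

# Bootstrap CI, 10,000 iterations (percentile method for this sketch)
boot = [nps(rng.choice(ratings, size=ratings.size, replace=True)) for _ in range(10_000)]
lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"NPS = {nps(ratings):.1f}, 95% CI [{lo:.1f}, {hi:.1f}]")
```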
Deviations Documentation
When You Change the Plan
### Deviations from Pre-Analysis Plan
**Deviation 1: Extended experiment duration**
- Original plan: 14 days
- Actual: 16 days
- Reason: Holiday weekend (Jan 18-19) caused traffic spike; extended to
ensure representative sample
- Impact: Sample size increased by ~15%; should improve precision
**Deviation 2: Added post-hoc segment analysis**
- Original plan: Overall analysis only
- Added: Mobile vs Desktop comparison
- Reason: Stakeholder request after seeing flat overall result
- Status: Labeled as EXPLORATORY in findings
- Note: No significant interaction detected (p = 0.23)
**Deviation 3: Changed outlier threshold**
- Original plan: Winsorize at 99th percentile
- Actual: Winsorize at 97th percentile
- Reason: 99th percentile was $2,847 due to one enterprise order
- Impact: Sensitivity analysis shows results consistent at both thresholds
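Deviation 3 leans on a sensitivity analysis. A minimal sketch of re-running the test at both winsorization thresholds, with illustrative revenue vectors in place of the real data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
revenue_control = rng.exponential(scale=20.0, size=5_000)     # hypothetical
revenue_treatment = rng.exponential(scale=21.0, size=5_000)   # hypothetical

# Re-run the primary test at both the planned and the actual winsorization threshold
for pct in (97, 99):
    cap = np.percentile(np.concatenate([revenue_control, revenue_treatment]), pct)
    t, p = stats.ttest_ind(
        np.minimum(revenue_treatment, cap),
        np.minimum(revenue_control, cap),
        equal_var=False,
    )
    print(f"Winsorized at p{pct} (cap ${cap:,.0f}): t = {t:.2f}, p = {p:.4f}")
```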
Code Documentation Standards
Linking Code to Methods
### Code References
**Main Analysis**:
- Repository: github.com/company/experiments
- Path: checkout_v2/analysis/main_analysis.py
- Commit: abc123def (tagged: checkout-v2-final)
**Key Functions**:
- `compute_conversion_rate()`: Primary metric calculation
- `run_proportion_test()`: Statistical test (lines 145-180)
- `bootstrap_ci()`: Confidence interval computation
**Data Pipeline**:
- Query: checkout_v2/sql/extract_experiment_data.sql
- Pre-processing: checkout_v2/scripts/clean_data.py
**To Reproduce**:
```bash
git checkout checkout-v2-final
python -m checkout_v2.analysis.main_analysis \
--config config/checkout_v2.yaml \
    --output results/
```
Version Pinning
```markdown
### Software Environment
**Python Version**: 3.9.7
**Key Package Versions**:
- pandas==1.4.2
- scipy==1.7.3
- statsmodels==0.13.2
- numpy==1.22.3
**Full Environment**:
- requirements.txt: checkout_v2/requirements.txt
- Frozen: checkout_v2/requirements-frozen.txt (pip freeze output)
```
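Version lines like these can be generated rather than typed by hand. A small sketch using importlib.metadata; the package list is an assumption to adjust per analysis.

```python
import sys
from importlib.metadata import PackageNotFoundError, version

packages = ["pandas", "scipy", "statsmodels", "numpy"]   # adjust to your analysis

print(f"**Python Version**: {sys.version.split()[0]}")
print("**Key Package Versions**:")
for pkg in packages:
    try:
        print(f"- {pkg}=={version(pkg)}")
    except PackageNotFoundError:
        print(f"- {pkg}: not installed")
```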
Template Generator
```python
def generate_methods_template(analysis_type='ab_test'):
    """
    Generate a methods section template for common analysis types.
    """
    templates = {
        'ab_test': """
## Methods
### Data Source
- Database/Table:
- Extraction Date:
- Date Range:
- Query Reference:
### Sample Definition
- Unit of Analysis:
- Population:
- Inclusion Criteria:
- Exclusion Criteria (with rationale and counts):
- Final Sample Size (per group):
### Metric Definitions
Primary Metric: [Name]
- Numerator:
- Denominator:
- Attribution window:
- Special handling (outliers, zeros):
### Statistical Analysis
- Test:
- Significance Level:
- Confidence Interval Method:
- Assumptions Checked:
- Software and Versions:
### Deviations from Pre-Analysis Plan
- (List any deviations with rationale)
### Limitations
- (Known issues or caveats)
""",
        'model_eval': """
## Methods
### Evaluation Dataset
- Source:
- Date Created:
- n Examples:
- Class Distribution:
- Label Source (human/automated):
### Models Compared
- Baseline:
- Candidate:
### Primary Metric
- Metric Name:
- Computation Method:
- Handling of Edge Cases:
### Statistical Test
- Test Name:
- Null Hypothesis:
- Significance Level:
- Implementation:
### Confidence Intervals
- Method:
- Number of Iterations:
- Type (paired/unpaired):
### Secondary Metrics (Exploratory)
- (List with note on multiple comparison status)
### Limitations
- (Known issues)
"""
    }
    print(f"Methods Template for {analysis_type.upper()}")
    print("=" * 50)
    print(templates.get(analysis_type, templates['ab_test']))

generate_methods_template('ab_test')
```
R Implementation
```r
# Function to generate methods documentation
generate_methods_doc <- function(
  data_source,
  date_extracted,
  sample_size_control,
  sample_size_treatment,
  metric_name,
  test_type,
  alpha = 0.05,
  deviations = NULL
) {
  cat("## Methods\n\n")
  cat("### Data Source\n")
  cat("- Source:", data_source, "\n")
  cat("- Date extracted:", date_extracted, "\n\n")
  cat("### Sample\n")
  cat("- Control:", sample_size_control, "\n")
  cat("- Treatment:", sample_size_treatment, "\n\n")
  cat("### Primary Metric\n")
  cat("- Metric:", metric_name, "\n\n")
  cat("### Statistical Analysis\n")
  cat("- Test:", test_type, "\n")
  cat("- Significance level: α =", alpha, "\n")
  cat("- Software:", R.version.string, "\n\n")
  if (!is.null(deviations)) {
    cat("### Deviations from Plan\n")
    for (d in deviations) {
      cat("-", d, "\n")
    }
  }
}

# Example usage
generate_methods_doc(
  data_source = "analytics.experiment_data",
  date_extracted = "2026-01-28",
  sample_size_control = 64521,
  sample_size_treatment = 64892,
  metric_name = "Purchase conversion rate",
  test_type = "Two-sample proportion z-test",
  deviations = c("Extended duration from 14 to 16 days")
)
```
Checklist Before Submitting
Methods Section Completeness Checklist:
□ Data source and extraction date specified?
□ Date range of analysis clear?
□ Query/code linked and versioned?
□ All inclusion criteria listed?
□ All exclusion criteria listed with rationale and counts?
□ Final sample sizes stated?
□ Primary metric formula explicit (not just name)?
□ Attribution windows specified?
□ Outlier handling documented?
□ Statistical test named?
□ Assumptions listed and checked?
□ Significance level stated?
□ Software versions specified?
□ Deviations from pre-analysis plan documented?
□ Limitations acknowledged?
□ Could someone reproduce this without asking me questions?
Related Articles
- Analytics Reporting (Pillar) - Complete reporting guide
- Pre-Registration Lite - Pre-analysis planning
- Audit Trails - Documentation practices
- Common Analyst Mistakes - What to avoid
Key Takeaway
A good methods section lets someone reproduce your analysis without asking you a single question. Include your data source with extraction date, inclusion/exclusion criteria with counts and rationale, exact metric definitions (not just names), statistical test with parameters, sample sizes, software versions, and any deviations from your pre-analysis plan. Write it while the details are fresh—six months from now, you won't remember why you excluded those 2,341 users or what threshold you used for outliers. Document for your future self, and you'll document well for everyone else too.
Frequently Asked Questions
How much detail is too much?
Enough that someone could reproduce the analysis without asking you questions; beyond that, link rather than paste. Keep exact definitions, criteria, counts, and test parameters in the methods section, and push long queries and notebooks into linked, versioned files.
Should I include code in the methods?
Link to it rather than paste it: repository, path, and commit or tag, plus the key functions and pinned software versions needed to re-run the analysis.
What if I changed my approach mid-analysis?
Document each deviation from the pre-analysis plan with its rationale and impact, and label any post-hoc additions as exploratory.