Ratio Metrics (CTR, Conversion): Why They're Tricky and Stable Alternatives
Why ratio metrics like CTR and conversion rates require special statistical treatment. Learn about variance estimation, the delta method, and when to use alternative approaches.
Quick Hits
- Ratio metrics (CTR = clicks/impressions) have correlated numerator and denominator
- Standard variance formulas assume independent trials, so they are wrong for ratios
- The delta method gives an approximately correct variance when numerator and denominator are correlated
- User-level metrics (clicks per user, conversions per user) avoid ratio complications
- Different randomization units need different variance calculations
TL;DR
Ratio metrics like CTR (clicks/impressions) and conversion rate (purchases/visitors) are ubiquitous in product analytics. But their variance estimation is trickier than it looks—the numerator and denominator are correlated, breaking standard formulas. The delta method provides correct standard errors. An often simpler alternative: use user-level metrics (clicks per user) instead of ratio metrics (total clicks / total impressions). User-level metrics have straightforward variance, align with experiment randomization, and avoid ratio pitfalls.
The Problem with Ratio Metrics
What Are Ratio Metrics?
$$\text{CTR} = \frac{\sum_{i} \text{clicks}_i}{\sum_{i} \text{impressions}_i} = \frac{Y}{X}$$
$$\text{Conversion Rate} = \frac{\text{Conversions}}{\text{Visitors}}$$
Why Standard Variance Fails
Naive approach: Treat the ratio as a binomial proportion, with $n$ equal to the total number of impressions: $$\text{Var}(\hat{p}) = \frac{p(1-p)}{n}$$
Problem: This assumes:
- Each trial (impression) is independent
- Each trial has same probability
Reality:
- Users have different click propensities
- Users with more impressions contribute more to both numerator AND denominator
- Numerator and denominator are positively correlated
Illustration
import numpy as np
from scipy import stats
def demonstrate_ratio_variance_problem():
    """
    Show how naive variance estimation fails for ratio metrics.
    """
    np.random.seed(42)

    n_users = 1000
    n_simulations = 500

    # Simulation: compute CTR many times, drawing a fresh sample of users each time
    ctrs_actual = []
    ctrs_naive_se = []

    for _ in range(n_simulations):
        # User-level parameters, redrawn inside the loop because users are the
        # sampling unit. With one fixed set of users, pooled clicks given total
        # impressions would be binomial and the naive SE would look correct.
        base_impressions = np.random.exponential(50, n_users)  # Heavy user variation
        user_ctr = np.random.beta(2, 20, n_users)  # Individual CTRs, mean ~0.09

        # Sample impressions and clicks for each user
        impressions = np.random.poisson(base_impressions)
        clicks = np.random.binomial(impressions, user_ctr)

        total_clicks = clicks.sum()
        total_impressions = impressions.sum()
        ctr = total_clicks / total_impressions

        # Naive SE (binomial assumption)
        naive_se = np.sqrt(ctr * (1 - ctr) / total_impressions)

        ctrs_actual.append(ctr)
        ctrs_naive_se.append(naive_se)

    # Compare
    actual_se = np.std(ctrs_actual)
    mean_naive_se = np.mean(ctrs_naive_se)

    print("Ratio Metric Variance Problem")
    print("=" * 50)
    print(f"True SE of CTR (from simulations): {actual_se:.6f}")
    print(f"Naive SE (binomial assumption): {mean_naive_se:.6f}")
    print(f"Ratio (naive / true): {mean_naive_se / actual_se:.2f}")
    print(f"\n⚠️ Naive SE is {1 - mean_naive_se / actual_se:.0%} below the true SE.")
    print("   This leads to inflated significance and false positives.")

demonstrate_ratio_variance_problem()
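The false-positive claim can be checked directly with an A/A simulation: both arms are drawn from the same distribution, so any "significant" result is a false positive. The sketch below is illustrative rather than definitive. The helper names (`simulate_arm`, `naive_var`, `delta_var`, `aa_false_positive_rates`) and the simulation parameters are assumptions made for this example, and the delta-method variance it uses is the one described in the next section.

```python
import numpy as np
from scipy import stats

def simulate_arm(rng, n_users):
    """Draw one arm of users with heterogeneous exposure and click propensity."""
    impressions = rng.poisson(rng.exponential(50, n_users))
    clicks = rng.binomial(impressions, rng.beta(2, 20, n_users))
    return clicks, impressions

def naive_var(clicks, impressions):
    """Binomial variance of the pooled CTR, treating impressions as i.i.d. trials."""
    ctr = clicks.sum() / impressions.sum()
    return ctr * (1 - ctr) / impressions.sum()

def delta_var(clicks, impressions):
    """Delta-method variance of the pooled CTR, treating users as the i.i.d. unit."""
    ctr = clicks.sum() / impressions.sum()
    residuals = clicks - ctr * impressions
    return residuals.var(ddof=1) / (len(clicks) * impressions.mean() ** 2)

def aa_false_positive_rates(n_tests=300, n_users=500, alpha=0.05, seed=0):
    """Run A/A tests (no true effect) and count rejections under each variance."""
    rng = np.random.default_rng(seed)
    z_crit = stats.norm.ppf(1 - alpha / 2)
    rejections = {"naive": 0, "delta": 0}
    for _ in range(n_tests):
        c_a, i_a = simulate_arm(rng, n_users)
        c_b, i_b = simulate_arm(rng, n_users)
        diff = c_b.sum() / i_b.sum() - c_a.sum() / i_a.sum()
        for name, var_fn in (("naive", naive_var), ("delta", delta_var)):
            z = diff / np.sqrt(var_fn(c_a, i_a) + var_fn(c_b, i_b))
            rejections[name] += int(abs(z) > z_crit)
    return {name: count / n_tests for name, count in rejections.items()}

rates = aa_false_positive_rates()
print(f"Naive z-test false positive rate: {rates['naive']:.1%}")
print(f"Delta z-test false positive rate: {rates['delta']:.1%}")
```

With a nominal 5% level, the delta-method test should reject close to 5% of the time, while the naive test should reject far more often under this degree of user heterogeneity.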
The Delta Method
The Formula
For a ratio $R = Y/X$, where $Y$ is the numerator sum and $X$ is the denominator sum over $n$ users:
$$\text{Var}(R) \approx \frac{1}{\bar{X}^2} \left[ \text{Var}(\bar{Y}) - 2R \cdot \text{Cov}(\bar{Y}, \bar{X}) + R^2 \cdot \text{Var}(\bar{X}) \right]$$
Where:
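- $\bar{Y}$ = mean of numerators (per user)
- $\bar{X}$ = mean of denominators (per user)
- $R = \bar{Y} / \bar{X}$ = the ratio
- $\text{Var}(\bar{Y})$, $\text{Var}(\bar{X})$, and $\text{Cov}(\bar{Y}, \bar{X})$ = the per-user sample variances and covariance, each divided by the number of users $n$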
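A sketch of where this formula comes from, assuming users are independent and $\bar{X}$ is well away from zero. The symbols $\mu_Y$ and $\mu_X$ (population means of the per-user numerator $Y_i$ and denominator $X_i$) are notation introduced only for this sketch. A first-order Taylor expansion of $\bar{Y}/\bar{X}$ around $(\mu_Y, \mu_X)$ gives:

$$\frac{\bar{Y}}{\bar{X}} \approx \frac{\mu_Y}{\mu_X} + \frac{1}{\mu_X}\left(\bar{Y} - \mu_Y\right) - \frac{\mu_Y}{\mu_X^2}\left(\bar{X} - \mu_X\right)$$

Taking the variance of the right-hand side, then plugging in $\bar{X}$ for $\mu_X$ and $R$ for $\mu_Y/\mu_X$:

$$\text{Var}(R) \approx \frac{1}{\bar{X}^2} \left[ \text{Var}(\bar{Y}) - 2R \cdot \text{Cov}(\bar{Y}, \bar{X}) + R^2 \cdot \text{Var}(\bar{X}) \right] = \frac{1}{n \bar{X}^2} \text{Var}(Y_i - R X_i)$$

The last form says that the ratio behaves like an ordinary mean of the per-user residuals $Y_i - R X_i$, scaled by $1/\bar{X}$.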
Implementation
import numpy as np
def delta_method_ratio_se(numerators, denominators):
    """
    Compute standard error for ratio using delta method.

    Parameters:
    -----------
    numerators : array
        Per-unit numerator values (e.g., clicks per user)
    denominators : array
        Per-unit denominator values (e.g., impressions per user)

    Returns:
    --------
    dict with ratio, SE, and CI
    """
    n = len(numerators)

    # Means
    mean_y = np.mean(numerators)
    mean_x = np.mean(denominators)
    ratio = mean_y / mean_x

    # Variances and covariance
    var_y = np.var(numerators, ddof=1)
    var_x = np.var(denominators, ddof=1)
    cov_xy = np.cov(numerators, denominators, ddof=1)[0, 1]

    # Delta method variance
    var_ratio = (1 / mean_x**2) * (
        var_y -
        2 * ratio * cov_xy +
        ratio**2 * var_x
    ) / n

    se = np.sqrt(var_ratio)

    return {
        'ratio': ratio,
        'se': se,
        'ci_lower': ratio - 1.96 * se,
        'ci_upper': ratio + 1.96 * se,
        'var_y': var_y,
        'var_x': var_x,
        'cov_xy': cov_xy
    }

def compare_variance_methods(numerators, denominators, n_bootstrap=2000):
    """
    Compare delta method, naive, and bootstrap SE estimates.
    """
    # Delta method
    delta_result = delta_method_ratio_se(numerators, denominators)

    # Naive (binomial-like)
    total_y = np.sum(numerators)
    total_x = np.sum(denominators)
    ratio = total_y / total_x
    naive_se = np.sqrt(ratio * (1 - ratio) / total_x)

    # Bootstrap: resample users (not impressions) with replacement
    n = len(numerators)
    boot_ratios = []
    for _ in range(n_bootstrap):
        idx = np.random.choice(n, n, replace=True)
        boot_ratio = np.sum(numerators[idx]) / np.sum(denominators[idx])
        boot_ratios.append(boot_ratio)
    bootstrap_se = np.std(boot_ratios)

    return {
        'ratio': ratio,
        'delta_se': delta_result['se'],
        'naive_se': naive_se,
        'bootstrap_se': bootstrap_se
    }

# Example: CTR data
np.random.seed(42)
n_users = 2000
# User heterogeneity
user_impressions = np.random.poisson(50, n_users)
user_ctr = np.random.beta(2, 20, n_users)
user_clicks = np.random.binomial(user_impressions, user_ctr)
result = compare_variance_methods(user_clicks, user_impressions)
print("Variance Method Comparison for CTR")
print("=" * 50)
print(f"CTR: {result['ratio']:.4f}")
print(f"\nStandard Error Estimates:")
print(f" Delta method: {result['delta_se']:.6f}")
print(f" Bootstrap: {result['bootstrap_se']:.6f}")
print(f" Naive: {result['naive_se']:.6f}")
print(f"\nNaive underestimates by: {(1 - result['naive_se']/result['delta_se'])*100:.0f}%")
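An equivalent way to get the delta-method SE, sketched here under the same assumption of independent users, is to linearize the ratio: compute the per-user values $(y_i - R x_i)/\bar{x}$ and take the ordinary standard error of their mean. The function names `linearized_ratio_se` and `formula_ratio_se` are introduced for this illustration only.

```python
import numpy as np

def linearized_ratio_se(numerators, denominators):
    """Delta-method SE via linearization: SE of the mean of (y_i - R * x_i) / mean(x)."""
    y = np.asarray(numerators, dtype=float)
    x = np.asarray(denominators, dtype=float)
    ratio = y.sum() / x.sum()
    linearized = (y - ratio * x) / x.mean()
    return ratio, linearized.std(ddof=1) / np.sqrt(len(y))

def formula_ratio_se(numerators, denominators):
    """Same SE from the explicit variance/covariance formula, for comparison."""
    y = np.asarray(numerators, dtype=float)
    x = np.asarray(denominators, dtype=float)
    ratio = y.mean() / x.mean()
    cov = np.cov(y, x, ddof=1)  # 2x2 matrix: [[var_y, cov_xy], [cov_xy, var_x]]
    var = (cov[0, 0] - 2 * ratio * cov[0, 1] + ratio**2 * cov[1, 1]) / (
        len(y) * x.mean() ** 2
    )
    return ratio, np.sqrt(var)

# Simulated per-user data (illustrative parameters)
rng = np.random.default_rng(0)
impressions = rng.poisson(50, 2000)
clicks = rng.binomial(impressions, rng.beta(2, 20, 2000))

ratio_lin, se_lin = linearized_ratio_se(clicks, impressions)
ratio_formula, se_formula = formula_ratio_se(clicks, impressions)
print(f"CTR: {ratio_lin:.4f}")
print(f"SE (linearized):       {se_lin:.6f}")
print(f"SE (explicit formula): {se_formula:.6f}")
```

The two SEs agree up to floating-point error. The linearized form is convenient because the per-user values can be passed to any tool that already handles means, such as a t-test.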
Two-Sample Comparison with Delta Method
from scipy import stats

def compare_ratios_delta(num_control, denom_control, num_treatment, denom_treatment):
    """
    Compare ratio metrics between two groups using delta method.

    Parameters:
    -----------
    num_control : array
        Numerator per unit in control (e.g., clicks per user)
    denom_control : array
        Denominator per unit in control (e.g., impressions per user)
    num_treatment : array
        Numerator per unit in treatment
    denom_treatment : array
        Denominator per unit in treatment

    Returns:
    --------
    dict with ratio estimates, difference, SE, and p-value
    """
    # Control
    ctrl_result = delta_method_ratio_se(num_control, denom_control)

    # Treatment
    treat_result = delta_method_ratio_se(num_treatment, denom_treatment)

    # Difference (groups are independent, so their variances add)
    diff = treat_result['ratio'] - ctrl_result['ratio']
    se_diff = np.sqrt(ctrl_result['se']**2 + treat_result['se']**2)

    # Z-test (two-sided)
    z_stat = diff / se_diff
    p_value = 2 * stats.norm.sf(abs(z_stat))

    # Relative lift
    lift = diff / ctrl_result['ratio'] if ctrl_result['ratio'] > 0 else np.inf

    return {
        'control_ratio': ctrl_result['ratio'],
        'control_se': ctrl_result['se'],
        'treatment_ratio': treat_result['ratio'],
        'treatment_se': treat_result['se'],
        'difference': diff,
        'se_diff': se_diff,
        'lift': lift,
        'z_stat': z_stat,
        'p_value': p_value
    }

# Example: A/B test on CTR
np.random.seed(42)
# Control
n_control = 5000
ctrl_impressions = np.random.poisson(40, n_control)
ctrl_ctr = np.random.beta(2, 20, n_control)  # mean CTR = 2/22 ≈ 9.1%
ctrl_clicks = np.random.binomial(ctrl_impressions, ctrl_ctr)
# Treatment (~4.5% lift in mean CTR: 2.1/22.1 ≈ 9.5%)
n_treatment = 5000
treat_impressions = np.random.poisson(40, n_treatment)
treat_ctr = np.random.beta(2.1, 20, n_treatment) # Slightly higher
treat_clicks = np.random.binomial(treat_impressions, treat_ctr)
result = compare_ratios_delta(ctrl_clicks, ctrl_impressions,
treat_clicks, treat_impressions)
print("A/B Test: CTR Comparison (Delta Method)")
print("=" * 60)
print(f"Control CTR: {result['control_ratio']:.4f} (SE: {result['control_se']:.4f})")
print(f"Treatment CTR: {result['treatment_ratio']:.4f} (SE: {result['treatment_se']:.4f})")
print(f"\nDifference: {result['difference']:.4f}")
print(f"Relative Lift: {result['lift']:.2%}")
print(f"SE of Difference: {result['se_diff']:.4f}")
print(f"p-value: {result['p_value']:.4f}")
The Better Alternative: User-Level Metrics
Ratio Metric vs. User-Level Metric
| Approach | Definition | Weighting | Variance |
|---|---|---|---|
| Ratio (CTR) | Σclicks / Σimpressions | Heavy users weighted more | Delta method needed |
| User-level | Mean(clicks_i / impressions_i) | Equal per user | Standard methods work |
| User-level (simple) | Mean(clicks_i) | Equal per user, no normalization | Simplest |
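A toy example with two hypothetical users (the numbers are invented for illustration) makes the weighting difference concrete:

```python
import numpy as np

# Two hypothetical users: a light user and a heavy user
clicks = np.array([1, 10])
impressions = np.array([10, 200])

ratio_ctr = clicks.sum() / impressions.sum()    # 11 / 210
user_level_ctr = np.mean(clicks / impressions)  # mean(0.10, 0.05)
clicks_per_user = clicks.mean()                 # mean(1, 10)

print(f"Ratio CTR:       {ratio_ctr:.4f}")       # 0.0524
print(f"User-level CTR:  {user_level_ctr:.4f}")  # 0.0750
print(f"Clicks per user: {clicks_per_user:.1f}") # 5.5
```

The ratio CTR sits close to the heavy user's 5% because that user supplies 200 of the 210 impressions, while the user-level CTR is the plain midpoint of the two users' rates.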
When They Differ
def compare_ratio_vs_user_level():
    """
    Show when ratio and user-level metrics give different answers.
    """
    np.random.seed(42)

    # Scenario: Heavy users have lower CTR

    # Light users: few impressions, high CTR
    n_light = 800
    light_impressions = np.random.poisson(10, n_light)
    light_ctr = 0.15
    light_clicks = np.random.binomial(light_impressions, light_ctr)

    # Heavy users: many impressions, lower CTR
    n_heavy = 200
    heavy_impressions = np.random.poisson(200, n_heavy)
    heavy_ctr = 0.08
    heavy_clicks = np.random.binomial(heavy_impressions, heavy_ctr)

    # Combine
    all_clicks = np.concatenate([light_clicks, heavy_clicks])
    all_impressions = np.concatenate([light_impressions, heavy_impressions])

    # Ratio metric (total CTR)
    ratio_ctr = np.sum(all_clicks) / np.sum(all_impressions)

    # User-level CTR (mean of individual CTRs); users with zero impressions
    # are dropped before dividing to avoid a divide-by-zero warning
    has_impressions = all_impressions > 0
    user_ctrs = all_clicks[has_impressions] / all_impressions[has_impressions]
    user_level_ctr = np.mean(user_ctrs)

    print("Ratio vs. User-Level Metric")
    print("=" * 50)
    print(f"Total impressions from heavy users: {np.sum(heavy_impressions):,}")
    print(f"Total impressions from light users: {np.sum(light_impressions):,}")
    print(f"\nRatio CTR (weighted by impressions): {ratio_ctr:.4f}")
    print(f"User-level CTR (equal weight per user): {user_level_ctr:.4f}")
    print(f"\nDifference: {(ratio_ctr - user_level_ctr):.4f}")
    print("\nRatio is pulled toward heavy users' lower CTR")
    print("User-level reflects the 'typical' user experience")

compare_ratio_vs_user_level()
Recommendation: Use User-Level Metrics
def user_level_comparison(control_df, treatment_df, numerator_col, denominator_col=None):
    """
    Compare groups using user-level metrics.

    If denominator_col is None, just compare means of numerator.
    If denominator_col is provided, compare means of (numerator/denominator).
    """
    if denominator_col:
        # Normalized user-level metric; drop users with a zero denominator first
        ctrl = control_df[control_df[denominator_col] > 0]
        treat = treatment_df[treatment_df[denominator_col] > 0]
        ctrl_metric = ctrl[numerator_col] / ctrl[denominator_col]
        treat_metric = treat[numerator_col] / treat[denominator_col]
    else:
        # Simple user-level metric
        ctrl_metric = control_df[numerator_col]
        treat_metric = treatment_df[numerator_col]

    # Welch's t-test (unequal variances), consistent with the SE below.
    # Treatment is passed first so the sign of t_stat matches the sign of diff.
    t_stat, p_value = stats.ttest_ind(treat_metric, ctrl_metric, equal_var=False)

    # Effect size
    diff = treat_metric.mean() - ctrl_metric.mean()
    se_diff = np.sqrt(ctrl_metric.var() / len(ctrl_metric) +
                      treat_metric.var() / len(treat_metric))

    return {
        'control_mean': ctrl_metric.mean(),
        'treatment_mean': treat_metric.mean(),
        'difference': diff,
        'se': se_diff,
        'ci_lower': diff - 1.96 * se_diff,
        'ci_upper': diff + 1.96 * se_diff,
        'p_value': p_value
    }

# Example
import pandas as pd
np.random.seed(42)
control = pd.DataFrame({
'user_id': range(3000),
'clicks': np.random.poisson(4, 3000),
'impressions': np.random.poisson(50, 3000)
})
treatment = pd.DataFrame({
'user_id': range(3000),
'clicks': np.random.poisson(4.2, 3000), # 5% more clicks
'impressions': np.random.poisson(50, 3000)
})
# Simple user-level: clicks per user
result_simple = user_level_comparison(control, treatment, 'clicks')
print("User-Level Analysis: Clicks per User")
print("=" * 50)
print(f"Control: {result_simple['control_mean']:.3f}")
print(f"Treatment: {result_simple['treatment_mean']:.3f}")
print(f"Difference: {result_simple['difference']:.3f}")
print(f"p-value: {result_simple['p_value']:.4f}")
When to Use Each Approach
Use Ratio Metrics When:
- Business cares about aggregate efficiency: Revenue per impression
- Comparing to industry benchmarks: Usually reported as ratios
- Unit economics matter: Cost per acquisition
Use User-Level Metrics When:
- Running experiments: Aligns with randomization unit
- Want equal user weighting: Fair representation
- Standard statistical methods: No special variance needed
- Interpretability: "Average user clicks X times"
Decision Matrix
| Scenario | Recommendation |
|---|---|
| A/B test, user-randomized | User-level metric |
| A/B test, page-randomized | Ratio with delta method |
| Reporting dashboard | Ratio (standard industry) |
| Heavy user concentration | User-level (avoid bias) |
| All users have equal exposure | Either (equivalent) |
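Whichever row applies, the variance calculation starts from one row per randomization unit. The following is a minimal pandas sketch, assuming a hypothetical impression-level log with columns `user_id`, `group`, and `clicked`. These column names, the simulated data, and the helper `ratio_and_se` are illustrative and not taken from any particular logging schema.

```python
import numpy as np
import pandas as pd

# Hypothetical impression-level log: one row per impression
rng = np.random.default_rng(0)
n_rows = 20_000
log = pd.DataFrame({"user_id": rng.integers(0, 1000, n_rows)})
log["group"] = np.where(log["user_id"] % 2 == 0, "control", "treatment")
log["clicked"] = rng.binomial(1, 0.1, n_rows)

# Collapse to one row per randomization unit (here: the user)
per_user = (
    log.groupby(["group", "user_id"], as_index=False)
       .agg(clicks=("clicked", "sum"), impressions=("clicked", "count"))
)

def ratio_and_se(df):
    """Pooled CTR with a delta-method SE computed over users."""
    ctr = df["clicks"].sum() / df["impressions"].sum()
    residuals = df["clicks"] - ctr * df["impressions"]
    se = residuals.std(ddof=1) / (np.sqrt(len(df)) * df["impressions"].mean())
    return ctr, se

for name, df in per_user.groupby("group"):
    ctr, se = ratio_and_se(df)
    print(f"{name}: CTR = {ctr:.4f}, SE = {se:.4f}, users = {len(df)}")
```

For a page-randomized test, the same pattern would group by the page (or session) identifier instead of `user_id`, so that the rows entering the variance formula match the unit that was randomized.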
R Implementation
library(tidyverse)
# Delta method for ratio
delta_method_se <- function(numerator, denominator) {
n <- length(numerator)
mean_y <- mean(numerator)
mean_x <- mean(denominator)
ratio <- mean_y / mean_x
var_y <- var(numerator)
var_x <- var(denominator)
cov_xy <- cov(numerator, denominator)
var_ratio <- (1 / mean_x^2) * (
var_y - 2 * ratio * cov_xy + ratio^2 * var_x
) / n
list(
ratio = ratio,
se = sqrt(var_ratio)
)
}
# Compare two groups
compare_ratios <- function(num_ctrl, denom_ctrl, num_treat, denom_treat) {
ctrl <- delta_method_se(num_ctrl, denom_ctrl)
treat <- delta_method_se(num_treat, denom_treat)
diff <- treat$ratio - ctrl$ratio
se_diff <- sqrt(ctrl$se^2 + treat$se^2)
z <- diff / se_diff
p_value <- 2 * pnorm(-abs(z))
list(
control = ctrl$ratio,
treatment = treat$ratio,
difference = diff,
se = se_diff,
p_value = p_value
)
}
# User-level alternative
user_level_test <- function(ctrl_values, treat_values) {
t.test(treat_values, ctrl_values)
}
Related Methods
- Metric Distributions (Pillar) - Full distributions overview
- Delta Method vs. Bootstrap - Variance estimation comparison
- Comparing ARPU/ARPPU - Revenue ratio metrics
- A/B Testing Statistical Methods - Experiment design
Key Takeaway
Ratio metrics like CTR require careful variance estimation—the naive binomial formula underestimates standard errors because it ignores the correlation between numerator and denominator. The delta method provides correct standard errors. But often, a simpler approach is better: use user-level metrics (clicks per user, conversions per user) instead of ratio metrics. User-level metrics have straightforward variance, align with experiment randomization, and give each user equal weight. For most A/B tests, user-level metrics are the preferred choice.