Comparing ARPU and ARPPU: Segmentation vs. Modeling Approaches
How to properly analyze revenue per user metrics in A/B tests. Learn the statistical pitfalls of ARPU vs. ARPPU, when to segment, and how to avoid Simpson's paradox.
Quick Hits
- ARPU = Total Revenue / All Users; ARPPU = Total Revenue / Paying Users
- ARPU changes with both conversion and spend-per-payer; ARPPU isolates monetization
- Comparing ARPPU across groups with different conversion rates can mislead
- Simpson's paradox: ARPPU can rise in every segment yet fall overall
- Model conversion and monetization separately, then combine if needed
TL;DR
ARPU (Average Revenue Per User) includes everyone; ARPPU (Average Revenue Per Paying User) conditions on paying. For A/B tests, ARPU is usually preferred because it captures total effect and avoids selection bias. ARPPU can mislead when treatment changes who pays—if treatment converts more low-value users, ARPPU drops even if total revenue rises. This guide covers when to use each, how to segment safely, and how to model the components separately.
The Metrics Defined
ARPU: Average Revenue Per User
$$\text{ARPU} = \frac{\text{Total Revenue}}{\text{Total Users}}$$
- Includes: Everyone, payers and non-payers
- Interpretation: Average revenue generated per user
- Business alignment: Directly tied to total revenue
ARPPU: Average Revenue Per Paying User
$$\text{ARPPU} = \frac{\text{Total Revenue}}{\text{Number of Payers}}$$
- Includes: Only paying users
- Interpretation: Average spend among those who pay
- Use case: Understanding monetization depth
The Relationship
$$\text{ARPU} = \text{Conversion Rate} \times \text{ARPPU}$$
Where: $$\text{Conversion Rate} = \frac{\text{Number of Payers}}{\text{Total Users}}$$
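This identity is worth internalizing, because every ARPPU comparison implicitly holds conversion fixed. A quick numeric check, using made-up numbers:

```python
# Verify ARPU = Conversion Rate x ARPPU (illustrative numbers)
total_users = 10_000
n_payers = 800
total_revenue = 40_000.0

conversion_rate = n_payers / total_users   # who pays
arppu = total_revenue / n_payers           # how much payers spend
arpu = total_revenue / total_users         # the product of the two

assert abs(arpu - conversion_rate * arppu) < 1e-12
print(f"ARPU ${arpu:.2f} = {conversion_rate:.2%} conversion x ${arppu:.2f} ARPPU")
```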
The Problem with Comparing ARPPU
Selection Bias in Action
When treatment affects who pays, ARPPU comparisons are biased.
import numpy as np
import pandas as pd

def simulate_selection_bias():
    """
    Demonstrate how ARPPU can mislead due to selection bias.
    """
    np.random.seed(42)
    n = 10000
    # Two types of users: high-value (would spend $100) and low-value (would spend $20)
    user_type = np.random.choice(['high', 'low'], n, p=[0.3, 0.7])
    # Control: high-value users convert at 5%, low-value at 1%
    # Treatment: both types convert more (high: 7%, low: 4%)
    group = np.random.choice(['control', 'treatment'], n)

    def simulate_revenue(user_type, group):
        if group == 'control':
            if user_type == 'high':
                pays = np.random.random() < 0.05  # 5% conversion
                return 100 if pays else 0
            else:
                pays = np.random.random() < 0.01  # 1% conversion
                return 20 if pays else 0
        else:  # treatment
            if user_type == 'high':
                pays = np.random.random() < 0.07  # 7% conversion
                return 100 if pays else 0
            else:
                pays = np.random.random() < 0.04  # 4% conversion (big increase!)
                return 20 if pays else 0

    revenue = [simulate_revenue(ut, g) for ut, g in zip(user_type, group)]
    data = pd.DataFrame({
        'user_type': user_type,
        'group': group,
        'revenue': revenue
    })
    return data

data = simulate_selection_bias()

# Compute metrics by group
results = data.groupby('group').agg({
    'revenue': ['sum', 'count', lambda x: (x > 0).sum()]
}).reset_index()
results.columns = ['group', 'total_revenue', 'n_users', 'n_payers']
results['ARPU'] = results['total_revenue'] / results['n_users']
results['ARPPU'] = results['total_revenue'] / results['n_payers']
results['conversion'] = results['n_payers'] / results['n_users']

print("ARPU vs ARPPU: Selection Bias Example")
print("=" * 60)
print("\nScenario: Treatment increases conversion, especially for low-value users")
print()
print(f"{'Metric':<20} {'Control':>15} {'Treatment':>15} {'Change':>15}")
print("-" * 60)

ctrl = results[results['group'] == 'control'].iloc[0]
treat = results[results['group'] == 'treatment'].iloc[0]

print(f"{'Total Users':<20} {ctrl['n_users']:>15,.0f} {treat['n_users']:>15,.0f}")
print(f"{'Payers':<20} {ctrl['n_payers']:>15,.0f} {treat['n_payers']:>15,.0f}")
print(f"{'Conversion Rate':<20} {ctrl['conversion']:>15.2%} {treat['conversion']:>15.2%} {(treat['conversion']/ctrl['conversion']-1):>+14.1%}")
print(f"{'Total Revenue':<20} ${ctrl['total_revenue']:>14,.0f} ${treat['total_revenue']:>14,.0f} {(treat['total_revenue']/ctrl['total_revenue']-1):>+14.1%}")
print(f"{'ARPU':<20} ${ctrl['ARPU']:>14.2f} ${treat['ARPU']:>14.2f} {(treat['ARPU']/ctrl['ARPU']-1):>+14.1%}")
print(f"{'ARPPU':<20} ${ctrl['ARPPU']:>14.2f} ${treat['ARPPU']:>14.2f} {(treat['ARPPU']/ctrl['ARPPU']-1):>+14.1%}")

print("\n⚠️ ARPPU decreased even though treatment is clearly better!")
print("   (More revenue, more payers, higher ARPU)")
print("   ARPPU dropped because we added many $20 payers to the denominator.")
Why This Happens
Treatment changed who pays, not how much payers spend:
- More low-value users started paying
- This dilutes ARPPU even as total revenue increases
- ARPPU ↓ while ARPU ↑ and Total Revenue ↑
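The dilution is just a weighted average over the payer mix. Plugging in the expected payer counts implied by the simulation parameters above (control: ~75 high-value and ~35 low-value payers; treatment: ~105 and ~140):

```python
# ARPPU is a weighted average of per-segment spend, weighted by payer mix.
# Payer counts are the expected values implied by the simulation above.
high_spend, low_spend = 100, 20

# Control: payers are mostly high-value
arppu_control = (75 * high_spend + 35 * low_spend) / (75 + 35)

# Treatment: more payers overall, but many new low-value ones
arppu_treatment = (105 * high_spend + 140 * low_spend) / (105 + 140)

print(f"Control ARPPU:   ${arppu_control:.2f}")
print(f"Treatment ARPPU: ${arppu_treatment:.2f}")
assert arppu_treatment < arppu_control  # diluted by the new $20 payers
```

Note that treatment total revenue still rises ($13,300 vs. $8,200 in expectation); only the average over payers falls.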
Simpson's Paradox in Revenue
The Classic Trap
ARPPU can increase in every segment but decrease overall.
def simpsons_paradox_example():
    """
    ARPPU increases in both segments, decreases overall.
    """
    # Segment A: High spenders
    # Segment B: Low spenders
    data = {
        'Segment A': {
            'control': {'payers': 100, 'revenue': 10000},  # ARPPU = $100
            'treatment': {'payers': 90, 'revenue': 9450}   # ARPPU = $105 (+5%)
        },
        'Segment B': {
            'control': {'payers': 50, 'revenue': 1000},    # ARPPU = $20
            'treatment': {'payers': 200, 'revenue': 4400}  # ARPPU = $22 (+10%)
        }
    }

    print("Simpson's Paradox: ARPPU")
    print("=" * 60)

    for segment in ['Segment A', 'Segment B']:
        ctrl = data[segment]['control']
        treat = data[segment]['treatment']
        arppu_ctrl = ctrl['revenue'] / ctrl['payers']
        arppu_treat = treat['revenue'] / treat['payers']
        print(f"\n{segment}:")
        print(f"  Control ARPPU:   ${arppu_ctrl:.2f} ({ctrl['payers']} payers)")
        print(f"  Treatment ARPPU: ${arppu_treat:.2f} ({treat['payers']} payers)")
        print(f"  Change: {(arppu_treat/arppu_ctrl-1):+.1%}")

    # Overall
    ctrl_payers = sum(d['control']['payers'] for d in data.values())
    ctrl_revenue = sum(d['control']['revenue'] for d in data.values())
    treat_payers = sum(d['treatment']['payers'] for d in data.values())
    treat_revenue = sum(d['treatment']['revenue'] for d in data.values())

    print(f"\n{'='*60}")
    print("OVERALL:")
    print(f"  Control ARPPU:   ${ctrl_revenue/ctrl_payers:.2f} ({ctrl_payers} payers)")
    print(f"  Treatment ARPPU: ${treat_revenue/treat_payers:.2f} ({treat_payers} payers)")
    print(f"  Change: {(treat_revenue/treat_payers)/(ctrl_revenue/ctrl_payers)-1:+.1%}")

    print("\n⚠️ ARPPU UP in both segments, DOWN overall!")
    print("   Treatment shifted composition toward lower-spending Segment B")

simpsons_paradox_example()
The Right Approach: Model Components Separately
Two-Part Decomposition
import numpy as np
import pandas as pd
from scipy import stats

def analyze_revenue_components(data, group_col='group', revenue_col='revenue'):
    """
    Analyze conversion and monetization separately.
    """
    results = {}
    for group in data[group_col].unique():
        subset = data[data[group_col] == group]
        n_users = len(subset)
        n_payers = (subset[revenue_col] > 0).sum()
        total_revenue = subset[revenue_col].sum()
        conversion = n_payers / n_users
        arppu = total_revenue / n_payers if n_payers > 0 else 0
        arpu = total_revenue / n_users
        # Revenue distribution among payers
        payer_revenue = subset.loc[subset[revenue_col] > 0, revenue_col]
        results[group] = {
            'n_users': n_users,
            'n_payers': n_payers,
            'conversion': conversion,
            'arppu': arppu,
            'arpu': arpu,
            'total_revenue': total_revenue,
            'payer_revenue_mean': payer_revenue.mean() if len(payer_revenue) > 0 else 0,
            'payer_revenue_std': payer_revenue.std() if len(payer_revenue) > 1 else 0
        }
    return results

def test_components(data, group_col='group', revenue_col='revenue'):
    """
    Separate hypothesis tests for conversion and monetization.
    """
    groups = sorted(data[group_col].unique())
    control = data[data[group_col] == groups[0]]
    treatment = data[data[group_col] == groups[1]]

    # Test 1: Conversion (two-proportion z-test)
    n_c = len(control)
    n_t = len(treatment)
    conv_c = (control[revenue_col] > 0).sum() / n_c
    conv_t = (treatment[revenue_col] > 0).sum() / n_t
    pooled_conv = ((control[revenue_col] > 0).sum() + (treatment[revenue_col] > 0).sum()) / (n_c + n_t)
    se_conv = np.sqrt(pooled_conv * (1 - pooled_conv) * (1/n_c + 1/n_t))
    z_conv = (conv_t - conv_c) / se_conv
    p_conv = 2 * (1 - stats.norm.cdf(abs(z_conv)))

    # Test 2: Monetization among payers (Mann-Whitney for robustness to skew)
    payers_c = control.loc[control[revenue_col] > 0, revenue_col]
    payers_t = treatment.loc[treatment[revenue_col] > 0, revenue_col]
    if len(payers_c) > 1 and len(payers_t) > 1:
        mw_stat, p_monetization = stats.mannwhitneyu(payers_c, payers_t, alternative='two-sided')
        mean_diff = payers_t.mean() - payers_c.mean()
    else:
        p_monetization = np.nan
        mean_diff = np.nan

    # Test 3: Overall ARPU (t-test on all users, zeros included)
    _, p_arpu = stats.ttest_ind(control[revenue_col], treatment[revenue_col])
    arpu_diff = treatment[revenue_col].mean() - control[revenue_col].mean()

    return {
        'conversion': {
            'control': conv_c,
            'treatment': conv_t,
            'lift': (conv_t - conv_c) / conv_c if conv_c > 0 else np.inf,
            'p_value': p_conv
        },
        'monetization': {
            'control_arppu': payers_c.mean() if len(payers_c) > 0 else 0,
            'treatment_arppu': payers_t.mean() if len(payers_t) > 0 else 0,
            'difference': mean_diff,
            'p_value': p_monetization
        },
        'overall': {
            'control_arpu': control[revenue_col].mean(),
            'treatment_arpu': treatment[revenue_col].mean(),
            'difference': arpu_diff,
            'p_value': p_arpu
        }
    }

# Example
np.random.seed(42)
n = 5000

# Simulate data
data = pd.DataFrame({
    'group': np.random.choice(['control', 'treatment'], n),
})

# Control: 10% convert, spend ~$50
# Treatment: 12% convert, spend ~$48 (slightly lower per payer)
def simulate_revenue(group):
    if group == 'control':
        if np.random.random() < 0.10:
            return np.random.lognormal(3.9, 0.5)  # median ≈ $49
        return 0
    else:
        if np.random.random() < 0.12:
            return np.random.lognormal(3.85, 0.5)  # median ≈ $47
        return 0

data['revenue'] = data['group'].apply(simulate_revenue)

# Analyze
print("Component Analysis: Conversion vs. Monetization")
print("=" * 70)
test_results = test_components(data)

print("\n1. CONVERSION (P(pay)):")
print(f"   Control:   {test_results['conversion']['control']:.2%}")
print(f"   Treatment: {test_results['conversion']['treatment']:.2%}")
print(f"   Lift:      {test_results['conversion']['lift']:.1%}")
print(f"   p-value:   {test_results['conversion']['p_value']:.4f}")

print("\n2. MONETIZATION (ARPPU, spend | pay):")
print(f"   Control:    ${test_results['monetization']['control_arppu']:.2f}")
print(f"   Treatment:  ${test_results['monetization']['treatment_arppu']:.2f}")
print(f"   Difference: ${test_results['monetization']['difference']:.2f}")
print(f"   p-value:    {test_results['monetization']['p_value']:.4f}")

print("\n3. OVERALL (ARPU):")
print(f"   Control:    ${test_results['overall']['control_arpu']:.2f}")
print(f"   Treatment:  ${test_results['overall']['treatment_arpu']:.2f}")
print(f"   Difference: ${test_results['overall']['difference']:.2f}")
print(f"   p-value:    {test_results['overall']['p_value']:.4f}")

print("\nInterpretation:")
print("  Treatment significantly increased conversion (+20% relative)")
print("  Treatment slightly decreased ARPPU (not significant)")
print("  Net effect on ARPU is positive")
When to Segment (Safely)
Safe Segmentation
Segment on characteristics fixed before treatment:
- User tenure (at assignment)
- Geographic region
- Device type
- Historical spending tier
Dangerous Segmentation
Segment on characteristics affected by treatment:
- Post-treatment engagement level
- Whether they made a purchase (conditioning on outcome!)
- Feature usage during experiment
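A cheap pre-flight check: if a candidate segmentation variable is truly fixed before treatment, its composition should be independent of assignment. The sketch below (illustrative counts, `scipy` assumed available as elsewhere in this guide) runs a chi-square test of that independence; a significant result suggests the "segment" is treatment-affected, or that there is a sample-ratio mismatch, and conditioning on it is unsafe.

```python
# Check that segment membership is independent of treatment assignment
# before segmenting on it. Counts below are illustrative.
import numpy as np
from scipy import stats

# rows = segments, columns = (control, treatment) user counts
counts = np.array([
    [1210, 1190],   # New users
    [1805, 1795],   # Existing users
])

chi2, p_value, dof, expected = stats.chi2_contingency(counts)
print(f"chi2 = {chi2:.3f}, p = {p_value:.3f}")
if p_value < 0.01:
    print("Segment composition differs by group; do not segment on this variable.")
else:
    print("No evidence that segment membership depends on assignment.")
```

This is the same logic as a standard sample-ratio-mismatch check, applied within segments rather than overall.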
Code: Pre-Stratified Analysis
def stratified_analysis(data, group_col, revenue_col, segment_col):
    """
    Analyze revenue with pre-defined segments.
    """
    results = []
    for segment in data[segment_col].unique():
        seg_data = data[data[segment_col] == segment]
        for group in data[group_col].unique():
            subset = seg_data[seg_data[group_col] == group]
            results.append({
                'segment': segment,
                'group': group,
                'n_users': len(subset),
                'n_payers': (subset[revenue_col] > 0).sum(),
                'total_revenue': subset[revenue_col].sum(),
                'arpu': subset[revenue_col].mean(),
                'conversion': (subset[revenue_col] > 0).mean()
            })
    return pd.DataFrame(results)

# Example with pre-defined segments
np.random.seed(42)
n = 6000
data = pd.DataFrame({
    'user_segment': np.random.choice(['New', 'Existing'], n, p=[0.4, 0.6]),
    'group': np.random.choice(['control', 'treatment'], n)
})

# Different effects by segment
def simulate_segmented_revenue(row):
    if row['user_segment'] == 'New':
        base_conv = 0.05
        base_spend = 30
        treatment_conv_lift = 0.50  # 50% conversion lift for new users
    else:
        base_conv = 0.15
        base_spend = 80
        treatment_conv_lift = 0.10  # 10% lift for existing users
    conv = base_conv * (1 + treatment_conv_lift * (row['group'] == 'treatment'))
    if np.random.random() < conv:
        return np.random.lognormal(np.log(base_spend), 0.5)
    return 0

data['revenue'] = data.apply(simulate_segmented_revenue, axis=1)

# Stratified analysis
strat_results = stratified_analysis(data, 'group', 'revenue', 'user_segment')
print("\nStratified Analysis by User Segment")
print("=" * 70)
print(strat_results.pivot_table(
    index='segment',
    columns='group',
    values=['n_users', 'conversion', 'arpu'],
    aggfunc='first'
).round(3))
Decision Framework
START: Analyzing revenue in A/B test
↓
QUESTION: What's your primary metric?
├── Total revenue impact → Use ARPU
├── Understanding monetization → Continue
└── Both → Analyze separately
↓
QUESTION: Does treatment affect who converts?
├── Yes → ARPPU is biased, use ARPU or two-part model
└── No or unlikely → ARPPU comparison is valid
↓
QUESTION: Do you want to segment?
├── By pre-treatment characteristics → Safe
├── By post-treatment characteristics → Dangerous
└── By treatment-affected characteristics → Don't
↓
REPORT:
1. Primary: ARPU (total revenue efficiency)
2. Secondary: Conversion rate (P(pay))
3. Exploratory: ARPPU (with caveats about selection)
4. Decomposition: ARPU = Conversion × ARPPU
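Because ARPU = Conversion × ARPPU, relative lifts multiply: (1 + ARPU lift) = (1 + conversion lift) × (1 + ARPPU lift). The helper below (hypothetical numbers) makes the decomposition in step 4 concrete:

```python
# Decompose an observed ARPU lift into conversion and ARPPU contributions.
# Since ARPU = conversion x ARPPU, relative lifts multiply:
#   (1 + ARPU lift) = (1 + conversion lift) x (1 + ARPPU lift)

def decompose_arpu_lift(conv_c, conv_t, arppu_c, arppu_t):
    arpu_c = conv_c * arppu_c
    arpu_t = conv_t * arppu_t
    return {
        'conversion_lift': conv_t / conv_c - 1,
        'arppu_lift': arppu_t / arppu_c - 1,
        'arpu_lift': arpu_t / arpu_c - 1,
    }

# Hypothetical experiment: conversion up 20%, per-payer spend down 4%
lifts = decompose_arpu_lift(conv_c=0.10, conv_t=0.12, arppu_c=50.0, arppu_t=48.0)

# Lifts multiply back to the overall ARPU lift
check = (1 + lifts['conversion_lift']) * (1 + lifts['arppu_lift']) - 1
assert abs(check - lifts['arpu_lift']) < 1e-12
print({k: f"{v:+.1%}" for k, v in lifts.items()})
```

Reporting all three lifts together makes it obvious when an ARPPU drop is just composition, not worse monetization.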
Reporting Template
## Revenue Analysis
### Primary Metric: ARPU
- Control: $X.XX
- Treatment: $Y.YY
- Lift: +Z.Z% (95% CI: A% to B%)
- p-value: 0.XXX
### Component Decomposition
| Component | Control | Treatment | Lift | p-value |
|-----------|---------|-----------|------|---------|
| Conversion Rate | X% | Y% | +Z% | 0.XXX |
| ARPPU | $X | $Y | +Z% | 0.XXX |
### Interpretation
Treatment increased total revenue per user by Z%. This was driven primarily
by [conversion/monetization/both]:
- Conversion [increased/decreased] by X%, meaning [more/fewer] users made purchases
- ARPPU [increased/decreased] by Y%, meaning payers [spent more/spent less] on average
Note: ARPPU comparison should be interpreted cautiously as treatment affected
conversion rates, changing the composition of payers.
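The template asks for a 95% CI on the ARPU lift. One assumption-light way to get it is a percentile bootstrap on the relative lift, which handles the zero-inflated revenue distribution directly. A sketch on simulated data:

```python
# Percentile bootstrap CI for relative ARPU lift on zero-inflated revenue.
import numpy as np

rng = np.random.default_rng(0)
n = 4000
# Simulated per-user revenue: ~10% vs ~12% payers, lognormal spend
control = np.where(rng.random(n) < 0.10, rng.lognormal(3.9, 0.5, n), 0.0)
treatment = np.where(rng.random(n) < 0.12, rng.lognormal(3.85, 0.5, n), 0.0)

def bootstrap_arpu_lift_ci(control, treatment, n_boot=2000, alpha=0.05, seed=1):
    rng = np.random.default_rng(seed)
    lifts = np.empty(n_boot)
    for i in range(n_boot):
        c = rng.choice(control, size=control.size, replace=True)
        t = rng.choice(treatment, size=treatment.size, replace=True)
        lifts[i] = t.mean() / c.mean() - 1
    return np.quantile(lifts, [alpha / 2, 1 - alpha / 2])

lo, hi = bootstrap_arpu_lift_ci(control, treatment)
point = treatment.mean() / control.mean() - 1
print(f"ARPU lift: {point:+.1%} (95% CI: {lo:+.1%} to {hi:+.1%})")
```

The same routine works for ARPPU by bootstrapping only the payer subsets, with the usual caveat that the payer populations may not be comparable.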
Related Methods
- Metric Distributions (Pillar) - Full distributions overview
- Why Revenue Is Hard - Revenue challenges
- Dealing with Zeros - Zero-handling approaches
- Ratio Metrics - Ratio metric pitfalls
Key Takeaway
ARPU measures what your business cares about: total revenue efficiency. ARPPU measures monetization depth but is vulnerable to selection bias when treatment changes who pays. For A/B tests, prefer ARPU as the primary metric. If you analyze ARPPU, acknowledge that composition changes can make the comparison misleading. The safest approach: decompose into conversion (who pays) and monetization (how much they spend), test each separately, and only compare ARPPU when you're confident treatment doesn't affect the payer population composition.
References
- https://doi.org/10.1287/mksc.2018.1092
- https://www.kdd.org/kdd2016/papers/files/Paper_573.pdf
- https://exp-platform.com/Documents/2017-08%20KDDMetricInterpretationPitfalls.pdf
- Deng, A., & Shi, X. (2016). Data-driven metric development for online controlled experiments. *KDD*, 77-86.
- Kohavi, R., Deng, A., Longbotham, R., & Xu, Y. (2014). Seven pitfalls to avoid when running controlled experiments on the web. *KDD*, 1105-1114.
- Richardson, A., & Dominowska, E. (2017). Metric interpretation pitfalls in online controlled experiments. *KDD Workshop*.