Comparing ARPU and ARPPU: Segmentation vs. Modeling Approaches
How to properly analyze revenue per user metrics in A/B tests. Learn the statistical pitfalls of ARPU vs. ARPPU, when to segment, and how to avoid Simpson's paradox.
Quick Hits
- ARPU = Total Revenue / All Users; ARPPU = Total Revenue / Paying Users
- ARPU changes with both conversion and spend-per-payer; ARPPU isolates monetization
- Comparing ARPPU across groups with different conversion rates can mislead
- Simpson's paradox: ARPPU can rise in every segment yet fall overall
- Model conversion and monetization separately, then combine if needed
TL;DR
ARPU (Average Revenue Per User) includes everyone; ARPPU (Average Revenue Per Paying User) conditions on paying. For A/B tests, ARPU is usually preferred because it captures total effect and avoids selection bias. ARPPU can mislead when treatment changes who pays—if treatment converts more low-value users, ARPPU drops even if total revenue rises. This guide covers when to use each, how to segment safely, and how to model the components separately.
The Metrics Defined
ARPU: Average Revenue Per User
$$\text{ARPU} = \frac{\text{Total Revenue}}{\text{Total Users}}$$
- Includes: Everyone, payers and non-payers
- Interpretation: Average revenue generated per user
- Business alignment: Directly tied to total revenue
ARPPU: Average Revenue Per Paying User
$$\text{ARPPU} = \frac{\text{Total Revenue}}{\text{Number of Payers}}$$
- Includes: Only paying users
- Interpretation: Average spend among those who pay
- Use case: Understanding monetization depth
The Relationship
$$\text{ARPU} = \text{Conversion Rate} \times \text{ARPPU}$$
Where: $$\text{Conversion Rate} = \frac{\text{Number of Payers}}{\text{Total Users}}$$
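This identity is worth internalizing, because every ARPPU comparison implicitly holds conversion fixed. A quick numeric check, using made-up numbers:

```python
# Verify ARPU = Conversion Rate x ARPPU (illustrative numbers)
total_users = 10_000
n_payers = 800
total_revenue = 40_000.0

conversion_rate = n_payers / total_users   # who pays
arppu = total_revenue / n_payers           # how much payers spend
arpu = total_revenue / total_users         # the product of the two

assert abs(arpu - conversion_rate * arppu) < 1e-12
print(f"ARPU ${arpu:.2f} = {conversion_rate:.2%} conversion x ${arppu:.2f} ARPPU")
```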
The Problem with Comparing ARPPU
Selection Bias in Action
When treatment affects who pays, ARPPU comparisons are biased.
import numpy as np
import pandas as pd

def simulate_selection_bias():
    """
    Demonstrate how ARPPU can mislead due to selection bias.
    """
    np.random.seed(42)
    n = 10000
    # Two types of users: high-value (would spend $100) and low-value (would spend $20)
    user_type = np.random.choice(['high', 'low'], n, p=[0.3, 0.7])
    # Control: high-value users convert at 5%, low-value at 1%
    # Treatment: both types convert more (high: 7%, low: 4%)
    group = np.random.choice(['control', 'treatment'], n)

    def simulate_revenue(user_type, group):
        if group == 'control':
            if user_type == 'high':
                pays = np.random.random() < 0.05  # 5% conversion
                return 100 if pays else 0
            else:
                pays = np.random.random() < 0.01  # 1% conversion
                return 20 if pays else 0
        else:  # treatment
            if user_type == 'high':
                pays = np.random.random() < 0.07  # 7% conversion
                return 100 if pays else 0
            else:
                pays = np.random.random() < 0.04  # 4% conversion (big increase!)
                return 20 if pays else 0

    revenue = [simulate_revenue(ut, g) for ut, g in zip(user_type, group)]
    data = pd.DataFrame({
        'user_type': user_type,
        'group': group,
        'revenue': revenue
    })
    return data

data = simulate_selection_bias()

# Compute metrics by group
results = data.groupby('group').agg({
    'revenue': ['sum', 'count', lambda x: (x > 0).sum()]
}).reset_index()
results.columns = ['group', 'total_revenue', 'n_users', 'n_payers']
results['ARPU'] = results['total_revenue'] / results['n_users']
results['ARPPU'] = results['total_revenue'] / results['n_payers']
results['conversion'] = results['n_payers'] / results['n_users']

print("ARPU vs ARPPU: Selection Bias Example")
print("=" * 60)
print("\nScenario: Treatment increases conversion, especially for low-value users")
print()
print(f"{'Metric':<20} {'Control':>15} {'Treatment':>15} {'Change':>15}")
print("-" * 60)

ctrl = results[results['group'] == 'control'].iloc[0]
treat = results[results['group'] == 'treatment'].iloc[0]

print(f"{'Total Users':<20} {ctrl['n_users']:>15,.0f} {treat['n_users']:>15,.0f}")
print(f"{'Payers':<20} {ctrl['n_payers']:>15,.0f} {treat['n_payers']:>15,.0f}")
print(f"{'Conversion Rate':<20} {ctrl['conversion']:>15.2%} {treat['conversion']:>15.2%} {(treat['conversion']/ctrl['conversion']-1):>+14.1%}")
print(f"{'Total Revenue':<20} ${ctrl['total_revenue']:>14,.0f} ${treat['total_revenue']:>14,.0f} {(treat['total_revenue']/ctrl['total_revenue']-1):>+14.1%}")
print(f"{'ARPU':<20} ${ctrl['ARPU']:>14.2f} ${treat['ARPU']:>14.2f} {(treat['ARPU']/ctrl['ARPU']-1):>+14.1%}")
print(f"{'ARPPU':<20} ${ctrl['ARPPU']:>14.2f} ${treat['ARPPU']:>14.2f} {(treat['ARPPU']/ctrl['ARPPU']-1):>+14.1%}")

print("\n⚠️ ARPPU decreased even though treatment is clearly better!")
print("   (More revenue, more payers, higher ARPU)")
print("   ARPPU dropped because we added many $20 payers to the denominator.")
Why This Happens
Treatment changed who pays, not how much payers spend:
- More low-value users started paying
- This dilutes ARPPU even as total revenue increases
- ARPPU ↓ while ARPU ↑ and Total Revenue ↑
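The dilution is just a weighted average over the payer mix. Plugging in the expected payer counts implied by the simulation parameters above (control: ~75 high-value and ~35 low-value payers; treatment: ~105 and ~140):

```python
# ARPPU is a weighted average of per-segment spend, weighted by payer mix.
# Payer counts are the expected values implied by the simulation above.
high_spend, low_spend = 100, 20

# Control: payers are mostly high-value
arppu_control = (75 * high_spend + 35 * low_spend) / (75 + 35)

# Treatment: more payers overall, but many new low-value ones
arppu_treatment = (105 * high_spend + 140 * low_spend) / (105 + 140)

print(f"Control ARPPU:   ${arppu_control:.2f}")
print(f"Treatment ARPPU: ${arppu_treatment:.2f}")
assert arppu_treatment < arppu_control  # diluted by the new $20 payers
```

Note that treatment total revenue still rises ($13,300 vs. $8,200 in expectation); only the average over payers falls.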
Simpson's Paradox in Revenue
The Classic Trap
ARPPU can increase in every segment but decrease overall.
def simpsons_paradox_example():
    """
    ARPPU increases in both segments, decreases overall.
    """
    # Segment A: High spenders
    # Segment B: Low spenders
    data = {
        'Segment A': {
            'control': {'payers': 100, 'revenue': 10000},  # ARPPU = $100
            'treatment': {'payers': 90, 'revenue': 9450}   # ARPPU = $105 (+5%)
        },
        'Segment B': {
            'control': {'payers': 50, 'revenue': 1000},    # ARPPU = $20
            'treatment': {'payers': 200, 'revenue': 4400}  # ARPPU = $22 (+10%)
        }
    }

    print("Simpson's Paradox: ARPPU")
    print("=" * 60)

    for segment in ['Segment A', 'Segment B']:
        ctrl = data[segment]['control']
        treat = data[segment]['treatment']
        arppu_ctrl = ctrl['revenue'] / ctrl['payers']
        arppu_treat = treat['revenue'] / treat['payers']
        print(f"\n{segment}:")
        print(f"  Control ARPPU:   ${arppu_ctrl:.2f} ({ctrl['payers']} payers)")
        print(f"  Treatment ARPPU: ${arppu_treat:.2f} ({treat['payers']} payers)")
        print(f"  Change: {(arppu_treat/arppu_ctrl-1):+.1%}")

    # Overall
    ctrl_payers = sum(d['control']['payers'] for d in data.values())
    ctrl_revenue = sum(d['control']['revenue'] for d in data.values())
    treat_payers = sum(d['treatment']['payers'] for d in data.values())
    treat_revenue = sum(d['treatment']['revenue'] for d in data.values())

    print(f"\n{'='*60}")
    print("OVERALL:")
    print(f"  Control ARPPU:   ${ctrl_revenue/ctrl_payers:.2f} ({ctrl_payers} payers)")
    print(f"  Treatment ARPPU: ${treat_revenue/treat_payers:.2f} ({treat_payers} payers)")
    print(f"  Change: {(treat_revenue/treat_payers)/(ctrl_revenue/ctrl_payers)-1:+.1%}")

    print("\n⚠️ ARPPU UP in both segments, DOWN overall!")
    print("   Treatment shifted composition toward lower-spending Segment B")

simpsons_paradox_example()
The Right Approach: Model Components Separately
Two-Part Decomposition
import numpy as np
import pandas as pd
from scipy import stats

def analyze_revenue_components(data, group_col='group', revenue_col='revenue'):
    """
    Analyze conversion and monetization separately.
    """
    results = {}
    for group in data[group_col].unique():
        subset = data[data[group_col] == group]
        n_users = len(subset)
        n_payers = (subset[revenue_col] > 0).sum()
        total_revenue = subset[revenue_col].sum()
        conversion = n_payers / n_users
        arppu = total_revenue / n_payers if n_payers > 0 else 0
        arpu = total_revenue / n_users
        # Revenue distribution among payers
        payer_revenue = subset.loc[subset[revenue_col] > 0, revenue_col]
        results[group] = {
            'n_users': n_users,
            'n_payers': n_payers,
            'conversion': conversion,
            'arppu': arppu,
            'arpu': arpu,
            'total_revenue': total_revenue,
            'payer_revenue_mean': payer_revenue.mean() if len(payer_revenue) > 0 else 0,
            'payer_revenue_std': payer_revenue.std() if len(payer_revenue) > 1 else 0
        }
    return results

def test_components(data, group_col='group', revenue_col='revenue'):
    """
    Separate hypothesis tests for conversion and monetization.
    """
    groups = sorted(data[group_col].unique())
    control = data[data[group_col] == groups[0]]
    treatment = data[data[group_col] == groups[1]]

    # Test 1: Conversion (two-proportion z-test)
    n_c = len(control)
    n_t = len(treatment)
    conv_c = (control[revenue_col] > 0).sum() / n_c
    conv_t = (treatment[revenue_col] > 0).sum() / n_t
    pooled_conv = ((control[revenue_col] > 0).sum() + (treatment[revenue_col] > 0).sum()) / (n_c + n_t)
    se_conv = np.sqrt(pooled_conv * (1 - pooled_conv) * (1/n_c + 1/n_t))
    z_conv = (conv_t - conv_c) / se_conv
    p_conv = 2 * (1 - stats.norm.cdf(abs(z_conv)))

    # Test 2: Monetization among payers (Mann-Whitney for robustness to skew)
    payers_c = control.loc[control[revenue_col] > 0, revenue_col]
    payers_t = treatment.loc[treatment[revenue_col] > 0, revenue_col]
    if len(payers_c) > 1 and len(payers_t) > 1:
        mw_stat, p_monetization = stats.mannwhitneyu(payers_c, payers_t, alternative='two-sided')
        mean_diff = payers_t.mean() - payers_c.mean()
    else:
        p_monetization = np.nan
        mean_diff = np.nan

    # Test 3: Overall ARPU (t-test on all users, zeros included)
    _, p_arpu = stats.ttest_ind(control[revenue_col], treatment[revenue_col])
    arpu_diff = treatment[revenue_col].mean() - control[revenue_col].mean()

    return {
        'conversion': {
            'control': conv_c,
            'treatment': conv_t,
            'lift': (conv_t - conv_c) / conv_c if conv_c > 0 else np.inf,
            'p_value': p_conv
        },
        'monetization': {
            'control_arppu': payers_c.mean() if len(payers_c) > 0 else 0,
            'treatment_arppu': payers_t.mean() if len(payers_t) > 0 else 0,
            'difference': mean_diff,
            'p_value': p_monetization
        },
        'overall': {
            'control_arpu': control[revenue_col].mean(),
            'treatment_arpu': treatment[revenue_col].mean(),
            'difference': arpu_diff,
            'p_value': p_arpu
        }
    }

# Example
np.random.seed(42)
n = 5000

# Simulate data
data = pd.DataFrame({
    'group': np.random.choice(['control', 'treatment'], n),
})

# Control: 10% convert, spend ~$50
# Treatment: 12% convert, spend ~$48 (slightly lower per payer)
def simulate_revenue(group):
    if group == 'control':
        if np.random.random() < 0.10:
            return np.random.lognormal(3.9, 0.5)  # median ≈ $49
        return 0
    else:
        if np.random.random() < 0.12:
            return np.random.lognormal(3.85, 0.5)  # median ≈ $47
        return 0

data['revenue'] = data['group'].apply(simulate_revenue)

# Analyze
print("Component Analysis: Conversion vs. Monetization")
print("=" * 70)
test_results = test_components(data)

print("\n1. CONVERSION (P(pay)):")
print(f"   Control:   {test_results['conversion']['control']:.2%}")
print(f"   Treatment: {test_results['conversion']['treatment']:.2%}")
print(f"   Lift:      {test_results['conversion']['lift']:.1%}")
print(f"   p-value:   {test_results['conversion']['p_value']:.4f}")

print("\n2. MONETIZATION (ARPPU, spend | pay):")
print(f"   Control:    ${test_results['monetization']['control_arppu']:.2f}")
print(f"   Treatment:  ${test_results['monetization']['treatment_arppu']:.2f}")
print(f"   Difference: ${test_results['monetization']['difference']:.2f}")
print(f"   p-value:    {test_results['monetization']['p_value']:.4f}")

print("\n3. OVERALL (ARPU):")
print(f"   Control:    ${test_results['overall']['control_arpu']:.2f}")
print(f"   Treatment:  ${test_results['overall']['treatment_arpu']:.2f}")
print(f"   Difference: ${test_results['overall']['difference']:.2f}")
print(f"   p-value:    {test_results['overall']['p_value']:.4f}")

print("\nInterpretation:")
print("  Treatment significantly increased conversion (+20% relative)")
print("  Treatment slightly decreased ARPPU (not significant)")
print("  Net effect on ARPU is positive")
When to Segment (Safely)
Safe Segmentation
Segment on characteristics fixed before treatment:
- User tenure (at assignment)
- Geographic region
- Device type
- Historical spending tier
Dangerous Segmentation
Segment on characteristics affected by treatment:
- Post-treatment engagement level
- Whether they made a purchase (conditioning on outcome!)
- Feature usage during experiment
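A cheap pre-flight check: if a candidate segmentation variable is truly fixed before treatment, its composition should be independent of assignment. The sketch below (illustrative counts, `scipy` assumed available as elsewhere in this guide) runs a chi-square test of that independence; a significant result suggests the "segment" is treatment-affected, or that there is a sample-ratio mismatch, and conditioning on it is unsafe.

```python
# Check that segment membership is independent of treatment assignment
# before segmenting on it. Counts below are illustrative.
import numpy as np
from scipy import stats

# rows = segments, columns = (control, treatment) user counts
counts = np.array([
    [1210, 1190],   # New users
    [1805, 1795],   # Existing users
])

chi2, p_value, dof, expected = stats.chi2_contingency(counts)
print(f"chi2 = {chi2:.3f}, p = {p_value:.3f}")
if p_value < 0.01:
    print("Segment composition differs by group; do not segment on this variable.")
else:
    print("No evidence that segment membership depends on assignment.")
```

This is the same logic as a standard sample-ratio-mismatch check, applied within segments rather than overall.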
Code: Pre-Stratified Analysis
def stratified_analysis(data, group_col, revenue_col, segment_col):
    """
    Analyze revenue with pre-defined segments.
    """
    results = []
    for segment in data[segment_col].unique():
        seg_data = data[data[segment_col] == segment]
        for group in data[group_col].unique():
            subset = seg_data[seg_data[group_col] == group]
            results.append({
                'segment': segment,
                'group': group,
                'n_users': len(subset),
                'n_payers': (subset[revenue_col] > 0).sum(),
                'total_revenue': subset[revenue_col].sum(),
                'arpu': subset[revenue_col].mean(),
                'conversion': (subset[revenue_col] > 0).mean()
            })
    return pd.DataFrame(results)

# Example with pre-defined segments
np.random.seed(42)
n = 6000
data = pd.DataFrame({
    'user_segment': np.random.choice(['New', 'Existing'], n, p=[0.4, 0.6]),
    'group': np.random.choice(['control', 'treatment'], n)
})

# Different effects by segment
def simulate_segmented_revenue(row):
    if row['user_segment'] == 'New':
        base_conv = 0.05
        base_spend = 30
        treatment_conv_lift = 0.50  # 50% conversion lift for new users
    else:
        base_conv = 0.15
        base_spend = 80
        treatment_conv_lift = 0.10  # 10% lift for existing users
    conv = base_conv * (1 + treatment_conv_lift * (row['group'] == 'treatment'))
    if np.random.random() < conv:
        return np.random.lognormal(np.log(base_spend), 0.5)
    return 0

data['revenue'] = data.apply(simulate_segmented_revenue, axis=1)

# Stratified analysis
strat_results = stratified_analysis(data, 'group', 'revenue', 'user_segment')
print("\nStratified Analysis by User Segment")
print("=" * 70)
print(strat_results.pivot_table(
    index='segment',
    columns='group',
    values=['n_users', 'conversion', 'arpu'],
    aggfunc='first'
).round(3))
Decision Framework
START: Analyzing revenue in A/B test
↓
QUESTION: What's your primary metric?
├── Total revenue impact → Use ARPU
├── Understanding monetization → Continue
└── Both → Analyze separately
↓
QUESTION: Does treatment affect who converts?
├── Yes → ARPPU is biased, use ARPU or two-part model
└── No or unlikely → ARPPU comparison is valid
↓
QUESTION: Do you want to segment?
├── By pre-treatment characteristics → Safe
├── By post-treatment characteristics → Dangerous
└── By treatment-affected characteristics → Don't
↓
REPORT:
1. Primary: ARPU (total revenue efficiency)
2. Secondary: Conversion rate (P(pay))
3. Exploratory: ARPPU (with caveats about selection)
4. Decomposition: ARPU = Conversion × ARPPU
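Because ARPU = Conversion × ARPPU, relative lifts multiply: (1 + ARPU lift) = (1 + conversion lift) × (1 + ARPPU lift). The helper below (hypothetical numbers) makes the decomposition in step 4 concrete:

```python
# Decompose an observed ARPU lift into conversion and ARPPU contributions.
# Since ARPU = conversion x ARPPU, relative lifts multiply:
#   (1 + ARPU lift) = (1 + conversion lift) x (1 + ARPPU lift)

def decompose_arpu_lift(conv_c, conv_t, arppu_c, arppu_t):
    arpu_c = conv_c * arppu_c
    arpu_t = conv_t * arppu_t
    return {
        'conversion_lift': conv_t / conv_c - 1,
        'arppu_lift': arppu_t / arppu_c - 1,
        'arpu_lift': arpu_t / arpu_c - 1,
    }

# Hypothetical experiment: conversion up 20%, per-payer spend down 4%
lifts = decompose_arpu_lift(conv_c=0.10, conv_t=0.12, arppu_c=50.0, arppu_t=48.0)

# Lifts multiply back to the overall ARPU lift
check = (1 + lifts['conversion_lift']) * (1 + lifts['arppu_lift']) - 1
assert abs(check - lifts['arpu_lift']) < 1e-12
print({k: f"{v:+.1%}" for k, v in lifts.items()})
```

Reporting all three lifts together makes it obvious when an ARPPU drop is just composition, not worse monetization.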
Reporting Template
## Revenue Analysis
### Primary Metric: ARPU
- Control: $X.XX
- Treatment: $Y.YY
- Lift: +Z.Z% (95% CI: A% to B%)
- p-value: 0.XXX
### Component Decomposition
| Component | Control | Treatment | Lift | p-value |
|-----------|---------|-----------|------|---------|
| Conversion Rate | X% | Y% | +Z% | 0.XXX |
| ARPPU | $X | $Y | +Z% | 0.XXX |
### Interpretation
Treatment increased total revenue per user by Z%. This was driven primarily
by [conversion/monetization/both]:
- Conversion [increased/decreased] by X%, meaning [more/fewer] users made purchases
- ARPPU [increased/decreased] by Y%, meaning payers [spent more/spent less] on average
Note: ARPPU comparison should be interpreted cautiously as treatment affected
conversion rates, changing the composition of payers.
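The template asks for a 95% CI on the ARPU lift. One assumption-light way to get it is a percentile bootstrap on the relative lift, which handles the zero-inflated revenue distribution directly. A sketch on simulated data:

```python
# Percentile bootstrap CI for relative ARPU lift on zero-inflated revenue.
import numpy as np

rng = np.random.default_rng(0)
n = 4000
# Simulated per-user revenue: ~10% vs ~12% payers, lognormal spend
control = np.where(rng.random(n) < 0.10, rng.lognormal(3.9, 0.5, n), 0.0)
treatment = np.where(rng.random(n) < 0.12, rng.lognormal(3.85, 0.5, n), 0.0)

def bootstrap_arpu_lift_ci(control, treatment, n_boot=2000, alpha=0.05, seed=1):
    rng = np.random.default_rng(seed)
    lifts = np.empty(n_boot)
    for i in range(n_boot):
        c = rng.choice(control, size=control.size, replace=True)
        t = rng.choice(treatment, size=treatment.size, replace=True)
        lifts[i] = t.mean() / c.mean() - 1
    return np.quantile(lifts, [alpha / 2, 1 - alpha / 2])

lo, hi = bootstrap_arpu_lift_ci(control, treatment)
point = treatment.mean() / control.mean() - 1
print(f"ARPU lift: {point:+.1%} (95% CI: {lo:+.1%} to {hi:+.1%})")
```

The same routine works for ARPPU by bootstrapping only the payer subsets, with the usual caveat that the payer populations may not be comparable.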
Related Methods
- Metric Distributions (Pillar) - Full distributions overview
- Why Revenue Is Hard - Revenue challenges
- Dealing with Zeros - Zero-handling approaches
- Ratio Metrics - Ratio metric pitfalls
Key Takeaway
ARPU measures what your business cares about: total revenue efficiency. ARPPU measures monetization depth but is vulnerable to selection bias when treatment changes who pays. For A/B tests, prefer ARPU as the primary metric. If you analyze ARPPU, acknowledge that composition changes can make the comparison misleading. The safest approach: decompose into conversion (who pays) and monetization (how much they spend), test each separately, and only compare ARPPU when you're confident treatment doesn't affect the payer population composition.
References
- https://doi.org/10.1287/mksc.2018.1092
- https://www.kdd.org/kdd2016/papers/files/Paper_573.pdf
- https://exp-platform.com/Documents/2017-08%20KDDMetricInterpretationPitfalls.pdf
- Deng, A., & Shi, X. (2016). Data-driven metric development for online controlled experiments. *KDD*, 77-86.
- Kohavi, R., Deng, A., Longbotham, R., & Xu, Y. (2014). Seven pitfalls to avoid when running controlled experiments on the web. *KDD*, 1105-1114.
- Richardson, A., & Dominowska, E. (2017). Metric interpretation pitfalls in online controlled experiments. *KDD Workshop*.