
Dealing with Zeros: Zero-Inflated and Two-Part Models

How to handle metrics with many zeros—revenue from non-purchasers, engagement from inactive users, events that didn't happen. Learn when to use zero-inflated models, two-part models, and simpler alternatives.


Quick Hits

  • Two populations: structural zeros (will never purchase) vs. sampling zeros (could but didn't)
  • Two-part models: separate probability of any value from amount given positive
  • Zero-inflated models: mixture of always-zero and standard distribution
  • Often simpler approaches work: analyze non-zeros separately, or just accept the variance
  • Choice depends on whether you care about total effect or conditional effect

TL;DR

Many product metrics have lots of zeros: revenue (non-purchasers), engagement (inactive users), support tickets (no issues). Standard methods struggle because zeros represent a distinct population, not just small values. Two-part models split the problem: model probability of any positive value, then model the amount given positive. Zero-inflated models treat some zeros as "structural" (will never be positive) and others as "sampling" (could be positive but weren't). Often, simpler approaches work: analyze positives separately, or use robust methods that handle the variance.


Why Zeros Are Different

Two Types of Zeros

Structural zeros: User fundamentally won't have this behavior

  • Free users who can't purchase (paywall)
  • Users in regions without the feature
  • Bot accounts (no real engagement)

Sampling zeros: User could have the behavior but didn't in this period

  • Purchaser who didn't buy this week
  • Active user who didn't engage today
  • Customer who had no support issues

Why This Matters

Standard models (linear regression, t-tests) treat all zeros the same—as small values. But:

  • Structural zeros shouldn't be "converted" by treatment
  • Combining populations inflates variance
  • Mean is pulled toward zero, masking effects among actives

The Math Problem

For revenue with 80% zeros:

  • Mean = 0.2 × (mean among buyers)
  • Variance is huge (zero vs. non-zero dominates)
  • Standard error is enormous → low power
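
To put numbers on this, here is a small back-of-the-envelope sketch (illustrative values, not from any real dataset). It compares the relative standard error of the overall mean against a hypothetical world where every user purchases, and translates the gap into a sample-size penalty:

import numpy as np

np.random.seed(0)
n = 100_000
p_buy = 0.20                          # 20% of users purchase
buyers = np.random.binomial(1, p_buy, n)
spend = np.random.lognormal(3, 1, n)  # spend, realized only for buyers
revenue = buyers * spend              # 80% exact zeros

# Relative standard error of the mean = (sd / mean) / sqrt(n)
rel_se_overall = revenue.std(ddof=1) / revenue.mean() / np.sqrt(n)
rel_se_if_all_bought = spend.std(ddof=1) / spend.mean() / np.sqrt(n)

print(f"Overall mean:             {revenue.mean():.2f}  (~ 0.2 x buyer mean)")
print(f"Relative SE (with zeros): {rel_se_overall:.4f}")
print(f"Relative SE (no zeros):   {rel_se_if_all_bought:.4f}")
print(f"Sample size multiplier:   {(rel_se_overall / rel_se_if_all_bought) ** 2:.1f}x")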

Approaches to Zero-Heavy Data

Approach 1: Analyze Separately

Split into two analyses:

  1. Conversion: Did they have any positive value? (logistic regression)
  2. Amount: Among positives, what was the value? (standard methods)
import numpy as np
import pandas as pd
from scipy import stats
import statsmodels.api as sm


def separate_analysis(data, value_col, group_col):
    """
    Analyze zeros and positives separately.
    """
    results = {}

    # Part 1: Conversion (any positive value)
    data['is_positive'] = (data[value_col] > 0).astype(int)

    groups = sorted(data[group_col].unique())
    conv_control = data.loc[data[group_col] == groups[0], 'is_positive']
    conv_treatment = data.loc[data[group_col] == groups[1], 'is_positive']

    # Z-test for proportions
    n_c, n_t = len(conv_control), len(conv_treatment)
    p_c, p_t = conv_control.mean(), conv_treatment.mean()

    p_pooled = (conv_control.sum() + conv_treatment.sum()) / (n_c + n_t)
    se = np.sqrt(p_pooled * (1 - p_pooled) * (1/n_c + 1/n_t))
    z_stat = (p_t - p_c) / se
    p_value_conv = 2 * (1 - stats.norm.cdf(abs(z_stat)))

    results['conversion'] = {
        'control_rate': p_c,
        'treatment_rate': p_t,
        'lift': (p_t - p_c) / p_c if p_c > 0 else np.inf,
        'p_value': p_value_conv
    }

    # Part 2: Amount among positives
    pos_control = data.loc[(data[group_col] == groups[0]) & (data[value_col] > 0), value_col]
    pos_treatment = data.loc[(data[group_col] == groups[1]) & (data[value_col] > 0), value_col]

    if len(pos_control) > 1 and len(pos_treatment) > 1:
        t_stat, p_value_amount = stats.ttest_ind(pos_control, pos_treatment)

        results['amount_given_positive'] = {
            'control_mean': pos_control.mean(),
            'treatment_mean': pos_treatment.mean(),
            'lift': (pos_treatment.mean() - pos_control.mean()) / pos_control.mean(),
            'p_value': p_value_amount
        }

    # Combined effect (for context)
    all_control = data.loc[data[group_col] == groups[0], value_col]
    all_treatment = data.loc[data[group_col] == groups[1], value_col]

    results['overall'] = {
        'control_mean': all_control.mean(),
        'treatment_mean': all_treatment.mean(),
        'lift': (all_treatment.mean() - all_control.mean()) / all_control.mean() if all_control.mean() > 0 else np.inf
    }

    return results


# Example
np.random.seed(42)
n = 5000

# Simulate revenue data
data = pd.DataFrame({
    'group': np.random.choice(['control', 'treatment'], n),
    'revenue': np.where(
        np.random.binomial(1, 0.2, n),  # 20% purchasers
        np.random.lognormal(3, 1, n),    # Revenue among purchasers
        0
    )
})

# Treatment increases conversion slightly
treatment_mask = data['group'] == 'treatment'
data.loc[treatment_mask, 'revenue'] = np.where(
    np.random.binomial(1, 0.22, treatment_mask.sum()),  # 22% conversion
    np.random.lognormal(3, 1, treatment_mask.sum()),
    0
)

results = separate_analysis(data, 'revenue', 'group')

print("Separate Analysis Results")
print("=" * 50)
print(f"\nConversion (any purchase):")
print(f"  Control: {results['conversion']['control_rate']:.1%}")
print(f"  Treatment: {results['conversion']['treatment_rate']:.1%}")
print(f"  Lift: {results['conversion']['lift']:.1%}")
print(f"  p-value: {results['conversion']['p_value']:.4f}")

if 'amount_given_positive' in results:
    print(f"\nAmount (among purchasers):")
    print(f"  Control: ${results['amount_given_positive']['control_mean']:.2f}")
    print(f"  Treatment: ${results['amount_given_positive']['treatment_mean']:.2f}")
    print(f"  Lift: {results['amount_given_positive']['lift']:.1%}")
    print(f"  p-value: {results['amount_given_positive']['p_value']:.4f}")

print(f"\nOverall (all users):")
print(f"  Control: ${results['overall']['control_mean']:.2f}")
print(f"  Treatment: ${results['overall']['treatment_mean']:.2f}")
print(f"  Lift: {results['overall']['lift']:.1%}")

Approach 2: Two-Part (Hurdle) Model

Formally model both parts together:

$$P(Y = y) = \begin{cases} 1 - \pi & \text{if } y = 0 \\ \pi \cdot f(y \mid y > 0) & \text{if } y > 0 \end{cases}$$

Where π is probability of positive value, f is distribution of positive values.

import numpy as np
import statsmodels.api as sm


class TwoPartModel:
    """
    Two-part model: logistic for zero/positive, then log-normal for positives.
    """

    def __init__(self):
        self.logit_model = None
        self.continuous_model = None

    def fit(self, X, y):
        """
        Fit two-part model.

        Parameters:
        -----------
        X : array-like
            Covariates (should include constant)
        y : array-like
            Response variable with zeros
        """
        X = np.asarray(X)
        y = np.asarray(y)

        # Part 1: Logistic for any positive
        y_binary = (y > 0).astype(int)
        self.logit_model = sm.Logit(y_binary, X).fit(disp=0)

        # Part 2: OLS on log of positive values
        positive_mask = y > 0
        if positive_mask.sum() > X.shape[1]:
            self.continuous_model = sm.OLS(
                np.log(y[positive_mask]),
                X[positive_mask]
            ).fit()

        return self

    def predict_probability(self, X):
        """Predict probability of positive value."""
        return self.logit_model.predict(X)

    def predict_conditional_mean(self, X):
        """Predict mean given positive (on original scale)."""
        log_mean = self.continuous_model.predict(X)
        # Log-normal retransformation: E[Y | Y > 0] = exp(mu + sigma^2 / 2) under normal errors
        sigma = np.sqrt(self.continuous_model.mse_resid)
        return np.exp(log_mean + sigma**2 / 2)

    def predict_unconditional_mean(self, X):
        """Predict overall mean (including zeros)."""
        prob = self.predict_probability(X)
        conditional = self.predict_conditional_mean(X)
        return prob * conditional

    def summary(self):
        """Print model summaries."""
        print("Part 1: Logistic (Probability of Positive)")
        print("=" * 50)
        print(self.logit_model.summary())
        print("\n")
        print("Part 2: Log-Linear (Amount | Positive)")
        print("=" * 50)
        print(self.continuous_model.summary())


# Example: Treatment effect on revenue
np.random.seed(42)
n = 2000

# Generate data
treatment = np.random.binomial(1, 0.5, n)
# Treatment increases conversion probability
prob_purchase = 0.15 + 0.05 * treatment
purchased = np.random.binomial(1, prob_purchase)
# Treatment also increases amount among purchasers
log_revenue = 3 + 0.2 * treatment + np.random.normal(0, 1, n)
revenue = np.where(purchased, np.exp(log_revenue), 0)

# Fit two-part model
X = sm.add_constant(treatment)
model = TwoPartModel()
model.fit(X, revenue)

print("Two-Part Model Results")
print("=" * 60)
model.summary()

# Treatment effects
print("\n" + "=" * 60)
print("Treatment Effects:")
print("-" * 60)

# On probability
logit_coef = model.logit_model.params[1]
logit_se = model.logit_model.bse[1]
odds_ratio = np.exp(logit_coef)
print(f"Probability of purchase:")
print(f"  Odds ratio: {odds_ratio:.3f}")
print(f"  95% CI: ({np.exp(logit_coef - 1.96*logit_se):.3f}, {np.exp(logit_coef + 1.96*logit_se):.3f})")

# On amount
if model.continuous_model is not None:
    cont_coef = model.continuous_model.params[1]
    cont_se = model.continuous_model.bse[1]
    multiplier = np.exp(cont_coef)
    print(f"\nAmount given purchase:")
    print(f"  Multiplier: {multiplier:.3f}")
    print(f"  95% CI: ({np.exp(cont_coef - 1.96*cont_se):.3f}, {np.exp(cont_coef + 1.96*cont_se):.3f})")

# Predicted means
X_control = np.array([[1, 0]])
X_treatment = np.array([[1, 1]])
mean_control = model.predict_unconditional_mean(X_control)[0]
mean_treatment = model.predict_unconditional_mean(X_treatment)[0]
print(f"\nPredicted overall revenue:")
print(f"  Control: ${mean_control:.2f}")
print(f"  Treatment: ${mean_treatment:.2f}")
print(f"  Lift: {(mean_treatment - mean_control) / mean_control:.1%}")

R Implementation

library(pscl)
library(MASS)

# Two-part model in R
fit_two_part <- function(data, formula_binary, formula_continuous) {
    # Part 1: Logistic regression for P(any positive value)
    data$is_positive <- as.integer(data$revenue > 0)
    logit_model <- glm(formula_binary, data = data, family = binomial)

    # Part 2: Log-linear model among positives
    data_pos <- data[data$revenue > 0, ]
    lm_model <- lm(formula_continuous, data = data_pos)

    list(
        logit = logit_model,
        continuous = lm_model
    )
}

# Zero-inflated models
library(pscl)

# Zero-inflated Poisson (for counts)
zip_model <- zeroinfl(count ~ treatment | treatment,
                       data = data, dist = "poisson")

# Zero-inflated negative binomial
zinb_model <- zeroinfl(count ~ treatment | treatment,
                        data = data, dist = "negbin")

summary(zinb_model)

Zero-Inflated Models

The Mixture Concept

Zero-inflated models assume data comes from a mixture:

  • With probability ψ: always zero (structural)
  • With probability 1-ψ: follows count distribution (may also produce zeros)

$$P(Y = 0) = \psi + (1 - \psi) \cdot P(Y = 0 \mid \text{count model})$$

$$P(Y = k) = (1 - \psi) \cdot P(Y = k \mid \text{count model}), \quad k > 0$$
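
A quick sanity check of the mixture with illustrative values (ψ = 0.3, Poisson rate λ = 1.5): the zero probability combines structural zeros with ordinary Poisson zeros, and the mean is shrunk by the structural-zero share.

import numpy as np

psi, lam = 0.3, 1.5                        # illustrative values
p_zero = psi + (1 - psi) * np.exp(-lam)    # structural zeros + Poisson zeros
mean_y = (1 - psi) * lam                   # E[Y] under the ZIP mixture
print(f"P(Y = 0) = {p_zero:.3f}   (plain Poisson(1.5) would give {np.exp(-lam):.3f})")
print(f"E[Y]     = {mean_y:.2f}")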

When to Use Zero-Inflated vs. Two-Part

Scenario                         Model           Reasoning
Non-payers who could convert     Two-part        Zeros are "didn't buy yet"
Mix of bots and real users       Zero-inflated   Some are fundamentally different
Active users with no events      Either          Could go either way
Structural ineligibility         Zero-inflated   Can't have positive value

Code: Zero-Inflated Poisson

from scipy.optimize import minimize
from scipy.special import gammaln
import numpy as np


def fit_zip(y, X):
    """
    Fit zero-inflated Poisson model.

    Parameters:
    -----------
    y : array
        Count response
    X : array
        Covariates (with constant) for count model

    Returns dict with Poisson coefficients, structural-zero probability psi,
    convergence flag, and log-likelihood.
    """
    n = len(y)
    k = X.shape[1]

    def neg_log_likelihood(params):
        # Parameters: first k are Poisson coefficients, last 1 is logit(psi)
        beta = params[:k]
        logit_psi = params[k]

        psi = 1 / (1 + np.exp(-logit_psi))  # Probability of structural zero
        lambda_ = np.exp(X @ beta)

        # Log-likelihood
        ll = 0
        for i in range(n):
            if y[i] == 0:
                # Zero can come from either component
                p_zero = psi + (1 - psi) * np.exp(-lambda_[i])
                ll += np.log(p_zero + 1e-10)
            else:
                # Positive must come from Poisson
                ll += np.log(1 - psi + 1e-10)
                ll += y[i] * np.log(lambda_[i]) - lambda_[i] - gammaln(y[i] + 1)

        return -ll

    # Initial values
    init_params = np.zeros(k + 1)
    init_params[0] = np.log(y[y > 0].mean() + 0.1)  # Intercept for Poisson
    init_params[k] = 0  # psi = 0.5

    # Optimize
    result = minimize(neg_log_likelihood, init_params, method='BFGS')

    # Extract results
    beta = result.x[:k]
    psi = 1 / (1 + np.exp(-result.x[k]))

    return {
        'poisson_coef': beta,
        'psi': psi,
        'converged': result.success,
        'log_likelihood': -result.fun
    }


# Example: Support ticket counts
np.random.seed(42)
n = 1000

treatment = np.random.binomial(1, 0.5, n)
# 30% are structural zeros (never submit tickets)
structural_zero = np.random.binomial(1, 0.3, n)
# Among others, Poisson counts
lambda_ = np.exp(0.5 - 0.3 * treatment)  # Treatment reduces tickets
counts = np.where(
    structural_zero,
    0,
    np.random.poisson(lambda_)
)

# Fit model
X = np.column_stack([np.ones(n), treatment])
results = fit_zip(counts, X)

print("Zero-Inflated Poisson Results")
print("=" * 50)
print(f"Structural zero probability (ψ): {results['psi']:.3f}")
print(f"Poisson intercept: {results['poisson_coef'][0]:.3f}")
print(f"Treatment effect on log-rate: {results['poisson_coef'][1]:.3f}")
print(f"Rate ratio: {np.exp(results['poisson_coef'][1]):.3f}")

Comparing Approaches

Simulation: Which Method Wins?

import numpy as np
import pandas as pd
from scipy import stats


def simulate_zero_heavy(n_per_group, zero_rate, true_effect_conv, true_effect_amount):
    """
    Simulate zero-heavy data with known treatment effects.
    """
    # Control group
    is_positive_c = np.random.binomial(1, 1 - zero_rate, n_per_group)
    amount_c = np.where(is_positive_c, np.random.lognormal(3, 1, n_per_group), 0)

    # Treatment group
    is_positive_t = np.random.binomial(1, (1 - zero_rate) * (1 + true_effect_conv), n_per_group)
    amount_t = np.where(
        is_positive_t,
        np.random.lognormal(3 + np.log(1 + true_effect_amount), 1, n_per_group),
        0
    )

    return amount_c, amount_t


def compare_methods(n_simulations=500):
    """
    Compare different methods for analyzing zero-heavy data.
    """
    np.random.seed(42)

    results = {
        'naive_ttest': [],
        'log_ttest': [],  # Log transform (excluding zeros)
        'wilcoxon': [],
        'separate_conv': [],
        'separate_amount': []
    }

    # True effects
    true_conv_effect = 0.10  # 10% lift in conversion
    true_amount_effect = 0.05  # 5% lift in amount among buyers

    for _ in range(n_simulations):
        control, treatment = simulate_zero_heavy(
            n_per_group=2000,
            zero_rate=0.8,  # 80% zeros
            true_effect_conv=true_conv_effect,
            true_effect_amount=true_amount_effect
        )

        # Method 1: Naive t-test
        _, p_naive = stats.ttest_ind(control, treatment)
        results['naive_ttest'].append(p_naive < 0.05)

        # Method 2: T-test on log (excluding zeros)
        log_c = np.log(control[control > 0])
        log_t = np.log(treatment[treatment > 0])
        _, p_log = stats.ttest_ind(log_c, log_t)
        results['log_ttest'].append(p_log < 0.05)

        # Method 3: Mann-Whitney
        _, p_mw = stats.mannwhitneyu(control, treatment, alternative='two-sided')
        results['wilcoxon'].append(p_mw < 0.05)

        # Method 4: Separate - conversion
        conv_c = (control > 0).mean()
        conv_t = (treatment > 0).mean()
        n = len(control)
        p_pool = (conv_c + conv_t) / 2
        se = np.sqrt(p_pool * (1 - p_pool) * 2 / n)
        z = (conv_t - conv_c) / se
        results['separate_conv'].append(abs(z) > 1.96)

        # Method 5: Separate - amount among positives
        _, p_amount = stats.ttest_ind(control[control > 0], treatment[treatment > 0])
        results['separate_amount'].append(p_amount < 0.05)

    # Power summary
    power_df = pd.DataFrame({
        'Method': ['Naive t-test', 'Log t-test (positives)', 'Mann-Whitney',
                   'Conversion test', 'Amount test (positives)'],
        'Power': [np.mean(results['naive_ttest']), np.mean(results['log_ttest']),
                  np.mean(results['wilcoxon']), np.mean(results['separate_conv']),
                  np.mean(results['separate_amount'])]
    })

    return power_df


power_results = compare_methods()
print("Power Comparison (500 simulations)")
print("True effects: +10% conversion, +5% amount among buyers")
print("Data: 80% zeros, n=2000 per group")
print("=" * 50)
print(power_results.to_string(index=False))

Results Interpretation

Typical findings:

  • Naive t-test: Low power (variance dominated by zeros)
  • Log t-test: Good power for amount effect, misses conversion effect
  • Mann-Whitney: Moderate power, tests stochastic dominance
  • Conversion test: High power for conversion effect
  • Amount test: High power for amount effect

Key insight: Separate analyses have higher power for their respective effects.


Decision Framework

START: Metric has many zeros
       ↓
QUESTION: What's your primary question?
├── Total effect (including zeros) → Continue
└── Effect among actives only → Analyze non-zeros directly
       ↓
QUESTION: Are zeros structural or sampling?
├── Structural (will never be positive) → Zero-inflated model
├── Sampling (could be positive) → Two-part/hurdle model
└── Unknown/Mixed → Two-part is usually safer
       ↓
QUESTION: How complex do you need?
├── Just want p-value → Separate analyses
├── Want combined inference → Formal two-part model
└── Want predictions → Full model with covariates
       ↓
VALIDATE: Check model fit
- Compare predicted vs. observed zero rate
- Check distribution of positive predictions
- Assess residuals in continuous part
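
The first check (predicted vs. observed zero rate) is straightforward for the two-part model fitted earlier; a minimal sketch reusing model, X, and revenue from that example (the same comparison applies to a zero-inflated fit):

# Share of exact zeros the model implies vs. what the data shows
observed_zero_rate = (revenue == 0).mean()
predicted_zero_rate = (1 - model.predict_probability(X)).mean()
print(f"Observed zero rate:  {observed_zero_rate:.3f}")
print(f"Predicted zero rate: {predicted_zero_rate:.3f}")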

Practical Recommendations

When Simple Works

  1. Low stakes, directional answer: Just compare means (accept variance)
  2. Clear separation: Analyze conversion and amount separately
  3. Only one part matters: Just test that part

When to Model

  1. Need combined confidence interval: Two-part model (see the bootstrap sketch after this list)
  2. Covariates differ for zero vs. positive: Separate covariate effects
  3. Structural zeros identifiable: Zero-inflated model
  4. Prediction is goal: Full model specification
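
For the combined confidence interval in item 1, one straightforward route is to bootstrap both parts together: resample users, recompute the overall means (zeros included), and take percentile bounds on the lift. A minimal sketch, reusing the simulate_zero_heavy helper from the comparison section above:

import numpy as np

def bootstrap_overall_lift(control, treatment, n_boot=2000, seed=0):
    """Percentile CI for lift in the overall mean (conversion and amount combined)."""
    rng = np.random.default_rng(seed)
    lifts = np.empty(n_boot)
    for b in range(n_boot):
        c = rng.choice(control, size=len(control), replace=True)
        t = rng.choice(treatment, size=len(treatment), replace=True)
        lifts[b] = t.mean() / c.mean() - 1   # overall lift, zeros included
    return np.percentile(lifts, [2.5, 97.5])

control, treatment = simulate_zero_heavy(2000, zero_rate=0.8,
                                          true_effect_conv=0.10,
                                          true_effect_amount=0.05)
lo, hi = bootstrap_overall_lift(control, treatment)
print(f"Overall lift 95% CI: ({lo:.1%}, {hi:.1%})")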

Reporting

Always disclose:

  1. What proportion of data are zeros
  2. Which model/approach you used and why
  3. Results for both parts if applicable
  4. Sensitivity to modeling choices


Key Takeaway

Zeros in product metrics often represent distinct populations—users who won't engage vs. users who could but didn't. Standard methods that treat zeros as "small values" lose power and can mislead. Two-part models separate the "if" from the "how much," giving you targeted effects for each. Zero-inflated models explicitly handle structural zeros. But don't overcomplicate: often analyzing conversion and amount separately answers your question more directly than a fancy model. Choose based on what effect you're trying to measure and communicate.


References

  1. https://doi.org/10.1177/0962280212443324
  2. https://doi.org/10.1002/sim.1401
  3. https://pubs.aeaweb.org/doi/10.1257/jep.15.4.157
  4. Mullahy, J. (1998). Much ado about two: reconsidering retransformation and the two-part model in health econometrics. *Journal of Health Economics*, 17(3), 247-281.
  5. Lambert, D. (1992). Zero-inflated Poisson regression, with an application to defects in manufacturing. *Technometrics*, 34(1), 1-14.
  6. Duan, N., Manning, W. G., Morris, C. N., & Newhouse, J. P. (1983). A comparison of alternative models for the demand for medical care. *Journal of Business & Economic Statistics*, 1(2), 115-126.

Frequently Asked Questions

When should I use a two-part model vs. zero-inflated model?
Two-part (hurdle) models assume any positive value comes from one process—you cross a hurdle, then the amount follows a distribution. Zero-inflated models assume zeros can come from two sources: always-zero (structural) or could-be-positive-but-happened-to-be-zero. Use two-part when the distinction is behavioral (buy vs. don't buy), zero-inflated when some units are fundamentally different (bots vs. real users).

Can I just analyze non-zero values separately?
Yes, this is often the right approach. Analyzing revenue among purchasers answers 'does treatment increase spend among buyers?' which may be exactly your question. Just be clear that you're conditioning on purchase. For total revenue effect, you need to combine purchase probability and amount.

How many zeros is 'too many' for standard methods?
There's no hard cutoff, but warnings apply when zeros exceed 20-30% of data. Above 50%, standard methods are usually inappropriate. The bigger issue is whether zeros represent a distinct population (structural) or just low probability events (sampling).

