
Dealing with Zeros: Zero-Inflated and Two-Part Models

How to handle metrics with many zeros—revenue from non-purchasers, engagement from inactive users, events that didn't happen. Learn when to use zero-inflated models, two-part models, and simpler alternatives.


Quick Hits

  • Two populations: structural zeros (will never purchase) vs. sampling zeros (could but didn't)
  • Two-part models: separate probability of any value from amount given positive
  • Zero-inflated models: mixture of always-zero and standard distribution
  • Often simpler approaches work: analyze non-zeros separately, or just accept the variance
  • Choice depends on whether you care about total effect or conditional effect

TL;DR

Many product metrics have lots of zeros: revenue (non-purchasers), engagement (inactive users), support tickets (no issues). Standard methods struggle because zeros represent a distinct population, not just small values. Two-part models split the problem: model probability of any positive value, then model the amount given positive. Zero-inflated models treat some zeros as "structural" (will never be positive) and others as "sampling" (could be positive but weren't). Often, simpler approaches work: analyze positives separately, or use robust methods that handle the variance.


Why Zeros Are Different

Two Types of Zeros

Structural zeros: User fundamentally won't have this behavior

  • Free users who can't purchase (paywall)
  • Users in regions without the feature
  • Bot accounts (no real engagement)

Sampling zeros: User could have the behavior but didn't in this period

  • Purchaser who didn't buy this week
  • Active user who didn't engage today
  • Customer who had no support issues

Why This Matters

Standard models (linear regression, t-tests) treat all zeros the same—as small values. But:

  • Structural zeros shouldn't be "converted" by treatment
  • Combining populations inflates variance
  • Mean is pulled toward zero, masking effects among actives

The Math Problem

For revenue with 80% zeros:

  • Mean = 0.2 × (mean among buyers)
  • Variance is huge (zero vs. non-zero dominates)
  • Standard error is enormous → low power
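
To put numbers on this, here is a small back-of-the-envelope sketch (illustrative values, not from any real dataset). It compares the relative standard error of the overall mean against a hypothetical world where every user purchases, and translates the gap into a sample-size penalty:

import numpy as np

np.random.seed(0)
n = 100_000
p_buy = 0.20                          # 20% of users purchase
buyers = np.random.binomial(1, p_buy, n)
spend = np.random.lognormal(3, 1, n)  # spend, realized only for buyers
revenue = buyers * spend              # 80% exact zeros

# Relative standard error of the mean = (sd / mean) / sqrt(n)
rel_se_overall = revenue.std(ddof=1) / revenue.mean() / np.sqrt(n)
rel_se_if_all_bought = spend.std(ddof=1) / spend.mean() / np.sqrt(n)

print(f"Overall mean:             {revenue.mean():.2f}  (~ 0.2 x buyer mean)")
print(f"Relative SE (with zeros): {rel_se_overall:.4f}")
print(f"Relative SE (no zeros):   {rel_se_if_all_bought:.4f}")
print(f"Sample size multiplier:   {(rel_se_overall / rel_se_if_all_bought) ** 2:.1f}x")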

Approaches to Zero-Heavy Data

Approach 1: Analyze Separately

Split into two analyses:

  1. Conversion: Did they have any positive value? (logistic regression)
  2. Amount: Among positives, what was the value? (standard methods)
import numpy as np
import pandas as pd
from scipy import stats
import statsmodels.api as sm


def separate_analysis(data, value_col, group_col):
    """
    Analyze zeros and positives separately.
    """
    results = {}

    # Part 1: Conversion (any positive value)
    data['is_positive'] = (data[value_col] > 0).astype(int)

    groups = sorted(data[group_col].unique())
    conv_control = data.loc[data[group_col] == groups[0], 'is_positive']
    conv_treatment = data.loc[data[group_col] == groups[1], 'is_positive']

    # Z-test for proportions
    n_c, n_t = len(conv_control), len(conv_treatment)
    p_c, p_t = conv_control.mean(), conv_treatment.mean()

    p_pooled = (conv_control.sum() + conv_treatment.sum()) / (n_c + n_t)
    se = np.sqrt(p_pooled * (1 - p_pooled) * (1/n_c + 1/n_t))
    z_stat = (p_t - p_c) / se
    p_value_conv = 2 * (1 - stats.norm.cdf(abs(z_stat)))

    results['conversion'] = {
        'control_rate': p_c,
        'treatment_rate': p_t,
        'lift': (p_t - p_c) / p_c if p_c > 0 else np.inf,
        'p_value': p_value_conv
    }

    # Part 2: Amount among positives
    pos_control = data.loc[(data[group_col] == groups[0]) & (data[value_col] > 0), value_col]
    pos_treatment = data.loc[(data[group_col] == groups[1]) & (data[value_col] > 0), value_col]

    if len(pos_control) > 1 and len(pos_treatment) > 1:
        t_stat, p_value_amount = stats.ttest_ind(pos_control, pos_treatment)

        results['amount_given_positive'] = {
            'control_mean': pos_control.mean(),
            'treatment_mean': pos_treatment.mean(),
            'lift': (pos_treatment.mean() - pos_control.mean()) / pos_control.mean(),
            'p_value': p_value_amount
        }

    # Combined effect (for context)
    all_control = data.loc[data[group_col] == groups[0], value_col]
    all_treatment = data.loc[data[group_col] == groups[1], value_col]

    results['overall'] = {
        'control_mean': all_control.mean(),
        'treatment_mean': all_treatment.mean(),
        'lift': (all_treatment.mean() - all_control.mean()) / all_control.mean() if all_control.mean() > 0 else np.inf
    }

    return results


# Example
np.random.seed(42)
n = 5000

# Simulate revenue data
data = pd.DataFrame({
    'group': np.random.choice(['control', 'treatment'], n),
    'revenue': np.where(
        np.random.binomial(1, 0.2, n),  # 20% purchasers
        np.random.lognormal(3, 1, n),    # Revenue among purchasers
        0
    )
})

# Treatment increases conversion slightly
treatment_mask = data['group'] == 'treatment'
data.loc[treatment_mask, 'revenue'] = np.where(
    np.random.binomial(1, 0.22, treatment_mask.sum()),  # 22% conversion
    np.random.lognormal(3, 1, treatment_mask.sum()),
    0
)

results = separate_analysis(data, 'revenue', 'group')

print("Separate Analysis Results")
print("=" * 50)
print(f"\nConversion (any purchase):")
print(f"  Control: {results['conversion']['control_rate']:.1%}")
print(f"  Treatment: {results['conversion']['treatment_rate']:.1%}")
print(f"  Lift: {results['conversion']['lift']:.1%}")
print(f"  p-value: {results['conversion']['p_value']:.4f}")

if 'amount_given_positive' in results:
    print(f"\nAmount (among purchasers):")
    print(f"  Control: ${results['amount_given_positive']['control_mean']:.2f}")
    print(f"  Treatment: ${results['amount_given_positive']['treatment_mean']:.2f}")
    print(f"  Lift: {results['amount_given_positive']['lift']:.1%}")
    print(f"  p-value: {results['amount_given_positive']['p_value']:.4f}")

print(f"\nOverall (all users):")
print(f"  Control: ${results['overall']['control_mean']:.2f}")
print(f"  Treatment: ${results['overall']['treatment_mean']:.2f}")
print(f"  Lift: {results['overall']['lift']:.1%}")

Approach 2: Two-Part (Hurdle) Model

Formally model both parts together:

$$P(Y = y) = \begin{cases} 1 - \pi & \text{if } y = 0 \\ \pi \cdot f(y \mid y > 0) & \text{if } y > 0 \end{cases}$$

Where π is probability of positive value, f is distribution of positive values.

import numpy as np
import statsmodels.api as sm


class TwoPartModel:
    """
    Two-part model: logistic for zero/positive, then log-normal for positives.
    """

    def __init__(self):
        self.logit_model = None
        self.continuous_model = None

    def fit(self, X, y):
        """
        Fit two-part model.

        Parameters:
        -----------
        X : array-like
            Covariates (should include constant)
        y : array-like
            Response variable with zeros
        """
        X = np.asarray(X)
        y = np.asarray(y)

        # Part 1: Logistic for any positive
        y_binary = (y > 0).astype(int)
        self.logit_model = sm.Logit(y_binary, X).fit(disp=0)

        # Part 2: OLS on log of positive values
        positive_mask = y > 0
        if positive_mask.sum() > X.shape[1]:
            self.continuous_model = sm.OLS(
                np.log(y[positive_mask]),
                X[positive_mask]
            ).fit()

        return self

    def predict_probability(self, X):
        """Predict probability of positive value."""
        return self.logit_model.predict(X)

    def predict_conditional_mean(self, X):
        """Predict mean given positive (on original scale)."""
        log_mean = self.continuous_model.predict(X)
        # Log-normal retransformation: E[Y | Y > 0] = exp(mu + sigma^2 / 2) under normal errors
        sigma = np.sqrt(self.continuous_model.mse_resid)
        return np.exp(log_mean + sigma**2 / 2)

    def predict_unconditional_mean(self, X):
        """Predict overall mean (including zeros)."""
        prob = self.predict_probability(X)
        conditional = self.predict_conditional_mean(X)
        return prob * conditional

    def summary(self):
        """Print model summaries."""
        print("Part 1: Logistic (Probability of Positive)")
        print("=" * 50)
        print(self.logit_model.summary())
        print("\n")
        print("Part 2: Log-Linear (Amount | Positive)")
        print("=" * 50)
        print(self.continuous_model.summary())


# Example: Treatment effect on revenue
np.random.seed(42)
n = 2000

# Generate data
treatment = np.random.binomial(1, 0.5, n)
# Treatment increases conversion probability
prob_purchase = 0.15 + 0.05 * treatment
purchased = np.random.binomial(1, prob_purchase)
# Treatment also increases amount among purchasers
log_revenue = 3 + 0.2 * treatment + np.random.normal(0, 1, n)
revenue = np.where(purchased, np.exp(log_revenue), 0)

# Fit two-part model
X = sm.add_constant(treatment)
model = TwoPartModel()
model.fit(X, revenue)

print("Two-Part Model Results")
print("=" * 60)
model.summary()

# Treatment effects
print("\n" + "=" * 60)
print("Treatment Effects:")
print("-" * 60)

# On probability
logit_coef = model.logit_model.params[1]
logit_se = model.logit_model.bse[1]
odds_ratio = np.exp(logit_coef)
print(f"Probability of purchase:")
print(f"  Odds ratio: {odds_ratio:.3f}")
print(f"  95% CI: ({np.exp(logit_coef - 1.96*logit_se):.3f}, {np.exp(logit_coef + 1.96*logit_se):.3f})")

# On amount
if model.continuous_model is not None:
    cont_coef = model.continuous_model.params[1]
    cont_se = model.continuous_model.bse[1]
    multiplier = np.exp(cont_coef)
    print(f"\nAmount given purchase:")
    print(f"  Multiplier: {multiplier:.3f}")
    print(f"  95% CI: ({np.exp(cont_coef - 1.96*cont_se):.3f}, {np.exp(cont_coef + 1.96*cont_se):.3f})")

# Predicted means
X_control = np.array([[1, 0]])
X_treatment = np.array([[1, 1]])
mean_control = model.predict_unconditional_mean(X_control)[0]
mean_treatment = model.predict_unconditional_mean(X_treatment)[0]
print(f"\nPredicted overall revenue:")
print(f"  Control: ${mean_control:.2f}")
print(f"  Treatment: ${mean_treatment:.2f}")
print(f"  Lift: {(mean_treatment - mean_control) / mean_control:.1%}")

R Implementation

library(pscl)
library(MASS)

# Two-part model in R
fit_two_part <- function(data, formula_binary, formula_continuous) {
    # Part 1: Logistic regression for P(any positive value)
    data$is_positive <- as.integer(data$revenue > 0)
    logit_model <- glm(formula_binary, data = data, family = binomial)

    # Part 2: Log-linear model among positives
    data_pos <- data[data$revenue > 0, ]
    lm_model <- lm(formula_continuous, data = data_pos)

    list(
        logit = logit_model,
        continuous = lm_model
    )
}

# Zero-inflated models
library(pscl)

# Zero-inflated Poisson (for counts)
zip_model <- zeroinfl(count ~ treatment | treatment,
                       data = data, dist = "poisson")

# Zero-inflated negative binomial
zinb_model <- zeroinfl(count ~ treatment | treatment,
                        data = data, dist = "negbin")

summary(zinb_model)

Zero-Inflated Models

The Mixture Concept

Zero-inflated models assume data comes from a mixture:

  • With probability ψ: always zero (structural)
  • With probability 1-ψ: follows count distribution (may also produce zeros)

$$P(Y = 0) = \psi + (1 - \psi) \cdot P(Y = 0 \mid \text{count model})$$

$$P(Y = k) = (1 - \psi) \cdot P(Y = k \mid \text{count model}), \quad k > 0$$
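
A quick sanity check of the mixture with illustrative values (ψ = 0.3, Poisson rate λ = 1.5): the zero probability combines structural zeros with ordinary Poisson zeros, and the mean is shrunk by the structural-zero share.

import numpy as np

psi, lam = 0.3, 1.5                        # illustrative values
p_zero = psi + (1 - psi) * np.exp(-lam)    # structural zeros + Poisson zeros
mean_y = (1 - psi) * lam                   # E[Y] under the ZIP mixture
print(f"P(Y = 0) = {p_zero:.3f}   (plain Poisson(1.5) would give {np.exp(-lam):.3f})")
print(f"E[Y]     = {mean_y:.2f}")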

When to Use Zero-Inflated vs. Two-Part

Scenario                         Model           Reasoning
Non-payers who could convert     Two-part        Zeros are "didn't buy yet"
Mix of bots and real users       Zero-inflated   Some are fundamentally different
Active users with no events      Either          Could go either way
Structural ineligibility         Zero-inflated   Can't have positive value

Code: Zero-Inflated Poisson

from scipy.optimize import minimize
from scipy.special import gammaln
import numpy as np


def fit_zip(y, X):
    """
    Fit zero-inflated Poisson model.

    Parameters:
    -----------
    y : array
        Count response
    X : array
        Covariates (with constant) for count model

    Returns dict with Poisson coefficients, structural-zero probability psi,
    convergence flag, and log-likelihood.
    """
    n = len(y)
    k = X.shape[1]

    def neg_log_likelihood(params):
        # Parameters: first k are Poisson coefficients, last 1 is logit(psi)
        beta = params[:k]
        logit_psi = params[k]

        psi = 1 / (1 + np.exp(-logit_psi))  # Probability of structural zero
        lambda_ = np.exp(X @ beta)

        # Log-likelihood
        ll = 0
        for i in range(n):
            if y[i] == 0:
                # Zero can come from either component
                p_zero = psi + (1 - psi) * np.exp(-lambda_[i])
                ll += np.log(p_zero + 1e-10)
            else:
                # Positive must come from Poisson
                ll += np.log(1 - psi + 1e-10)
                ll += y[i] * np.log(lambda_[i]) - lambda_[i] - gammaln(y[i] + 1)

        return -ll

    # Initial values
    init_params = np.zeros(k + 1)
    init_params[0] = np.log(y[y > 0].mean() + 0.1)  # Intercept for Poisson
    init_params[k] = 0  # psi = 0.5

    # Optimize
    result = minimize(neg_log_likelihood, init_params, method='BFGS')

    # Extract results
    beta = result.x[:k]
    psi = 1 / (1 + np.exp(-result.x[k]))

    return {
        'poisson_coef': beta,
        'psi': psi,
        'converged': result.success,
        'log_likelihood': -result.fun
    }


# Example: Support ticket counts
np.random.seed(42)
n = 1000

treatment = np.random.binomial(1, 0.5, n)
# 30% are structural zeros (never submit tickets)
structural_zero = np.random.binomial(1, 0.3, n)
# Among others, Poisson counts
lambda_ = np.exp(0.5 - 0.3 * treatment)  # Treatment reduces tickets
counts = np.where(
    structural_zero,
    0,
    np.random.poisson(lambda_)
)

# Fit model
X = np.column_stack([np.ones(n), treatment])
results = fit_zip(counts, X)

print("Zero-Inflated Poisson Results")
print("=" * 50)
print(f"Structural zero probability (ψ): {results['psi']:.3f}")
print(f"Poisson intercept: {results['poisson_coef'][0]:.3f}")
print(f"Treatment effect on log-rate: {results['poisson_coef'][1]:.3f}")
print(f"Rate ratio: {np.exp(results['poisson_coef'][1]):.3f}")

Comparing Approaches

Simulation: Which Method Wins?

import numpy as np
import pandas as pd
from scipy import stats


def simulate_zero_heavy(n_per_group, zero_rate, true_effect_conv, true_effect_amount):
    """
    Simulate zero-heavy data with known treatment effects.
    """
    # Control group
    is_positive_c = np.random.binomial(1, 1 - zero_rate, n_per_group)
    amount_c = np.where(is_positive_c, np.random.lognormal(3, 1, n_per_group), 0)

    # Treatment group
    is_positive_t = np.random.binomial(1, (1 - zero_rate) * (1 + true_effect_conv), n_per_group)
    amount_t = np.where(
        is_positive_t,
        np.random.lognormal(3 + np.log(1 + true_effect_amount), 1, n_per_group),
        0
    )

    return amount_c, amount_t


def compare_methods(n_simulations=500):
    """
    Compare different methods for analyzing zero-heavy data.
    """
    np.random.seed(42)

    results = {
        'naive_ttest': [],
        'log_ttest': [],  # Log transform (excluding zeros)
        'wilcoxon': [],
        'separate_conv': [],
        'separate_amount': []
    }

    # True effects
    true_conv_effect = 0.10  # 10% lift in conversion
    true_amount_effect = 0.05  # 5% lift in amount among buyers

    for _ in range(n_simulations):
        control, treatment = simulate_zero_heavy(
            n_per_group=2000,
            zero_rate=0.8,  # 80% zeros
            true_effect_conv=true_conv_effect,
            true_effect_amount=true_amount_effect
        )

        # Method 1: Naive t-test
        _, p_naive = stats.ttest_ind(control, treatment)
        results['naive_ttest'].append(p_naive < 0.05)

        # Method 2: T-test on log (excluding zeros)
        log_c = np.log(control[control > 0])
        log_t = np.log(treatment[treatment > 0])
        _, p_log = stats.ttest_ind(log_c, log_t)
        results['log_ttest'].append(p_log < 0.05)

        # Method 3: Mann-Whitney
        _, p_mw = stats.mannwhitneyu(control, treatment, alternative='two-sided')
        results['wilcoxon'].append(p_mw < 0.05)

        # Method 4: Separate - conversion
        conv_c = (control > 0).mean()
        conv_t = (treatment > 0).mean()
        n = len(control)
        p_pool = (conv_c + conv_t) / 2
        se = np.sqrt(p_pool * (1 - p_pool) * 2 / n)
        z = (conv_t - conv_c) / se
        results['separate_conv'].append(abs(z) > 1.96)

        # Method 5: Separate - amount among positives
        _, p_amount = stats.ttest_ind(control[control > 0], treatment[treatment > 0])
        results['separate_amount'].append(p_amount < 0.05)

    # Power summary
    power_df = pd.DataFrame({
        'Method': ['Naive t-test', 'Log t-test (positives)', 'Mann-Whitney',
                   'Conversion test', 'Amount test (positives)'],
        'Power': [np.mean(results['naive_ttest']), np.mean(results['log_ttest']),
                  np.mean(results['wilcoxon']), np.mean(results['separate_conv']),
                  np.mean(results['separate_amount'])]
    })

    return power_df


power_results = compare_methods()
print("Power Comparison (500 simulations)")
print("True effects: +10% conversion, +5% amount among buyers")
print("Data: 80% zeros, n=2000 per group")
print("=" * 50)
print(power_results.to_string(index=False))

Results Interpretation

Typical findings:

  • Naive t-test: Low power (variance dominated by zeros)
  • Log t-test: Good power for amount effect, misses conversion effect
  • Mann-Whitney: Moderate power, tests stochastic dominance
  • Conversion test: High power for conversion effect
  • Amount test: High power for amount effect

Key insight: Separate analyses have higher power for their respective effects.


Decision Framework

START: Metric has many zeros
       ↓
QUESTION: What's your primary question?
├── Total effect (including zeros) → Continue
└── Effect among actives only → Analyze non-zeros directly
       ↓
QUESTION: Are zeros structural or sampling?
├── Structural (will never be positive) → Zero-inflated model
├── Sampling (could be positive) → Two-part/hurdle model
└── Unknown/Mixed → Two-part is usually safer
       ↓
QUESTION: How complex do you need?
├── Just want p-value → Separate analyses
├── Want combined inference → Formal two-part model
└── Want predictions → Full model with covariates
       ↓
VALIDATE: Check model fit
- Compare predicted vs. observed zero rate
- Check distribution of positive predictions
- Assess residuals in continuous part
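
The first check (predicted vs. observed zero rate) is straightforward for the two-part model fitted earlier; a minimal sketch reusing model, X, and revenue from that example (the same comparison applies to a zero-inflated fit):

# Share of exact zeros the model implies vs. what the data shows
observed_zero_rate = (revenue == 0).mean()
predicted_zero_rate = (1 - model.predict_probability(X)).mean()
print(f"Observed zero rate:  {observed_zero_rate:.3f}")
print(f"Predicted zero rate: {predicted_zero_rate:.3f}")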

Practical Recommendations

When Simple Works

  1. Low stakes, directional answer: Just compare means (accept variance)
  2. Clear separation: Analyze conversion and amount separately
  3. Only one part matters: Just test that part

When to Model

  1. Need combined confidence interval: Two-part model (see the bootstrap sketch after this list)
  2. Covariates differ for zero vs. positive: Separate covariate effects
  3. Structural zeros identifiable: Zero-inflated model
  4. Prediction is goal: Full model specification
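
For the combined confidence interval in item 1, one straightforward route is to bootstrap both parts together: resample users, recompute the overall means (zeros included), and take percentile bounds on the lift. A minimal sketch, reusing the simulate_zero_heavy helper from the comparison section above:

import numpy as np

def bootstrap_overall_lift(control, treatment, n_boot=2000, seed=0):
    """Percentile CI for lift in the overall mean (conversion and amount combined)."""
    rng = np.random.default_rng(seed)
    lifts = np.empty(n_boot)
    for b in range(n_boot):
        c = rng.choice(control, size=len(control), replace=True)
        t = rng.choice(treatment, size=len(treatment), replace=True)
        lifts[b] = t.mean() / c.mean() - 1   # overall lift, zeros included
    return np.percentile(lifts, [2.5, 97.5])

control, treatment = simulate_zero_heavy(2000, zero_rate=0.8,
                                          true_effect_conv=0.10,
                                          true_effect_amount=0.05)
lo, hi = bootstrap_overall_lift(control, treatment)
print(f"Overall lift 95% CI: ({lo:.1%}, {hi:.1%})")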

Reporting

Always disclose:

  1. What proportion of data are zeros
  2. Which model/approach you used and why
  3. Results for both parts if applicable
  4. Sensitivity to modeling choices


Key Takeaway

Zeros in product metrics often represent distinct populations—users who won't engage vs. users who could but didn't. Standard methods that treat zeros as "small values" lose power and can mislead. Two-part models separate the "if" from the "how much," giving you targeted effects for each. Zero-inflated models explicitly handle structural zeros. But don't overcomplicate: often analyzing conversion and amount separately answers your question more directly than a fancy model. Choose based on what effect you're trying to measure and communicate.


References

  1. https://doi.org/10.1177/0962280212443324
  2. https://doi.org/10.1002/sim.1401
  3. https://pubs.aeaweb.org/doi/10.1257/jep.15.4.157
  4. Mullahy, J. (1998). Much ado about two: reconsidering retransformation and the two-part model in health econometrics. *Journal of Health Economics*, 17(3), 247-281.
  5. Lambert, D. (1992). Zero-inflated Poisson regression, with an application to defects in manufacturing. *Technometrics*, 34(1), 1-14.
  6. Duan, N., Manning, W. G., Morris, C. N., & Newhouse, J. P. (1983). A comparison of alternative models for the demand for medical care. *Journal of Business & Economic Statistics*, 1(2), 115-126.

Frequently Asked Questions

When should I use a two-part model vs. zero-inflated model?
Two-part (hurdle) models assume any positive value comes from one process—you cross a hurdle, then the amount follows a distribution. Zero-inflated models assume zeros can come from two sources: always-zero (structural) or could-be-positive-but-happened-to-be-zero. Use two-part when the distinction is behavioral (buy vs. don't buy), zero-inflated when some units are fundamentally different (bots vs. real users).

Can I just analyze non-zero values separately?
Yes, this is often the right approach. Analyzing revenue among purchasers answers 'does treatment increase spend among buyers?' which may be exactly your question. Just be clear that you're conditioning on purchase. For total revenue effect, you need to combine purchase probability and amount.

How many zeros is 'too many' for standard methods?
There's no hard cutoff, but warnings apply when zeros exceed 20-30% of data. Above 50%, standard methods are usually inappropriate. The bigger issue is whether zeros represent a distinct population (structural) or just low probability events (sampling).

