Survival Analysis

Time-to-Event and Retention Analysis: Survival Methods for Tech

A comprehensive guide to survival analysis for product analysts. Learn Kaplan-Meier curves for retention, log-rank tests for comparing groups, Cox regression for understanding drivers, and how to handle the unique challenges of tech product data.


Quick Hits

  • Survival analysis handles 'how long until event?' with incomplete data
  • Censoring = we know they haven't churned YET, but don't know their final outcome
  • Kaplan-Meier curves show retention over time, handling censoring properly
  • Log-rank test compares survival curves between groups
  • Cox regression identifies factors that increase/decrease hazard (risk of event)

TL;DR

Survival analysis answers "how long until an event?" when not everyone has experienced the event yet. In tech, this means retention curves, time-to-conversion, time-to-feature-adoption, and more. The key challenge is censoring—users who haven't churned yet still provide information. Kaplan-Meier curves visualize survival over time, log-rank tests compare groups, and Cox regression identifies what drives faster or slower events.


Why Standard Methods Fail

The Problem

You want to know: "What's the typical time until users churn?"

Naive approach: Average time among users who churned.

Problem: This excludes users who haven't churned yet (censored), biasing your estimate.

Example:

  • User A: Churned at day 30
  • User B: Churned at day 45
  • User C: Still active at day 60

Naive average: (30 + 45) / 2 = 37.5 days

But User C has already survived 60 days! They'll push the average higher when they eventually churn. Excluding them underestimates retention.
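
A quick sketch with lifelines makes the bias concrete (standard lifelines API; the numbers are the three users above):

from lifelines import KaplanMeierFitter

# Users A, B, C from above: two churn events and one censored observation
times = [30, 45, 60]
events = [1, 1, 0]   # 1 = churned, 0 = still active at last observation

naive = sum(t for t, e in zip(times, events) if e) / sum(events)
print(f"Naive average among churned users: {naive:.1f} days")  # 37.5

kmf = KaplanMeierFitter().fit(times, events)
print(f"Kaplan-Meier median survival: {kmf.median_survival_time_:.0f} days")  # 45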

Survival Analysis Solution

Use all available information:

  • User A: Event at day 30
  • User B: Event at day 45
  • User C: Censored at day 60 (survived at least 60 days)

Survival methods incorporate partial information from censored observations.


Core Concepts

The Event

The outcome you're tracking:

  • Churn: User stops using the product
  • Conversion: Free user becomes paid
  • Feature adoption: User tries a specific feature
  • Return: User comes back after inactivity

Time Origin

When the clock starts:

  • Signup date: Time since registration
  • Feature launch: Time since feature became available
  • Experiment start: Time since randomization

Censoring

Censoring occurs when you don't observe the event:

Right censoring (most common):

  • User hasn't churned by end of study
  • User dropped out of observation (e.g., deleted account for unrelated reason)

Left censoring (rare):

  • Event happened before observation started
  • "How long since last purchase?" but user already purchased before tracking began

Interval censoring:

  • Event happened between observations
  • User was active Monday, churned by Friday (exact day unknown)

The Survival Function

Definition

$$S(t) = P(\text{survive beyond time } t) = P(T > t)$$

  • S(0) = 1 (everyone starts alive/retained)
  • S(t) decreases over time (or stays flat)
  • As t → ∞, S(t) → 0 if the event is inevitable; in practice, retention curves may plateau because some users never churn

The Hazard Function

$$h(t) = \lim_{\Delta t \to 0} \frac{P(t \leq T < t + \Delta t | T \geq t)}{\Delta t}$$

Interpretation: Instantaneous risk of the event at time t, given survival to time t.

  • High hazard = high risk of event right now
  • Hazard can increase, decrease, or stay constant over time

Hazard vs. Survival

$$S(t) = \exp\left(-\int_0^t h(u) du\right)$$

If you know hazard over time, you can derive survival (and vice versa).
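
For example, with a constant hazard $h(t) = \lambda$, the integral is simply $\lambda t$:

$$S(t) = e^{-\lambda t}, \qquad t_{\text{median}} = \frac{\ln 2}{\lambda}$$

A constant daily churn hazard of $\lambda = 0.01$ implies a median lifetime of $\ln 2 / 0.01 \approx 69$ days.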


Kaplan-Meier Estimation

The Method

Kaplan-Meier estimates the survival curve non-parametrically (no distributional assumptions).

$$\hat{S}(t) = \prod_{t_i \leq t} \left(1 - \frac{d_i}{n_i}\right)$$

Where:

  • $t_i$ = time of the i-th event
  • $d_i$ = number of events at time $t_i$
  • $n_i$ = number at risk just before time $t_i$

Interpretation

At each event time, multiply the previous survival by (1 - event rate at that moment).

Example:

Day   At Risk   Events   Survival
0     100       0        1.000
7     100       5        1 × (1 - 5/100) = 0.950
14    90        10       0.95 × (1 - 10/90) = 0.844
21    75        8        0.844 × (1 - 8/75) = 0.754

Handling Censoring

Censored observations reduce the "at risk" count but don't create a step down in survival.

If 5 users are censored between days 14 and 21, the at-risk count drops from 90 - 10 = 80 to 80 - 5 = 75 before day 21's events. (The same logic explains the table's drop from day 7 to day 14: 100 - 5 events - 5 censored = 90.)
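
A minimal sketch of this product-limit computation in plain Python, reproducing the table above:

# Product-limit computation for the worked example above.
# Each tuple: (event time, number at risk just before it, events at that time)
steps = [(7, 100, 5), (14, 90, 10), (21, 75, 8)]

s = 1.0
for t, n_at_risk, d in steps:
    s *= 1 - d / n_at_risk
    print(f"day {t:>2}: S = {s:.3f}")
# day  7: S = 0.950
# day 14: S = 0.844
# day 21: S = 0.754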


Comparing Survival Curves

Log-Rank Test

Tests whether survival curves differ between groups.

$$H_0: S_1(t) = S_2(t) \text{ for all } t$$

How it works: At each event time, compares observed vs. expected events in each group. Sums across all times.

Interpretation:

  • Significant p-value → curves differ somewhere
  • Doesn't specify where or how (early vs. late divergence)

Other Tests

Test                 Best When
Log-rank             Proportional hazards (curves don't cross)
Wilcoxon (Breslow)   Early differences matter more
Tarone-Ware          Compromise between log-rank and Wilcoxon
Peto-Peto            Heavy censoring
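
Recent versions of lifelines expose these weighted variants through the `weightings` argument of `logrank_test` (check your installed version); a sketch on synthetic two-group data:

import numpy as np
from lifelines.statistics import logrank_test

rng = np.random.default_rng(0)
t1, t2 = rng.exponential(30, 200), rng.exponential(45, 200)
e1, t1 = (t1 <= 60).astype(int), np.minimum(t1, 60)  # censor at day 60
e2, t2 = (t2 <= 60).astype(int), np.minimum(t2, 60)

# None = standard log-rank; the strings follow the table above
for w in (None, "wilcoxon", "tarone-ware", "peto"):
    res = logrank_test(t1, t2, event_observed_A=e1, event_observed_B=e2,
                       weightings=w)
    print(w or "log-rank", f"p = {res.p_value:.4f}")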

Cox Proportional Hazards Regression

The Model

$$h(t|X) = h_0(t) \exp(\beta_1 X_1 + \beta_2 X_2 + ...)$$

Where:

  • $h(t|X)$ = hazard at time t given covariates X
  • $h_0(t)$ = baseline hazard (unspecified shape)
  • $\beta_j$ = log hazard ratios

Hazard Ratio Interpretation

$$HR = e^{\beta_j}$$

Interpretation: For a one-unit increase in $X_j$, the hazard is multiplied by HR.

  • HR = 1: No effect
  • HR > 1: Higher hazard → faster events (worse retention)
  • HR < 1: Lower hazard → slower events (better retention)

Example: If premium users have HR = 0.6 for churn:

  • Premium users have 40% lower hazard of churning at any given time
  • They churn more slowly
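
To see where the 40% comes from, invert the transform: $\beta = \ln(0.6) \approx -0.51$, and

$$HR = e^{-0.51} \approx 0.60$$

so the premium hazard is 0.6 times the non-premium hazard at every time point.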

The Proportional Hazards Assumption

The model assumes hazard ratios are constant over time.

Example (valid): Premium reduces hazard by 40% at day 7, day 30, day 90...

Example (violation): Premium reduces hazard by 40% early but no effect after day 90.

Check: Plot Schoenfeld residuals or log(-log(S(t))) curves.
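
lifelines exposes a Schoenfeld-residual check as a standalone function; a sketch, assuming the fitted `cph` and `model_data` from the complete example below:

from lifelines.statistics import proportional_hazard_test

# One row per covariate; a small p-value flags a likely PH violation
ph = proportional_hazard_test(cph, model_data, time_transform="rank")
print(ph.summary)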


Code: Complete Survival Analysis

Python

import numpy as np
import pandas as pd
from lifelines import KaplanMeierFitter, CoxPHFitter
from lifelines.statistics import logrank_test
import matplotlib.pyplot as plt


def generate_survival_data(n=500, seed=42):
    """Generate example survival data with censoring."""
    np.random.seed(seed)

    # Covariates
    premium = np.random.binomial(1, 0.3, n)
    tenure_months = np.random.exponential(12, n)
    engagement_score = np.random.normal(50, 15, n)

    # True hazard model: premium lowers hazard (HR = exp(-0.5) ≈ 0.61),
    # higher engagement lowers hazard, longer prior tenure raises it slightly
    log_hazard = -3 + (-0.5) * premium + (-0.02) * engagement_score + 0.01 * tenure_months
    hazard = np.exp(log_hazard)

    # Generate survival times (exponential baseline)
    survival_time = np.random.exponential(1/hazard)

    # Censoring (administrative at day 180)
    observed_time = np.minimum(survival_time, 180)
    event = (survival_time <= 180).astype(int)

    return pd.DataFrame({
        'time': observed_time,
        'event': event,
        'premium': premium,
        'tenure_months': tenure_months,
        'engagement_score': engagement_score
    })


def kaplan_meier_analysis(data, time_col='time', event_col='event',
                          group_col=None, figsize=(10, 6)):
    """
    Kaplan-Meier survival analysis.

    Parameters:
    -----------
    data : pd.DataFrame
        Dataset with time and event columns
    time_col : str
        Column name for time variable
    event_col : str
        Column name for event indicator (1=event, 0=censored)
    group_col : str, optional
        Column name for grouping variable

    Returns:
    --------
    dict with KM results and figure
    """
    fig, ax = plt.subplots(figsize=figsize)
    results = {}

    if group_col is None:
        # Single survival curve
        kmf = KaplanMeierFitter()
        kmf.fit(data[time_col], data[event_col], label='Overall')
        kmf.plot_survival_function(ax=ax, ci_show=True)

        results['kmf'] = kmf
        results['median_survival'] = kmf.median_survival_time_
        results['survival_at_30'] = kmf.survival_function_at_times(30).values[0]
        results['survival_at_90'] = kmf.survival_function_at_times(90).values[0]

    else:
        # Survival curves by group
        groups = data[group_col].unique()
        kmf_dict = {}

        for group in sorted(groups):
            subset = data[data[group_col] == group]
            kmf = KaplanMeierFitter()
            kmf.fit(subset[time_col], subset[event_col], label=f'{group_col}={group}')
            kmf.plot_survival_function(ax=ax, ci_show=True)
            kmf_dict[group] = kmf

        results['kmf_by_group'] = kmf_dict

        # Log-rank test
        if len(groups) == 2:
            g1, g2 = sorted(groups)
            lr_result = logrank_test(
                data[data[group_col] == g1][time_col],
                data[data[group_col] == g2][time_col],
                data[data[group_col] == g1][event_col],
                data[data[group_col] == g2][event_col]
            )
            results['logrank'] = {
                'test_statistic': lr_result.test_statistic,
                'p_value': lr_result.p_value
            }

    ax.set_xlabel('Time (days)')
    ax.set_ylabel('Survival Probability')
    ax.set_title('Kaplan-Meier Survival Curve')
    ax.legend(loc='lower left')
    ax.set_ylim(0, 1)

    results['figure'] = fig
    return results


def cox_regression_analysis(data, time_col='time', event_col='event',
                            covariates=None):
    """
    Cox proportional hazards regression.

    Parameters:
    -----------
    data : pd.DataFrame
        Dataset
    time_col : str
        Time variable
    event_col : str
        Event indicator
    covariates : list
        List of covariate column names

    Returns:
    --------
    dict with Cox model results
    """
    if covariates is None:
        covariates = [c for c in data.columns if c not in [time_col, event_col]]

    # Prepare data
    model_data = data[[time_col, event_col] + covariates].copy()

    # Fit Cox model
    cph = CoxPHFitter()
    cph.fit(model_data, duration_col=time_col, event_col=event_col)

    # Extract hazard ratios
    hr_df = pd.DataFrame({
        'Variable': cph.params_.index,
        'Coefficient': cph.params_.values,
        'Hazard Ratio': np.exp(cph.params_.values),
        'HR CI Lower': np.exp(cph.confidence_intervals_['95% lower-bound'].values),
        'HR CI Upper': np.exp(cph.confidence_intervals_['95% upper-bound'].values),
        'P-value': cph.summary['p'].values
    })

    # Check proportional hazards
    ph_test = cph.check_assumptions(model_data, show_plots=False, p_value_threshold=0.05)

    return {
        'model': cph,
        'hazard_ratios': hr_df,
        'summary': cph.summary,
        'concordance': cph.concordance_index_,
        'ph_assumption_test': ph_test
    }


# Example usage
if __name__ == "__main__":
    # Generate data
    data = generate_survival_data(n=500)

    print("Survival Analysis Example")
    print("=" * 60)
    print(f"Sample size: {len(data)}")
    print(f"Events observed: {data['event'].sum()}")
    print(f"Censoring rate: {1 - data['event'].mean():.1%}")

    # Kaplan-Meier overall
    print("\n" + "=" * 60)
    print("Kaplan-Meier (Overall)")
    km_results = kaplan_meier_analysis(data)
    print(f"Median survival time: {km_results['median_survival']:.1f} days")
    print(f"30-day survival: {km_results['survival_at_30']:.1%}")
    print(f"90-day survival: {km_results['survival_at_90']:.1%}")

    # Kaplan-Meier by premium
    print("\n" + "=" * 60)
    print("Kaplan-Meier (by Premium Status)")
    km_group = kaplan_meier_analysis(data, group_col='premium')
    print(f"Log-rank test p-value: {km_group['logrank']['p_value']:.4f}")

    # Cox regression
    print("\n" + "=" * 60)
    print("Cox Proportional Hazards Regression")
    cox_results = cox_regression_analysis(
        data,
        covariates=['premium', 'tenure_months', 'engagement_score']
    )
    print("\nHazard Ratios:")
    print(cox_results['hazard_ratios'].to_string(index=False))
    print(f"\nConcordance index: {cox_results['concordance']:.3f}")

    plt.show()

R

library(tidyverse)
library(survival)
library(survminer)


generate_survival_data <- function(n = 500, seed = 42) {
    #' Generate example survival data

    set.seed(seed)

    premium <- rbinom(n, 1, 0.3)
    tenure_months <- rexp(n, 1/12)
    engagement_score <- rnorm(n, 50, 15)

    # True hazard
    log_hazard <- -3 + (-0.5) * premium + (-0.02) * engagement_score + 0.01 * tenure_months
    hazard <- exp(log_hazard)

    # Survival times
    survival_time <- rexp(n, hazard)

    # Censoring at 180 days
    observed_time <- pmin(survival_time, 180)
    event <- as.integer(survival_time <= 180)

    tibble(
        time = observed_time,
        event = event,
        premium = premium,
        tenure_months = tenure_months,
        engagement_score = engagement_score
    )
}


# Generate data
data <- generate_survival_data(500)

cat("Survival Analysis Example\n")
cat(strrep("=", 60), "\n")
cat(sprintf("Sample size: %d\n", nrow(data)))
cat(sprintf("Events observed: %d\n", sum(data$event)))
cat(sprintf("Censoring rate: %.1f%%\n", (1 - mean(data$event)) * 100))

# Kaplan-Meier (overall)
cat("\n", strrep("=", 60), "\n")
cat("Kaplan-Meier (Overall)\n")
km_fit <- survfit(Surv(time, event) ~ 1, data = data)
print(summary(km_fit, times = c(30, 60, 90, 180)))

# Plot
ggsurvplot(km_fit, data = data,
           conf.int = TRUE,
           risk.table = TRUE,
           xlab = "Time (days)",
           ylab = "Survival Probability")

# Kaplan-Meier by premium
cat("\n", strrep("=", 60), "\n")
cat("Kaplan-Meier by Premium Status\n")
km_premium <- survfit(Surv(time, event) ~ premium, data = data)

ggsurvplot(km_premium, data = data,
           conf.int = TRUE,
           risk.table = TRUE,
           pval = TRUE,
           legend.labs = c("Non-Premium", "Premium"),
           xlab = "Time (days)")

# Log-rank test
cat("\nLog-rank test:\n")
lr_test <- survdiff(Surv(time, event) ~ premium, data = data)
print(lr_test)

# Cox regression
cat("\n", strrep("=", 60), "\n")
cat("Cox Proportional Hazards Regression\n")
cox_model <- coxph(Surv(time, event) ~ premium + tenure_months + engagement_score, data = data)
print(summary(cox_model))

# Hazard ratios with CI
hr_table <- tibble(
    Variable = names(coef(cox_model)),
    `Hazard Ratio` = exp(coef(cox_model)),
    `HR CI Lower` = exp(confint(cox_model))[, 1],
    `HR CI Upper` = exp(confint(cox_model))[, 2],
    `P-value` = summary(cox_model)$coefficients[, "Pr(>|z|)"]
)
print(hr_table)

# Check proportional hazards
cat("\nProportional Hazards Test:\n")
ph_test <- cox.zph(cox_model)
print(ph_test)

Product Analytics Applications

1. Retention Curves

Question: What's our D7, D30, D90 retention?

Approach: Kaplan-Meier with churn as event, signup as time origin.
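
For example, fixed-horizon retention reads straight off the KM curve (reusing the `data` frame and lifelines API from the code section above):

from lifelines import KaplanMeierFitter

# D7/D30/D90 retention = KM survival probability at those horizons
kmf = KaplanMeierFitter().fit(data["time"], data["event"])
print(kmf.survival_function_at_times([7, 30, 90]))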

2. Time to First Purchase

Question: How long until new users make their first purchase?

Approach: Survival analysis with purchase as event. Many users never purchase (heavy censoring)—that's fine.

3. A/B Test on Retention

Question: Does the new onboarding improve retention?

Approach: Kaplan-Meier by treatment group + log-rank test. Cox regression if you need to control for covariates.

4. Churn Risk Factors

Question: What predicts faster churn?

Approach: Cox regression with user characteristics as covariates. Hazard ratios show relative risk.

5. Feature Adoption

Question: How quickly do users adopt Feature X?

Approach: Time-to-adoption analysis. Users who never adopt are censored.


Common Pitfalls

Pitfall 1: Ignoring Censoring

Wrong: Calculate average time-to-churn only among churned users.

Right: Use Kaplan-Meier or Cox regression, which properly handles censored observations.

Pitfall 2: Treating Survival at Day X as a Proportion

Wrong: "50% of users are retained at day 30" calculated as (users active at day 30) / (total users)

Problem: Doesn't account for users who signed up recently and haven't had 30 days yet.

Right: Use Kaplan-Meier estimate at day 30, which handles varying observation periods.

Pitfall 3: Violating Proportional Hazards

Wrong: Use Cox regression when hazard ratios change over time.

Right: Check proportional hazards assumption. If violated, consider stratified Cox, time-varying coefficients, or parametric models.
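
If `premium` were the offending covariate, a sketch of the stratified fix (reusing `data` from the Python example above; `strata` is a standard CoxPHFitter argument):

from lifelines import CoxPHFitter

# Stratifying estimates a separate baseline hazard h0(t) per premium
# stratum, so premium's effect no longer needs to be proportional
cph_strat = CoxPHFitter()
cph_strat.fit(data, duration_col="time", event_col="event", strata=["premium"])
cph_strat.print_summary()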

Pitfall 4: Immortal Time Bias

Wrong: Compare survival of users who "adopted Feature X" vs. those who didn't, with adoption measured at any time.

Problem: Users must survive long enough to adopt the feature. Attributing their pre-adoption survival to the feature biases results.

Right: Use time-varying covariates (adoption status changes from 0 to 1 at adoption time) or landmark analysis.
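
A minimal sketch of the time-varying setup with lifelines' CoxTimeVaryingFitter (the data generation is illustrative, not from the article): each user contributes one row per interval, and `adopted` flips from 0 to 1 at adoption time, so pre-adoption survival is never credited to the feature.

import numpy as np
import pandas as pd
from lifelines import CoxTimeVaryingFitter

rng = np.random.default_rng(1)
rows = []
for uid in range(300):
    adopt_time = rng.exponential(30)   # hypothetical adoption time
    churn_time = rng.exponential(60)   # hypothetical churn time
    end = min(churn_time, 120)         # administrative censoring at day 120
    event = int(churn_time <= 120)
    if adopt_time < end:
        # Split the record at adoption: adopted flips 0 -> 1
        rows.append((uid, 0.0, adopt_time, 0, 0))
        rows.append((uid, adopt_time, end, 1, event))
    else:
        rows.append((uid, 0.0, end, 0, event))

long_df = pd.DataFrame(rows, columns=["user_id", "start", "stop", "adopted", "event"])

ctv = CoxTimeVaryingFitter()
ctv.fit(long_df, id_col="user_id", start_col="start", stop_col="stop", event_col="event")
ctv.print_summary()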


Key Takeaway

Survival analysis is the right tool whenever you're asking "how long until X?" with incomplete data. It correctly handles users who haven't experienced the event yet (censoring), enabling valid retention curves, time-to-event comparisons, and risk factor analysis. Kaplan-Meier gives you the survival curve, log-rank tests compare groups, and Cox regression quantifies risk factors. The key concepts—censoring, hazard, and the proportional hazards assumption—are fundamental to doing this right.


References

  1. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3059453/
  2. https://lifelines.readthedocs.io/
  3. https://cran.r-project.org/web/packages/survival/vignettes/survival.pdf
  4. Clark, T. G., Bradburn, M. J., Love, S. B., & Altman, D. G. (2003). Survival analysis part I: basic concepts and first analyses. *British Journal of Cancer*, 89(2), 232-238.
  5. Davidson-Pilon, C. (2019). lifelines: survival analysis in Python. *Journal of Open Source Software*, 4(40), 1317.
  6. Therneau, T. M., & Grambsch, P. M. (2000). Modeling survival data: extending the Cox model. Springer.

Frequently Asked Questions

Why can't I just use the average time to churn?
Because of censoring. Many users haven't churned yet—you don't know their eventual outcome. Excluding them biases your estimate low. Survival methods use the partial information from censored observations correctly.

What's the difference between survival and retention?
Survival is the general concept: the probability of NOT yet having the event (churning, converting, failing). Retention is survival with churn as the event. The complement is the cumulative event probability: S(t) + F(t) = 1, where F(t) = P(T ≤ t).

When should I use Kaplan-Meier vs. Cox regression?
Kaplan-Meier for visualization and simple group comparisons. Cox regression when you want to understand which factors affect survival, control for covariates, or quantify effect sizes as hazard ratios.

