Survival Analysis

Comparing Retention Curves Across Segments: Multiplicity and Visualization

A practical guide to comparing survival curves across multiple segments. Learn how to visualize multiple retention curves, handle multiple comparisons, and communicate segment differences clearly.


Quick Hits

  • Multiple pairwise comparisons inflate the false positive rate: 3 groups means 3 tests and roughly a 14% chance of at least one false positive
  • Use overall log-rank first, then pairwise with adjustment if significant
  • Bonferroni is conservative; Holm is better; consider FDR for exploration
  • Visualize carefully: too many curves are unreadable; use facets or highlight key comparisons
  • Report effect sizes (median difference, HR) not just p-values

TL;DR

Comparing retention curves across segments requires handling multiple comparisons properly. Testing 4 segments means 6 pairwise comparisons—without adjustment, you have a ~26% chance of at least one false positive. Use a global test first (multi-group log-rank), then pairwise comparisons with Bonferroni or Holm adjustment. For visualization, less is more: too many curves obscure patterns. Report effect sizes (median difference, hazard ratios) alongside p-values.


The Multiple Comparisons Problem

The Math

With k groups, the number of pairwise comparisons is: $$\binom{k}{2} = \frac{k(k-1)}{2}$$

Groups   Pairwise Tests   False Positive Rate (α=0.05)
2        1                5.0%
3        3                14.3%
4        6                26.5%
5        10               40.1%

Without adjustment, you're increasingly likely to find "significant" differences that aren't real.
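
A quick back-of-the-envelope check of the table above, in plain Python with no survival library needed:

from math import comb

alpha = 0.05
for k in range(2, 6):
    m = comb(k, 2)                   # number of pairwise comparisons
    fwer = 1 - (1 - alpha) ** m      # chance of at least one false positive
    print(f"{k} groups: {m} tests, family-wise false positive rate = {fwer:.1%}")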

Solutions

  1. Overall test first: Test whether any differences exist before pairwise
  2. P-value adjustment: Bonferroni, Holm, or FDR
  3. Pre-specification: Only test planned comparisons

The Two-Stage Approach

Stage 1: Overall Test

Test whether the k survival curves are all equal:

$$H_0: S_1(t) = S_2(t) = ... = S_k(t) \text{ for all } t$$

Use the multi-group log-rank test (k-1 degrees of freedom).

If p < 0.05: proceed to pairwise comparisons.
If p ≥ 0.05: stop; there is no evidence of differences.

Stage 2: Pairwise Comparisons with Adjustment

Compare pairs of interest with multiplicity adjustment.


P-Value Adjustment Methods

Bonferroni

Multiply p-values by number of tests (or divide α by number of tests).

$$p_{adjusted} = p_{raw} \times m$$

Pros: Simple, controls family-wise error rate
Cons: Conservative (may miss real effects)
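
For example, with m = 6 pairwise tests, a raw p-value of 0.03 becomes 6 × 0.03 = 0.18 after adjustment, no longer significant at α = 0.05.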

Holm (Step-Down)

More powerful than Bonferroni while still controlling family-wise error:

  1. Order p-values: $p_{(1)} \leq p_{(2)} \leq ... \leq p_{(m)}$
  2. Adjusted $p_{(i)} = \max_{j \leq i}((m - j + 1) \times p_{(j)})$

Pros: Uniformly more powerful than Bonferroni
Cons: Slightly more complex
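
For example, with three ordered raw p-values 0.004, 0.03, and 0.04 (m = 3), Holm gives 3 × 0.004 = 0.012, 2 × 0.03 = 0.06, and 1 × 0.04 = 0.04, which is raised to 0.06 to keep the adjusted values monotone: 0.012, 0.06, 0.06.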

Benjamini-Hochberg (FDR)

Controls false discovery rate instead of family-wise error:

  1. Order p-values
  2. Adjusted $p_{(i)} = \min_{j \geq i}(m/j \times p_{(j)})$

Pros: More discoveries, appropriate for exploration
Cons: Accepts some false positives
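
On the same three p-values, Benjamini-Hochberg gives 3/1 × 0.004 = 0.012, min(3/2 × 0.03, 3/3 × 0.04) = 0.04, and 3/3 × 0.04 = 0.04, so the adjusted values 0.012, 0.04, 0.04 are less conservative than Holm's.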

Recommendation

Goal                            Method
Confirmatory (control FWER)     Holm
Exploratory (allow some FP)     Benjamini-Hochberg
Simple/conservative             Bonferroni
All vs. control                 Dunnett

Code: Multiple Segment Comparison

Python

import numpy as np
import pandas as pd
from lifelines import KaplanMeierFitter, CoxPHFitter
from lifelines.statistics import logrank_test, multivariate_logrank_test
from itertools import combinations
import matplotlib.pyplot as plt


def compare_multiple_segments(data, time_col, event_col, segment_col,
                              adjustment='holm', alpha=0.05):
    """
    Compare survival curves across multiple segments with multiplicity adjustment.

    Parameters:
    -----------
    data : pd.DataFrame
        Dataset
    time_col : str
        Time variable
    event_col : str
        Event indicator
    segment_col : str
        Segment variable
    adjustment : str
        'bonferroni', 'holm', or 'fdr'
    alpha : float
        Significance level

    Returns:
    --------
    dict with overall test, pairwise tests, and adjusted p-values
    """
    results = {}
    segments = sorted(data[segment_col].unique())
    n_segments = len(segments)

    # Stage 1: Overall test
    overall = multivariate_logrank_test(
        data[time_col],
        data[segment_col],
        data[event_col]
    )

    results['overall'] = {
        'test_statistic': overall.test_statistic,
        'p_value': overall.p_value,
        'df': n_segments - 1,
        'significant': overall.p_value < alpha
    }

    # Stage 2: Pairwise comparisons
    pairs = list(combinations(segments, 2))
    n_comparisons = len(pairs)

    pairwise = []
    for seg1, seg2 in pairs:
        mask1 = data[segment_col] == seg1
        mask2 = data[segment_col] == seg2

        lr = logrank_test(
            data.loc[mask1, time_col],
            data.loc[mask2, time_col],
            data.loc[mask1, event_col],
            data.loc[mask2, event_col]
        )

        pairwise.append({
            'segment1': seg1,
            'segment2': seg2,
            'test_statistic': lr.test_statistic,
            'p_value': lr.p_value
        })

    pairwise_df = pd.DataFrame(pairwise)

    # Adjust p-values
    raw_p = pairwise_df['p_value'].values

    if adjustment == 'bonferroni':
        adj_p = np.minimum(raw_p * n_comparisons, 1.0)

    elif adjustment == 'holm':
        # Sort p-values
        order = np.argsort(raw_p)
        adj_p = np.zeros(n_comparisons)
        for i, idx in enumerate(order):
            multiplier = n_comparisons - i
            adj_p[idx] = min(raw_p[idx] * multiplier, 1.0)
        # Enforce monotonicity
        sorted_adj = adj_p[order]
        for i in range(1, len(sorted_adj)):
            sorted_adj[i] = max(sorted_adj[i], sorted_adj[i-1])
        adj_p[order] = sorted_adj

    elif adjustment == 'fdr':
        # Benjamini-Hochberg step-up procedure
        order = np.argsort(raw_p)
        ranks = np.empty_like(order)
        ranks[order] = np.arange(1, n_comparisons + 1)
        adj_p = np.minimum(raw_p * n_comparisons / ranks, 1.0)
        # Enforce monotonicity (descending in sorted order)
        sorted_adj = adj_p[order[::-1]]
        for i in range(1, len(sorted_adj)):
            sorted_adj[i] = min(sorted_adj[i], sorted_adj[i-1])
        adj_p[order[::-1]] = sorted_adj

    else:
        raise ValueError(f"adjustment must be 'bonferroni', 'holm', or 'fdr', got '{adjustment}'")

    pairwise_df['p_adjusted'] = adj_p
    pairwise_df['significant'] = pairwise_df['p_adjusted'] < alpha

    results['pairwise'] = pairwise_df
    results['n_comparisons'] = n_comparisons
    results['adjustment'] = adjustment

    return results


def plot_segment_curves(data, time_col, event_col, segment_col,
                        title='Retention by Segment', figsize=(10, 6),
                        highlight=None):
    """
    Plot Kaplan-Meier curves for multiple segments.

    Parameters:
    -----------
    highlight : list, optional
        Segments to highlight (others will be grayed)
    """
    fig, ax = plt.subplots(figsize=figsize)

    segments = sorted(data[segment_col].unique())
    colors = plt.cm.tab10(np.linspace(0, 1, len(segments)))

    for i, segment in enumerate(segments):
        mask = data[segment_col] == segment
        kmf = KaplanMeierFitter()
        kmf.fit(data.loc[mask, time_col], data.loc[mask, event_col],
                label=str(segment))

        # Determine style
        if highlight is None:
            alpha = 0.8
            linewidth = 2
            color = colors[i]
        elif segment in highlight:
            alpha = 0.9
            linewidth = 2.5
            color = colors[i]
        else:
            alpha = 0.3
            linewidth = 1
            color = 'gray'

        kmf.plot_survival_function(ax=ax, ci_show=False,
                                   color=color, alpha=alpha,
                                   linewidth=linewidth)

    ax.set_xlabel('Time')
    ax.set_ylabel('Survival Probability')
    ax.set_title(title)
    ax.legend(title=segment_col, loc='lower left')
    ax.set_ylim(0, 1)

    return fig


def create_segment_comparison_table(data, time_col, event_col, segment_col,
                                    reference=None):
    """
    Create summary table comparing segments.
    """
    segments = sorted(data[segment_col].unique())
    if reference is None:
        reference = segments[0]

    rows = []
    ref_kmf = KaplanMeierFitter()
    ref_mask = data[segment_col] == reference
    ref_kmf.fit(data.loc[ref_mask, time_col], data.loc[ref_mask, event_col])

    for segment in segments:
        mask = data[segment_col] == segment
        kmf = KaplanMeierFitter()
        kmf.fit(data.loc[mask, time_col], data.loc[mask, event_col])

        # Median survival
        median = kmf.median_survival_time_

        # Survival at key times
        s30 = kmf.survival_function_at_times(30).values[0] if 30 <= data[time_col].max() else None
        s60 = kmf.survival_function_at_times(60).values[0] if 60 <= data[time_col].max() else None
        s90 = kmf.survival_function_at_times(90).values[0] if 90 <= data[time_col].max() else None

        rows.append({
            'Segment': segment,
            'N': mask.sum(),
            'Events': data.loc[mask, event_col].sum(),
            'Median': median,
            'D30': f"{s30:.1%}" if s30 else None,
            'D60': f"{s60:.1%}" if s60 else None,
            'D90': f"{s90:.1%}" if s90 else None
        })

    return pd.DataFrame(rows)


# Example
if __name__ == "__main__":
    np.random.seed(42)
    n = 800

    # Generate data with 4 segments
    segments = ['Basic', 'Plus', 'Premium', 'Enterprise']
    hazards = {'Basic': 0.03, 'Plus': 0.02, 'Premium': 0.015, 'Enterprise': 0.01}

    data = pd.DataFrame({
        'segment': np.random.choice(segments, n),
    })
    data['time'] = data['segment'].map(lambda s: np.random.exponential(1/hazards[s]))
    data['time'] = np.minimum(data['time'], 180)
    data['event'] = np.random.binomial(1, 0.7, n)

    # Overall and pairwise tests
    results = compare_multiple_segments(data, 'time', 'event', 'segment',
                                        adjustment='holm')

    print("Segment Comparison Results")
    print("=" * 60)
    print(f"\nOverall Log-Rank Test:")
    print(f"  χ² = {results['overall']['test_statistic']:.2f}, "
          f"df = {results['overall']['df']}, "
          f"p = {results['overall']['p_value']:.4f}")

    print(f"\nPairwise Comparisons ({results['n_comparisons']} tests, {results['adjustment']} adjustment):")
    print(results['pairwise'][['segment1', 'segment2', 'p_value', 'p_adjusted', 'significant']]
          .to_string(index=False))

    # Summary table
    print("\n" + "=" * 60)
    print("Segment Summary:")
    print(create_segment_comparison_table(data, 'time', 'event', 'segment').to_string(index=False))

    # Plot
    fig = plot_segment_curves(data, 'time', 'event', 'segment')
    plt.tight_layout()
    plt.show()

R

library(tidyverse)
library(survival)
library(survminer)


compare_multiple_segments <- function(data, time_col, event_col, segment_col,
                                       adjustment = "holm") {
    #' Compare survival curves with multiplicity adjustment

    # Build the formula from column names so it can be re-evaluated
    # on subsets of the data in the pairwise tests below
    formula <- as.formula(paste0("Surv(", time_col, ", ", event_col, ") ~ ", segment_col))

    # Overall test
    overall_test <- survdiff(formula, data = data)
    overall_p <- pchisq(overall_test$chisq, df = length(overall_test$n) - 1,
                        lower.tail = FALSE)

    # Pairwise tests
    segments <- sort(unique(data[[segment_col]]))
    pairs <- combn(segments, 2, simplify = FALSE)

    pairwise <- map_dfr(pairs, function(pair) {
        subset <- data[data[[segment_col]] %in% pair, ]
        test <- survdiff(formula, data = subset)
        p <- pchisq(test$chisq, df = 1, lower.tail = FALSE)
        tibble(
            segment1 = pair[1],
            segment2 = pair[2],
            p_value = p
        )
    })

    # Adjust p-values
    pairwise$p_adjusted <- p.adjust(pairwise$p_value, method = adjustment)
    pairwise$significant <- pairwise$p_adjusted < 0.05

    list(
        overall = list(
            chisq = overall_test$chisq,
            df = length(overall_test$n) - 1,
            p_value = overall_p
        ),
        pairwise = pairwise,
        adjustment = adjustment
    )
}


# Example
set.seed(42)
n <- 800

data <- tibble(
    segment = sample(c("Basic", "Plus", "Premium", "Enterprise"), n, replace = TRUE)
) %>%
    mutate(
        hazard = case_when(
            segment == "Basic" ~ 0.03,
            segment == "Plus" ~ 0.02,
            segment == "Premium" ~ 0.015,
            segment == "Enterprise" ~ 0.01
        ),
        time = rexp(n, hazard),
        time = pmin(time, 180),
        event = rbinom(n, 1, 0.7)
    )

# Compare
results <- compare_multiple_segments(data, "time", "event", "segment")

cat("Overall Test:\n")
cat(sprintf("χ² = %.2f, df = %d, p = %.4f\n",
            results$overall$chisq, results$overall$df, results$overall$p_value))

cat("\nPairwise Comparisons (Holm adjusted):\n")
print(results$pairwise)

# Plot
fit <- survfit(Surv(time, event) ~ segment, data = data)

ggsurvplot(
    fit,
    data = data,
    pval = TRUE,
    risk.table = TRUE,
    palette = "jco"
)

Visualization Best Practices

Rule 1: Limit the Number of Curves

More than 4-5 curves on one plot become hard to distinguish.

Solutions:

  • Group similar segments
  • Use faceted plots (see the sketch after this list)
  • Highlight key comparisons, gray out others
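
If faceting helps, here is a rough sketch of a faceted version of the plotting helper shown earlier; the function name plot_segment_facets and its layout defaults are ours, but it follows the same data/time/event/segment column conventions as the Python code above.

import numpy as np
import matplotlib.pyplot as plt
from lifelines import KaplanMeierFitter


def plot_segment_facets(data, time_col, event_col, segment_col, ncols=3):
    """Plot one small Kaplan-Meier panel per segment."""
    segments = sorted(data[segment_col].unique())
    nrows = int(np.ceil(len(segments) / ncols))
    fig, axes = plt.subplots(nrows, ncols, figsize=(4 * ncols, 3 * nrows),
                             sharex=True, sharey=True, squeeze=False)

    for ax, segment in zip(axes.ravel(), segments):
        mask = data[segment_col] == segment
        kmf = KaplanMeierFitter()
        kmf.fit(data.loc[mask, time_col], data.loc[mask, event_col],
                label=str(segment))
        kmf.plot_survival_function(ax=ax, ci_show=True)
        ax.set_title(str(segment))
        ax.set_ylim(0, 1)

    # Hide any unused panels
    for ax in axes.ravel()[len(segments):]:
        ax.set_visible(False)

    return fig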

Rule 2: Use Clear Color Coding

  • Consistent colors across all visualizations
  • Color-blind friendly palettes (avoid red-green)
  • Consider using line styles for additional differentiation

Rule 3: Add Context

Include:

  • Sample sizes per group
  • Number of events
  • Risk table (at-risk numbers over time; see the sketch after this list)
  • P-value from formal test
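
With lifelines, at-risk counts can be appended under the plot via add_at_risk_counts. A minimal sketch, assuming the `data` frame with time/event/segment columns from the worked example above:

import matplotlib.pyplot as plt
from lifelines import KaplanMeierFitter
from lifelines.plotting import add_at_risk_counts

fig, ax = plt.subplots(figsize=(10, 6))
fitters = []
for segment in sorted(data['segment'].unique()):
    mask = data['segment'] == segment
    kmf = KaplanMeierFitter()
    kmf.fit(data.loc[mask, 'time'], data.loc[mask, 'event'], label=str(segment))
    kmf.plot_survival_function(ax=ax, ci_show=False)
    fitters.append(kmf)

# Adds the at-risk table beneath the axes
add_at_risk_counts(*fitters, ax=ax)
plt.tight_layout()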

Rule 4: Focus on Practical Significance

Visual separation doesn't equal practical significance:

  • Annotate key time points (D30, D90)
  • Show confidence intervals
  • Note absolute differences, not just visual separation

Common Mistakes

Mistake 1: Testing All Pairs Without Overall Test

Wrong: Jump straight to 6 pairwise comparisons for 4 groups.

Right: Do overall test first. If not significant, stop.

Mistake 2: Using Unadjusted P-Values

Wrong: "Segment A vs. B: p=0.04" when you ran 10 tests.

Right: "Segment A vs. B: adjusted p=0.20 (Holm correction for 10 tests)"

Mistake 3: Over-Cluttered Visualizations

Wrong: 8 curves with confidence bands in one plot.

Right: Faceted plots, or highlight 2-3 key comparisons.

Mistake 4: Ignoring Effect Size

Wrong: "There's a significant difference (p=0.02)."

Right: "Premium users have 15 days longer median retention (72 vs 57 days, p=0.02)."



Key Takeaway

Comparing retention across multiple segments requires statistical care and visualization thoughtfulness. Use a two-stage approach: overall test first, then adjusted pairwise tests. Holm is generally preferred over Bonferroni for power. Keep visualizations simple—highlight key comparisons rather than plotting everything. Always report effect sizes (median differences, hazard ratios) alongside p-values, and remember that statistical significance doesn't equal practical importance.


References

  1. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5506159/
  2. https://onlinelibrary.wiley.com/doi/10.1002/sim.5451
  3. https://www.nature.com/articles/s41586-019-1657-2
  4. Royston, P., & Parmar, M. K. (2011). The use of restricted mean survival time to estimate the treatment effect in randomized clinical trials when the proportional hazards assumption is in doubt. *Statistics in Medicine*, 30(19), 2409-2421.
  5. Simes, R. J. (1986). An improved Bonferroni procedure for multiple tests of significance. *Biometrika*, 73(3), 751-754.
  6. Benjamini, Y., & Hochberg, Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. *Journal of the Royal Statistical Society: Series B*, 57(1), 289-300.

Frequently Asked Questions

How many segments can I reasonably compare?
Statistically, you can compare any number with proper multiplicity adjustment. Practically, more than 4-5 curves become hard to visualize and interpret. Consider grouping similar segments or using Cox regression with segment as a covariate for many groups.
Do I need to adjust for multiple comparisons if I only care about one segment vs. control?
If you pre-specified that comparison, no adjustment needed. If you're testing all segments against control, that's still multiple comparisons. Dunnett's test is designed for comparing all groups to a single control and is less conservative than Bonferroni.
Should I adjust p-values or use a different alpha threshold?
Either works mathematically. Adjusting alpha (e.g., 0.05/3 = 0.017) or adjusting p-values (multiply by 3) gives the same decisions. Adjusted p-values are often easier to communicate: 'Adjusted p = 0.04' is clearer than 'p = 0.013 against threshold of 0.017.'

