Log-Rank Test: When It's Appropriate and Common Misuses

A practical guide to the log-rank test for comparing survival curves. Learn when it works, when it fails, and better alternatives when proportional hazards don't hold.

Jan 26 · 8 min read · Survival Analysis

Quick Hits

• Log-rank tests whether survival curves differ, not where or by how much
• Most powerful under proportional hazards; crossing curves are a warning sign
• Weights events equally across time (late events count the same as early ones)
• Use Wilcoxon/Breslow if early differences matter more
• Significant ≠ meaningful: always report effect sizes (hazard ratios or median differences)

TL;DR

The log-rank test compares survival curves between groups, answering: "Do these curves differ significantly?" It is most powerful under proportional hazards, where curves do not cross, and it can mislead when hazards are not proportional. Always report effect sizes alongside p-values: a significant log-rank result tells you the curves differ, but not by how much. Consider alternatives (Wilcoxon, RMST) when curves cross or early events matter more.

What the Log-Rank Test Does

The Hypothesis

$H_0: S_1(t) = S_2(t) \text{ for all } t$

In words: The survival (retention) curves are identical at every time point.

How It Works

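At each event time:

• Count observed events in each group
• Calculate expected events in each group if the null were true (total events × that group's share of the subjects at risk)
• Compare observed vs. expected

Sum the observed-minus-expected differences across all event times, square the sum, and divide by its variance → chi-square statistic (df = number of groups − 1)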
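The steps above can be sketched in plain Python for the two-group case. This is a minimal illustration rather than a replacement for a tested library routine: the function name `logrank_by_hand` and the toy data are invented for this example, and the variance term is the standard hypergeometric variance used by the two-group log-rank test.

```python
import math


def logrank_by_hand(times, events, groups):
    """Two-group log-rank test built from observed vs. expected events.

    times  : follow-up times
    events : 1 = event, 0 = censored
    groups : exactly two distinct labels
    Returns (chi_square, p_value) with 1 degree of freedom.
    """
    labels = sorted(set(groups))
    if len(labels) != 2:
        raise ValueError("expected exactly two groups")
    rows = list(zip(times, events, groups))

    observed = expected = variance = 0.0
    for t in sorted({t for t, e, _ in rows if e == 1}):
        n = sum(1 for ti, _, _ in rows if ti >= t)                      # at risk, both groups
        n1 = sum(1 for ti, _, g in rows if ti >= t and g == labels[1])  # at risk, second group
        d = sum(1 for ti, e, _ in rows if ti == t and e == 1)           # events, both groups
        d1 = sum(1 for ti, e, g in rows
                 if ti == t and e == 1 and g == labels[1])              # events, second group
        observed += d1
        expected += d * n1 / n
        if n > 1:
            variance += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)

    if variance == 0:
        return float("nan"), float("nan")
    chi2 = (observed - expected) ** 2 / variance
    # Survival function of a chi-square with 1 df, written with erfc
    return chi2, math.erfc(math.sqrt(chi2 / 2))


# Toy data: group "A" fails early, group "B" fails late, no censoring
chi2, p = logrank_by_hand([1, 2, 3, 4], [1, 1, 1, 1], ["A", "A", "B", "B"])
print(f"chi2 = {chi2:.3f}, p = {p:.3f}")
```

With only four subjects the chi-square approximation is not trustworthy; the toy data is there to make the arithmetic easy to follow by hand.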

What It Tells You

Significant p-value: Curves differ somewhere
Non-significant p-value: Can't distinguish curves from chance

What It DOESN'T Tell You

How much the curves differ
Where they differ (early vs. late)
Clinical/business significance

When Log-Rank Works Best

Ideal Conditions

Proportional hazards: The ratio of hazard rates is constant over time
- If treatment halves the hazard at day 7, it halves it at day 30 too
- Visually: curves don't cross (crossing rules out proportional hazards, but non-crossing curves alone don't prove it)
Censoring is non-informative: Censoring doesn't depend on the outcome
- Users who were censored would have had similar survival to those who remained under observation
You want equal weight across time: Early and late events matter equally
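One rough way to probe the "constant hazard ratio" condition is to estimate the hazard ratio separately in an early and a late window and see whether the two agree. The sketch below is an informal check, not a formal test: the function name `piecewise_hazard_ratio`, the cut point, and the toy data are all assumptions for illustration, and each hazard is approximated as events divided by person-time within its window.

```python
def piecewise_hazard_ratio(times, events, groups, cut, treated, control):
    """Crude hazard ratio (treated / control) before and after `cut`.

    Within each window the hazard is estimated as events / person-time,
    which treats the hazard as roughly constant inside that window.
    """
    def rate(label, start, end):
        n_events, person_time = 0, 0.0
        for t, e, g in zip(times, events, groups):
            if g != label or t <= start:
                continue  # other group, or no longer at risk in this window
            person_time += min(t, end) - start
            if e == 1 and t <= end:
                n_events += 1
        return n_events / person_time if person_time > 0 else float("nan")

    result = {}
    for name, start, end in [("early", 0.0, cut), ("late", cut, float("inf"))]:
        control_rate = rate(control, start, end)
        treated_rate = rate(treated, start, end)
        result[name] = treated_rate / control_rate if control_rate > 0 else float("nan")
    return result


# Toy data, all events observed (no censoring)
times = [2, 4, 12, 14, 4, 8, 16, 24]
events = [1] * 8
groups = ["C"] * 4 + ["T"] * 4
print(piecewise_hazard_ratio(times, events, groups, cut=10, treated="T", control="C"))
```

On this toy data the early ratio is about 0.81 and the late ratio is 0.30, the kind of gap that suggests the treatment effect is not constant over time. With real data, both ratios carry sampling noise, so treat a difference as a prompt for a closer look rather than proof.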

Visual Check: Proportional Hazards

PROPORTIONAL HAZARDS (OK):       NON-PROPORTIONAL (PROBLEM):

1.0|---\                         1.0|----\
   |    \                           |     \
0.5|     \----                   0.5|      \----/----
   |          \----                 |    /---
0.0+---------------              0.0+---------------
   0   30   60   90                 0   30   60   90

Curves maintain relative            Curves cross - one group
distance over time                  better early, other better late

Common Misuses

Misuse 1: Ignoring Crossing Curves

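Problem: When curves cross, one group does better early and the other does better late. The early and late observed-minus-expected differences then have opposite signs and partly cancel in the log-rank sum, so the test may show no significant difference even though the curves clearly differ.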

Example: New treatment extends survival but has high early mortality. Curves cross at day 30. Log-rank $p = 0.15$ (not significant), but there are clearly different effects at different times.

Solution: Use alternative methods (see below) or report survival at specific time points.

Misuse 2: Reporting Only the P-Value

Bad: "Log-rank test: $p = 0.03$, so treatment works."

Better: "Treatment improved median survival from 45 days to 72 days (HR = 0.68, 95% CI: 0.48–0.96, log-rank $p = 0.03$)."

Misuse 3: Using Log-Rank for Paired/Matched Data

Problem: Log-rank assumes independent groups. Paired data (matched patients, crossover designs) needs different methods.

Solution: Use stratified log-rank or matched-pairs survival methods.

Misuse 4: Multiple Groups Without Adjustment

Problem: Testing A vs. B, A vs. C, B vs. C separately inflates Type I error.

Solution: Run the overall k-group log-rank test first. If it is significant, follow up with pairwise comparisons and adjust the p-values for multiplicity (e.g., Holm or Bonferroni).
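As one possible multiplicity adjustment, the Holm step-down procedure can be applied to the pairwise p-values. The sketch below assumes the pairwise log-rank p-values have already been computed; the helper name `holm_adjust` and the three p-values are hypothetical.

```python
def holm_adjust(p_values):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, smallest p first
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, candidate)  # keep adjusted values monotone
        adjusted[idx] = running_max
    return adjusted


# Hypothetical pairwise log-rank p-values for three groups
pairwise = {"A vs B": 0.01, "A vs C": 0.04, "B vs C": 0.03}
adjusted = dict(zip(pairwise, holm_adjust(list(pairwise.values()))))
for pair, p_adj in adjusted.items():
    print(f"{pair}: adjusted p = {p_adj:.3f}")
```

Here only "A vs B" stays below 0.05 after adjustment (0.03), while the other two comparisons both move to 0.06.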

Misuse 5: Too Much Focus on Statistical Significance

Problem: With large samples, tiny differences become significant. With small samples, large differences may not be.

Solution: Always report effect sizes. A 1-day difference in median survival could be significant with n=100,000 but meaningless in practice.
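The sample-size effect can be made concrete with a common approximation for a 1:1 two-group comparison, in which the variance of the log hazard ratio is roughly 4 divided by the total number of events. The sketch below relies on that approximation; the function names and the example numbers are chosen purely for illustration.

```python
import math


def approx_logrank_z(hazard_ratio, n_events):
    """Approximate |z| for a 1:1 two-group log-rank test.

    Assumes Var(log HR) is about 4 / n_events, a standard large-sample
    approximation under equal allocation.
    """
    return abs(math.log(hazard_ratio)) * math.sqrt(n_events) / 2


def two_sided_p(z):
    """Two-sided p-value for a standard normal z statistic."""
    return math.erfc(abs(z) / math.sqrt(2))


# A tiny effect with a huge number of events vs. a large effect with few events
p_tiny_effect = two_sided_p(approx_logrank_z(0.98, 100_000))
p_large_effect = two_sided_p(approx_logrank_z(0.60, 20))
print(f"HR 0.98, 100,000 events: p = {p_tiny_effect:.4f}")
print(f"HR 0.60, 20 events:      p = {p_large_effect:.4f}")
```

Under this approximation the 2% hazard reduction comes out significant at the 0.05 level while the 40% reduction does not, which is why the p-value needs an effect size next to it.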

Code: Log-Rank Test

Python

import numpy as np
import pandas as pd
from lifelines import KaplanMeierFitter
from lifelines.statistics import logrank_test, multivariate_logrank_test
import matplotlib.pyplot as plt


def compare_survival_curves(data, time_col, event_col, group_col, figsize=(10, 6)):
    """
    Compare survival curves with log-rank test and visual inspection.

    Parameters:
    -----------
    data : pd.DataFrame
        Dataset
    time_col : str
        Time variable
    event_col : str
        Event indicator (1=event, 0=censored)
    group_col : str
        Grouping variable

    Returns:
    --------
    dict with test results and plot
    """
    groups = sorted(data[group_col].unique())
    results = {}

    # Fit KM curves for each group
    fig, ax = plt.subplots(figsize=figsize)
    kmf_dict = {}

    for group in groups:
        mask = data[group_col] == group
        kmf = KaplanMeierFitter()
        kmf.fit(
            data.loc[mask, time_col],
            data.loc[mask, event_col],
            label=f'{group_col}={group}'
        )
        kmf.plot_survival_function(ax=ax, ci_show=True)
        kmf_dict[group] = kmf

    ax.set_xlabel('Time')
    ax.set_ylabel('Survival Probability')
    ax.set_title(f'Survival Curves by {group_col}')
    ax.legend(loc='lower left')

    results['kmf_by_group'] = kmf_dict
    results['figure'] = fig

    # Log-rank test
    if len(groups) == 2:
        g1, g2 = groups
        lr = logrank_test(
            data.loc[data[group_col] == g1, time_col],
            data.loc[data[group_col] == g2, time_col],
            data.loc[data[group_col] == g1, event_col],
            data.loc[data[group_col] == g2, event_col]
        )
        results['logrank'] = {
            'test_statistic': lr.test_statistic,
            'p_value': lr.p_value,
            'df': 1
        }

        # Median survival difference
        # (lifelines reports an unreached median as inf, not NaN)
        med1 = kmf_dict[g1].median_survival_time_
        med2 = kmf_dict[g2].median_survival_time_
        both_reached = np.isfinite(med1) and np.isfinite(med2)
        results['median_diff'] = {
            f'{g1}': med1,
            f'{g2}': med2,
            'difference': med2 - med1 if both_reached else None
        }

    else:
        # Multiple groups
        mlr = multivariate_logrank_test(
            data[time_col],
            data[group_col],
            data[event_col]
        )
        results['logrank'] = {
            'test_statistic': mlr.test_statistic,
            'p_value': mlr.p_value,
            'df': len(groups) - 1
        }

    # Flag crossing curves (a rough visual proxy for non-proportional hazards)
    results['crossing_warning'] = check_curves_crossing(data, time_col, event_col, group_col)

    return results


def check_curves_crossing(data, time_col, event_col, group_col):
    """
    Check if survival curves cross (violating proportional hazards).
    """
    groups = sorted(data[group_col].unique())
    if len(groups) != 2:
        return "Multiple groups - check visually"

    g1, g2 = groups

    # Fit KM curves
    kmf1 = KaplanMeierFitter()
    kmf2 = KaplanMeierFitter()

    mask1 = data[group_col] == g1
    mask2 = data[group_col] == g2

    kmf1.fit(data.loc[mask1, time_col], data.loc[mask1, event_col])
    kmf2.fit(data.loc[mask2, time_col], data.loc[mask2, event_col])

    # Get survival at common time points
    times = np.unique(data[time_col])
    times = times[times > 0]

    s1 = kmf1.survival_function_at_times(times).values
    s2 = kmf2.survival_function_at_times(times).values

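    # Check for crossings; drop ties (diff == 0, e.g. both curves at 1.0 early on)
    # so that leaving or entering a tie is not counted as a crossing
    signs = np.sign(s1 - s2)
    signs = signs[signs != 0]
    sign_changes = int(np.sum(np.diff(signs) != 0))

    if sign_changes > 0:
        return f"⚠️ Curves cross {sign_changes} time(s). Log-rank may be unreliable."
    else:
        return "✓ No crossing detected. This is consistent with proportional hazards but does not prove it."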


def report_logrank_result(results, group1_name, group2_name):
    """
    Generate stakeholder-friendly report of log-rank results.
    """
    lr = results['logrank']
    med = results.get('median_diff', {})

    report = []
    report.append("Survival Comparison Results")
    report.append("=" * 50)

    # Median survival
    if med:
        report.append(f"\nMedian Survival Time:")
        report.append(f"  {group1_name}: {med.get(group1_name, 'Not reached')}")
        report.append(f"  {group2_name}: {med.get(group2_name, 'Not reached')}")
        if med.get('difference') is not None:
            report.append(f"  Difference: {med['difference']:.1f} days")

    # Log-rank test
    report.append(f"\nLog-Rank Test:")
    report.append(f"  χ² = {lr['test_statistic']:.2f}, df = {lr['df']}, p = {lr['p_value']:.4f}")

    if lr['p_value'] < 0.05:
        report.append(f"\n  → Survival curves are significantly different (p < 0.05)")
    else:
        report.append(f"\n  → No significant difference detected (p ≥ 0.05)")

    # Warning
    if 'crossing_warning' in results:
        report.append(f"\n{results['crossing_warning']}")

    return "\n".join(report)


# Example usage
if __name__ == "__main__":
    np.random.seed(42)
    n = 300

    # Generate two groups with different survival
    group = np.array(['Control'] * (n//2) + ['Treatment'] * (n//2))
    hazard = np.where(group == 'Control', 0.02, 0.015)  # Treatment has lower hazard
    survival_time = np.random.exponential(1/hazard)

    # Censoring at random times
    censor_time = np.random.exponential(100, n)
    observed_time = np.minimum(survival_time, censor_time)
    event = (survival_time <= censor_time).astype(int)

    data = pd.DataFrame({
        'time': observed_time,
        'event': event,
        'group': group
    })

    # Compare
    results = compare_survival_curves(data, 'time', 'event', 'group')

    # Report
    print(report_logrank_result(results, 'Control', 'Treatment'))

    plt.show()

R

library(tidyverse)
library(survival)
library(survminer)


compare_survival_curves <- function(data, time_col, event_col, group_col) {
    #' Compare survival curves with log-rank test

    # Create survival object
    formula <- as.formula(sprintf("Surv(%s, %s) ~ %s", time_col, event_col, group_col))

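    # Fit KM
    km_fit <- survfit(formula, data = data)
    # Store the evaluated formula in the call so survminer helpers
    # (e.g. ggsurvplot with pval = TRUE) can recover it outside this function
    km_fit$call$formula <- formula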

    # Log-rank test
    lr_test <- survdiff(formula, data = data)

    # P-value
    p_value <- pchisq(lr_test$chisq, df = length(lr_test$n) - 1, lower.tail = FALSE)

    # Median survival by group
    med_surv <- surv_median(km_fit)

    list(
        fit = km_fit,
        logrank = list(
            chisq = lr_test$chisq,
            df = length(lr_test$n) - 1,
            p_value = p_value
        ),
        median_survival = med_surv
    )
}


# Example
set.seed(42)
n <- 300

data <- tibble(
    group = c(rep("Control", n/2), rep("Treatment", n/2)),
    hazard = ifelse(group == "Control", 0.02, 0.015)
) %>%
    mutate(
        survival_time = rexp(n, hazard),
        censor_time = rexp(n, 0.01),
        time = pmin(survival_time, censor_time),
        event = as.integer(survival_time <= censor_time)
    )

results <- compare_survival_curves(data, "time", "event", "group")

cat("Log-Rank Test Results\n")
cat(strrep("=", 50), "\n")
cat(sprintf("χ² = %.2f, df = %d, p = %.4f\n",
            results$logrank$chisq, results$logrank$df, results$logrank$p_value))

cat("\nMedian Survival by Group:\n")
print(results$median_survival)

# Plot with p-value
ggsurvplot(
    results$fit,
    data = data,
    pval = TRUE,
    risk.table = TRUE,
    legend.labs = c("Control", "Treatment")
)

Alternatives to Log-Rank

Wilcoxon (Breslow) Test

Weights early events more heavily.

Use when:

Early survival is more important
You expect effects to wear off over time

from lifelines.statistics import logrank_test

# T1, T2: durations and E1, E2: event indicators (1=event, 0=censored) for the two groups
# Wilcoxon weighting (Breslow)
result = logrank_test(T1, T2, E1, E2, weightings='wilcoxon')

Peto-Peto Test

More robust to heavy censoring.

Use when:

High censoring rates
Censoring patterns differ between groups

Restricted Mean Survival Time (RMST)

Compares area under survival curves up to a time horizon.

Use when:

Curves cross
You want an interpretable effect measure (days gained)
Non-proportional hazards

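# RMST in lifelines (the helper lives in lifelines.utils)
from lifelines.utils import restricted_mean_survival_time

# kmf1, kmf2: KaplanMeierFitter objects already fitted to each group
rmst1 = restricted_mean_survival_time(kmf1, t=180)
rmst2 = restricted_mean_survival_time(kmf2, t=180)
rmst_diff = rmst2 - rmst1  # Days gained with treatment, up to day 180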
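To show what the RMST number represents, the area under a Kaplan-Meier step function can be computed directly. The following is a dependency-free sketch under simple assumptions: `km_steps` and `rmst` are hypothetical helper names, the data is invented, and the horizon `tau` should lie within the observed follow-up, since the curve is simply held flat after the last event.

```python
def km_steps(times, events):
    """Kaplan-Meier estimate as a list of (event_time, survival_after_that_time)."""
    rows = list(zip(times, events))
    steps, surv = [], 1.0
    for t in sorted({t for t, e in rows if e == 1}):
        at_risk = sum(1 for ti, _ in rows if ti >= t)
        deaths = sum(1 for ti, e in rows if ti == t and e == 1)
        surv *= 1 - deaths / at_risk
        steps.append((t, surv))
    return steps


def rmst(times, events, tau):
    """Restricted mean survival time: area under the KM curve from 0 to tau."""
    area, prev_t, prev_s = 0.0, 0.0, 1.0
    for t, s in km_steps(times, events):
        if t >= tau:
            break
        area += prev_s * (t - prev_t)  # rectangle up to the next drop
        prev_t, prev_s = t, s
    return area + prev_s * (tau - prev_t)  # final rectangle up to tau


# Toy groups (time in days, 1 = event, 0 = censored)
control = ([5, 10, 15, 20, 30], [1, 1, 1, 0, 1])
treated = ([10, 20, 25, 40, 45], [1, 0, 1, 1, 0])
diff = rmst(*treated, tau=30) - rmst(*control, tau=30)
print(f"RMST difference up to day 30: {diff:.1f} days")
```

The difference reads directly as "average days gained within the first 30 days", which is why RMST is easy to explain to stakeholders even when hazards are not proportional.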

Landmark Analysis

Compare survival at a fixed landmark time.

Use when:

Effects differ early vs. late
You want to answer "what's the chance of surviving past day X?"
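A fixed-time comparison of this kind can be sketched with a Kaplan-Meier estimate and Greenwood's variance formula in each group. The helper names `km_at` and `compare_at` and the data are hypothetical. The z-test on the untransformed survival scale is a simplification; log-log transformed intervals are often preferred in practice.

```python
import math


def km_at(times, events, t_star):
    """KM survival probability at t_star and its Greenwood standard error."""
    rows = list(zip(times, events))
    surv, greenwood = 1.0, 0.0
    for t in sorted({t for t, e in rows if e == 1 and t <= t_star}):
        n = sum(1 for ti, _ in rows if ti >= t)                # at risk just before t
        d = sum(1 for ti, e in rows if ti == t and e == 1)     # events at t
        surv *= 1 - d / n
        if n > d:
            greenwood += d / (n * (n - d))
    return surv, surv * math.sqrt(greenwood)


def compare_at(times1, events1, times2, events2, t_star):
    """Difference in survival at t_star (group 2 minus group 1), with z and p."""
    s1, se1 = km_at(times1, events1, t_star)
    s2, se2 = km_at(times2, events2, t_star)
    se_diff = math.sqrt(se1 ** 2 + se2 ** 2)
    z = (s2 - s1) / se_diff if se_diff > 0 else float("nan")
    return s2 - s1, z, math.erfc(abs(z) / math.sqrt(2))


# Toy data: survival past day 12 in two groups
control = ([5, 8, 10, 15, 20, 25], [1, 1, 1, 1, 0, 1])
treated = ([9, 14, 18, 22, 30, 35], [1, 0, 1, 1, 1, 0])
diff, z, p = compare_at(*control, *treated, t_star=12)
print(f"Survival difference at day 12: {diff:.2f} (z = {z:.2f}, p = {p:.3f})")
```

The comparison time should be chosen before looking at the curves; picking the time point where the gap looks largest inflates the false-positive rate.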

Weighted Log-Rank (Fleming-Harrington)

Allows custom weighting across time.

$G^{(\rho, \gamma)}(t) = [S(t^-)]^\rho [1-S(t^-)]^\gamma$

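Here $S(t^-)$ is the pooled Kaplan-Meier estimate just before time $t$.

ρ=0, γ=0 → standard log-rank
ρ=1, γ=0 → Peto-Peto/Prentice version of the Wilcoxon test (early events weighted more)
ρ=0, γ=1 → late events weighted more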
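To see how the weights enter the statistic, here is a dependency-free sketch of a two-group weighted log-rank test with Fleming-Harrington weights. It takes $S(t^-)$ to be the pooled Kaplan-Meier estimate just before each event time. The function name `fh_weighted_logrank` and the toy data are made up for illustration; for real analyses a tested library implementation is the safer choice.

```python
import math


def fh_weighted_logrank(times, events, groups, rho=0.0, gamma=0.0):
    """Two-group weighted log-rank test with weights S(t-)^rho * (1 - S(t-))^gamma.

    S(t-) is the pooled Kaplan-Meier estimate just before each event time.
    Returns (chi_square, p_value) with 1 degree of freedom.
    """
    labels = sorted(set(groups))
    if len(labels) != 2:
        raise ValueError("expected exactly two groups")
    rows = list(zip(times, events, groups))

    pooled_surv = 1.0  # pooled KM just before the current event time
    score = variance = 0.0
    for t in sorted({t for t, e, _ in rows if e == 1}):
        n = sum(1 for ti, _, _ in rows if ti >= t)
        n1 = sum(1 for ti, _, g in rows if ti >= t and g == labels[1])
        d = sum(1 for ti, e, _ in rows if ti == t and e == 1)
        d1 = sum(1 for ti, e, g in rows if ti == t and e == 1 and g == labels[1])

        w = pooled_surv ** rho * (1 - pooled_surv) ** gamma
        score += w * (d1 - d * n1 / n)  # weighted observed minus expected
        if n > 1:
            variance += w ** 2 * d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
        pooled_surv *= 1 - d / n  # update the pooled KM after using S(t-)

    if variance == 0:
        return float("nan"), float("nan")
    chi2 = score ** 2 / variance
    return chi2, math.erfc(math.sqrt(chi2 / 2))


# Same toy data under three weightings
data = ([1, 2, 3, 4], [1, 1, 1, 1], ["A", "A", "B", "B"])
for rho, gamma in [(0, 0), (1, 0), (0, 1)]:
    chi2, p = fh_weighted_logrank(*data, rho=rho, gamma=gamma)
    print(f"rho={rho}, gamma={gamma}: chi2 = {chi2:.3f}, p = {p:.3f}")
```

The weights should be fixed before seeing the data. Trying several (ρ, γ) pairs and reporting the smallest p-value is another form of multiple testing.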

Decision Framework

START: Compare two survival curves
       ↓
CHECK: Do curves cross?
       ├── No → Log-rank is appropriate
       └── Yes → Log-rank may mislead
             ↓
       Consider: RMST, landmark analysis, or piecewise tests
       ↓
CHECK: Does early survival matter more?
       ├── Yes → Wilcoxon/Breslow test
       └── No → Continue
       ↓
CHECK: Heavy censoring?
       ├── Yes → Peto-Peto test
       └── No → Continue
       ↓
USE: Standard log-rank test
       ↓
ALWAYS: Report effect sizes (median difference, HR, RMST)

Key Takeaway

The log-rank test asks "do these survival curves differ?" but doesn't tell you how much or where. It's optimal when hazards are proportional (curves don't cross) and you weight all time points equally. When curves cross, use alternatives like RMST or landmark analysis. Never report just the p-value—always include effect sizes (median survival difference, hazard ratio) that answer the practical question: "How much better is one group than the other?"

Frequently Asked Questions

What does the log-rank test actually test?

It tests H₀: S₁(t) = S₂(t) for all t, i.e., the survival curves are identical at every time point. A significant result means the curves differ somewhere, but doesn't say where or how much.

What if my survival curves cross?

Crossing curves violate proportional hazards—one group may do better early but worse later. Log-rank can give misleading results (could be non-significant despite clear differences). Consider restricted mean survival time (RMST), landmark analysis, or piecewise tests.

How do I report log-rank test results?

Report the chi-square statistic, degrees of freedom, and p-value. Always accompany with descriptive stats: median survival per group, hazard ratio from Cox regression, or survival at key time points with CIs. A p-value alone is not enough.

Key Takeaway

The log-rank test is powerful for comparing survival curves under proportional hazards, but its p-value is not enough on its own. Always pair it with effect size measures (median survival difference, hazard ratio) and check that the curves don't cross. Use alternatives when curves cross or when you care more about early survival.
