Log-Rank Test: When It's Appropriate and Common Misuses
A practical guide to the log-rank test for comparing survival curves. Learn when it works, when it fails, and better alternatives when proportional hazards don't hold.
Quick Hits
- Log-rank tests if survival curves differ, not where or how much
- Assumes proportional hazards - curves shouldn't cross
- Weights events equally across time (late events count same as early)
- Use Wilcoxon/Breslow if early differences matter more
- Significant ≠ meaningful - always report effect sizes (hazard ratios or median differences)
TL;DR
The log-rank test compares survival curves between groups, answering: "Do these curves differ significantly?" It's optimal under proportional hazards (curves don't cross), but can mislead when hazards aren't proportional. Always report effect sizes alongside p-values—a significant log-rank tells you curves differ but not by how much. Consider alternatives (Wilcoxon, RMST) when curves cross or early events matter more.
What the Log-Rank Test Does
The Hypothesis
$$H_0: S_1(t) = S_2(t) \text{ for all } t$$
In words: The survival (retention) curves are identical at every time point.
How It Works
At each event time:
- Count observed events in each group
- Calculate expected events (based on at-risk numbers)
- Compare observed vs. expected
Sum across all event times → chi-square statistic
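To make the mechanics concrete, here is a from-scratch sketch of the two-group computation (a hypothetical helper, with group passed as a boolean indicator). In practice you'd use lifelines or survdiff as shown later; this just mirrors the observed-vs-expected bookkeeping above.
import numpy as np
import pandas as pd

def manual_logrank(time, event, group):
    """Two-group log-rank statistic from first principles.
    `group` is a boolean indicator; returns the chi-square statistic."""
    df = pd.DataFrame({"time": time, "event": event, "group": group})
    o_minus_e = 0.0  # sum of (observed - expected) events in group 1
    var = 0.0        # hypergeometric variance, summed over event times
    for t in np.sort(df.loc[df["event"] == 1, "time"].unique()):
        at_risk = df["time"] >= t
        n = at_risk.sum()                      # total at risk just before t
        n1 = (at_risk & df["group"]).sum()     # group-1 subjects at risk
        dying = (df["time"] == t) & (df["event"] == 1)
        d = dying.sum()                        # total events at t
        d1 = (dying & df["group"]).sum()       # group-1 events at t
        o_minus_e += d1 - d * n1 / n           # observed minus expected
        if n > 1:
            var += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    return o_minus_e**2 / var  # ~ chi-square with 1 df under H0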
What It Tells You
- Significant p-value: Curves differ somewhere
- Non-significant p-value: Can't distinguish curves from chance
What It DOESN'T Tell You
- How much the curves differ
- Where they differ (early vs. late)
- Clinical/business significance
When Log-Rank Works Best
Ideal Conditions
- Proportional hazards: The ratio of hazard rates is constant over time
  - If treatment halves the hazard at day 7, it halves it at day 30 too
  - Visually: curves don't cross
- Censoring is non-informative: Censoring doesn't depend on the outcome
  - Users who drop out would have similar survival to those who stayed
- You want equal weight across time: Early and late events matter equally
Visual Check: Proportional Hazards
PROPORTIONAL HAZARDS (OK):         NON-PROPORTIONAL (PROBLEM):

1.0|---\                           1.0|----\
   |    \                             |     \
0.5|     \----                     0.5|      \----/----
   |          \----                   |          /---
0.0+---------------                0.0+---------------
   0   30   60   90                   0   30   60   90

Curves maintain relative           Curves cross - one group
distance over time                 better early, other better late
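Beyond the eyeball test, lifelines offers a Schoenfeld-residual-based check. A minimal sketch, assuming a data frame df with time, event, and a numeric covariate such as a 0/1 treated column:
from lifelines import CoxPHFitter
from lifelines.statistics import proportional_hazard_test

cph = CoxPHFitter().fit(df, duration_col="time", event_col="event")
result = proportional_hazard_test(cph, df, time_transform="rank")
print(result.summary)  # a small p-value flags non-proportional hazards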
Common Misuses
Misuse 1: Ignoring Crossing Curves
Problem: When curves cross, one group does better early, the other does better late. Log-rank may average to no significant difference, even though differences exist.
Example: New treatment extends survival but has high early mortality. Curves cross at day 30. Log-rank p = 0.15 (not significant), but there are clearly different effects at different times.
Solution: Use alternative methods (see below) or report survival at specific time points.
Misuse 2: Reporting Only the P-Value
Bad: "Log-rank test: p = 0.03, so treatment works."
Better: "Treatment improved median survival from 45 days to 72 days (HR = 0.68, 95% CI: 0.48–0.96, log-rank p = 0.03)."
Misuse 3: Using Log-Rank for Paired/Matched Data
Problem: Log-rank assumes independent groups. Paired data (matched patients, crossover designs) needs different methods.
Solution: Use stratified log-rank or matched-pairs survival methods.
Misuse 4: Multiple Groups Without Adjustment
Problem: Testing A vs. B, A vs. C, B vs. C separately inflates Type I error.
Solution: Use overall k-group log-rank test first. If significant, do pairwise comparisons with multiplicity adjustment.
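In lifelines, the omnibus test plus adjusted pairwise follow-ups can look like this (a sketch; Bonferroni is shown, but Holm is a reasonable alternative):
import numpy as np
from lifelines.statistics import multivariate_logrank_test, pairwise_logrank_test

# omnibus k-group log-rank test first
overall = multivariate_logrank_test(df["time"], df["group"], df["event"])
if overall.p_value < 0.05:
    # one row per pair of groups
    pairwise = pairwise_logrank_test(df["time"], df["group"], df["event"])
    # Bonferroni: scale each p-value by the number of comparisons, cap at 1
    adjusted = np.minimum(pairwise.summary["p"] * len(pairwise.summary), 1.0)
    print(adjusted)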
Misuse 5: Too Much Focus on Statistical Significance
Problem: With large samples, tiny differences become significant. With small samples, large differences may not be.
Solution: Always report effect sizes. A 1-day difference in median survival could be significant with n=100,000 but meaningless in practice.
Code: Log-Rank Test
Python
import numpy as np
import pandas as pd
from lifelines import KaplanMeierFitter
from lifelines.statistics import logrank_test, multivariate_logrank_test
import matplotlib.pyplot as plt
def compare_survival_curves(data, time_col, event_col, group_col, figsize=(10, 6)):
"""
Compare survival curves with log-rank test and visual inspection.
Parameters:
-----------
data : pd.DataFrame
Dataset
time_col : str
Time variable
event_col : str
Event indicator (1=event, 0=censored)
group_col : str
Grouping variable
Returns:
--------
dict with test results and plot
"""
groups = sorted(data[group_col].unique())
results = {}
# Fit KM curves for each group
fig, ax = plt.subplots(figsize=figsize)
kmf_dict = {}
for group in groups:
mask = data[group_col] == group
kmf = KaplanMeierFitter()
kmf.fit(
data.loc[mask, time_col],
data.loc[mask, event_col],
label=f'{group_col}={group}'
)
kmf.plot_survival_function(ax=ax, ci_show=True)
kmf_dict[group] = kmf
ax.set_xlabel('Time')
ax.set_ylabel('Survival Probability')
ax.set_title(f'Survival Curves by {group_col}')
ax.legend(loc='lower left')
results['kmf_by_group'] = kmf_dict
results['figure'] = fig
# Log-rank test
if len(groups) == 2:
g1, g2 = groups
lr = logrank_test(
data.loc[data[group_col] == g1, time_col],
data.loc[data[group_col] == g2, time_col],
data.loc[data[group_col] == g1, event_col],
data.loc[data[group_col] == g2, event_col]
)
results['logrank'] = {
'test_statistic': lr.test_statistic,
'p_value': lr.p_value,
'df': 1
}
# Median survival difference
med1 = kmf_dict[g1].median_survival_time_
med2 = kmf_dict[g2].median_survival_time_
results['median_diff'] = {
f'{g1}': med1,
f'{g2}': med2,
'difference': med2 - med1 if pd.notna(med1) and pd.notna(med2) else None
}
else:
# Multiple groups
mlr = multivariate_logrank_test(
data[time_col],
data[group_col],
data[event_col]
)
results['logrank'] = {
'test_statistic': mlr.test_statistic,
'p_value': mlr.p_value,
'df': len(groups) - 1
}
# Check proportional hazards (visual)
results['crossing_warning'] = check_curves_crossing(data, time_col, event_col, group_col)
return results
def check_curves_crossing(data, time_col, event_col, group_col):
"""
Check if survival curves cross (violating proportional hazards).
"""
groups = sorted(data[group_col].unique())
if len(groups) != 2:
return "Multiple groups - check visually"
g1, g2 = groups
# Fit KM curves
kmf1 = KaplanMeierFitter()
kmf2 = KaplanMeierFitter()
mask1 = data[group_col] == g1
mask2 = data[group_col] == g2
kmf1.fit(data.loc[mask1, time_col], data.loc[mask1, event_col])
kmf2.fit(data.loc[mask2, time_col], data.loc[mask2, event_col])
# Get survival at common time points
times = np.unique(data[time_col])
times = times[times > 0]
s1 = kmf1.survival_function_at_times(times).values
s2 = kmf2.survival_function_at_times(times).values
# Check for crossings
diff = s1 - s2
sign_changes = np.sum(np.diff(np.sign(diff)) != 0)
if sign_changes > 0:
return f"⚠️ Curves cross {sign_changes} time(s). Log-rank may be unreliable."
else:
return "✓ No crossing detected. Proportional hazards assumption appears reasonable."
def report_logrank_result(results, group1_name, group2_name):
"""
Generate stakeholder-friendly report of log-rank results.
"""
lr = results['logrank']
med = results.get('median_diff', {})
report = []
report.append("Survival Comparison Results")
report.append("=" * 50)
# Median survival
if med:
report.append(f"\nMedian Survival Time:")
report.append(f" {group1_name}: {med.get(group1_name, 'Not reached')}")
report.append(f" {group2_name}: {med.get(group2_name, 'Not reached')}")
if med.get('difference') is not None:
report.append(f" Difference: {med['difference']:.1f} days")
# Log-rank test
report.append(f"\nLog-Rank Test:")
report.append(f" χ² = {lr['test_statistic']:.2f}, df = {lr['df']}, p = {lr['p_value']:.4f}")
if lr['p_value'] < 0.05:
report.append(f"\n → Survival curves are significantly different (p < 0.05)")
else:
report.append(f"\n → No significant difference detected (p ≥ 0.05)")
# Warning
if 'crossing_warning' in results:
report.append(f"\n{results['crossing_warning']}")
return "\n".join(report)
# Example usage
if __name__ == "__main__":
np.random.seed(42)
n = 300
# Generate two groups with different survival
group = np.array(['Control'] * (n//2) + ['Treatment'] * (n//2))
hazard = np.where(group == 'Control', 0.02, 0.015) # Treatment has lower hazard
survival_time = np.random.exponential(1/hazard)
# Censoring at random times
censor_time = np.random.exponential(100, n)
observed_time = np.minimum(survival_time, censor_time)
event = (survival_time <= censor_time).astype(int)
data = pd.DataFrame({
'time': observed_time,
'event': event,
'group': group
})
# Compare
results = compare_survival_curves(data, 'time', 'event', 'group')
# Report
print(report_logrank_result(results, 'Control', 'Treatment'))
plt.show()
R
library(tidyverse)
library(survival)
library(survminer)
compare_survival_curves <- function(data, time_col, event_col, group_col) {
#' Compare survival curves with log-rank test
# Create survival object
formula <- as.formula(sprintf("Surv(%s, %s) ~ %s", time_col, event_col, group_col))
# Fit KM
km_fit <- survfit(formula, data = data)
# Log-rank test
lr_test <- survdiff(formula, data = data)
# P-value
p_value <- pchisq(lr_test$chisq, df = length(lr_test$n) - 1, lower.tail = FALSE)
# Median survival by group
med_surv <- surv_median(km_fit)
list(
fit = km_fit,
logrank = list(
chisq = lr_test$chisq,
df = length(lr_test$n) - 1,
p_value = p_value
),
median_survival = med_surv
)
}
# Example
set.seed(42)
n <- 300
data <- tibble(
group = c(rep("Control", n/2), rep("Treatment", n/2)),
hazard = ifelse(group == "Control", 0.02, 0.015)
) %>%
mutate(
survival_time = rexp(n, hazard),
censor_time = rexp(n, 0.01),
time = pmin(survival_time, censor_time),
event = as.integer(survival_time <= censor_time)
)
results <- compare_survival_curves(data, "time", "event", "group")
cat("Log-Rank Test Results\n")
cat(strrep("=", 50), "\n")
cat(sprintf("χ² = %.2f, df = %d, p = %.4f\n",
results$logrank$chisq, results$logrank$df, results$logrank$p_value))
cat("\nMedian Survival by Group:\n")
print(results$median_survival)
# Plot with p-value
ggsurvplot(
results$fit,
data = data,
pval = TRUE,
risk.table = TRUE,
legend.labs = c("Control", "Treatment")
)
Alternatives to Log-Rank
Wilcoxon (Breslow) Test
Weights early events more heavily.
Use when:
- Early survival is more important
- You expect effects to wear off over time
from lifelines.statistics import logrank_test
# Wilcoxon weighting (Breslow)
result = logrank_test(T1, T2, E1, E2, weightings='wilcoxon')
Peto-Peto Test
More robust to heavy censoring.
Use when:
- High censoring rates
- Censoring patterns differ between groups
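lifelines exposes this through the same weightings argument used for the Wilcoxon variant below, with T1/T2 and E1/E2 as the per-group durations and event indicators:
# Peto-Peto weighting
result = logrank_test(T1, T2, E1, E2, weightings='peto')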
Restricted Mean Survival Time (RMST)
Compares area under survival curves up to a time horizon.
Use when:
- Curves cross
- You want an interpretable effect measure (days gained)
- Non-proportional hazards
# RMST in lifelines (the helper lives in lifelines.utils)
from lifelines.utils import restricted_mean_survival_time

rmst1 = restricted_mean_survival_time(kmf1, t=180)
rmst2 = restricted_mean_survival_time(kmf2, t=180)
rmst_diff = rmst2 - rmst1  # Days gained with treatment, up to day 180
Landmark Analysis
Compare survival at a fixed landmark time.
Use when:
- Effects differ early vs. late
- You want to answer "what's the chance of surviving past day X?"
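A minimal sketch using the data frame from the Python example above: keep only subjects still at risk at the landmark (day 30 here, an assumption), reset the clock, and compare survival from that point on.
from lifelines.statistics import logrank_test

landmark = 30  # assumption: day 30 is the relevant landmark
at_risk = data[data["time"] >= landmark].copy()
at_risk["time_from_landmark"] = at_risk["time"] - landmark

treated = at_risk["group"] == "Treatment"
result = logrank_test(
    at_risk.loc[~treated, "time_from_landmark"],
    at_risk.loc[treated, "time_from_landmark"],
    at_risk.loc[~treated, "event"],
    at_risk.loc[treated, "event"],
)
print(result.p_value)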
Weighted Log-Rank (Fleming-Harrington)
Allows custom weighting across time. The weight applied at each event time is
$$w(t) = [\hat{S}(t^-)]^{\rho}\,[1-\hat{S}(t^-)]^{\gamma}$$
where $\hat{S}(t^-)$ is the Kaplan-Meier estimate just before time $t$.
- ρ=0, γ=0 → standard log-rank
- ρ=1, γ=0 → Wilcoxon (early events)
- ρ=0, γ=1 → late events weighted more
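In lifelines, this family is exposed through weightings='fleming-harrington', with the p and q arguments playing the roles of ρ and γ above:
# weight late events more heavily (rho=0, gamma=1)
result = logrank_test(
    T1, T2, E1, E2, weightings='fleming-harrington', p=0, q=1
)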
Decision Framework
START: Compare two survival curves
↓
CHECK: Do curves cross?
├── No → Log-rank is appropriate
└── Yes → Log-rank may mislead
↓
Consider: RMST, landmark analysis, or piecewise tests
↓
CHECK: Does early survival matter more?
├── Yes → Wilcoxon/Breslow test
└── No → Continue
↓
CHECK: Heavy censoring?
├── Yes → Peto-Peto test
└── No → Continue
↓
USE: Standard log-rank test
↓
ALWAYS: Report effect sizes (median difference, HR, RMST)
Related Methods
- Time-to-Event and Retention Analysis (Pillar) - Full survival framework
- Kaplan-Meier Curves - Estimating survival curves
- Cox Proportional Hazards - Regression approach
- Comparing Retention Curves - Multiple group comparisons
Key Takeaway
The log-rank test asks "do these survival curves differ?" but doesn't tell you how much or where. It's optimal when hazards are proportional (curves don't cross) and you weight all time points equally. When curves cross, use alternatives like RMST or landmark analysis. Never report just the p-value—always include effect sizes (median survival difference, hazard ratio) that answer the practical question: "How much better is one group than the other?"
References
- https://www.ncbi.nlm.nih.gov/pmc/articles/PMC403858/
- https://www.tandfonline.com/doi/abs/10.1080/00031305.2019.1589789
- https://onlinelibrary.wiley.com/doi/10.1002/sim.7977
- Bland, J. M., & Altman, D. G. (2004). The logrank test. *BMJ*, 328(7447), 1073.
- Huang, B., & Kuan, P. F. (2018). Comparison of the restricted mean survival time with the hazard ratio in superiority trials with a time-to-event end point. *Pharmaceutical Statistics*, 17(3), 202-213.
- Li, H., Han, D., Hou, Y., Chen, H., & Chen, Z. (2015). Statistical inference methods for two crossing survival curves: a comparison of methods. *PLoS ONE*, 10(1), e0116774.
Frequently Asked Questions
What does the log-rank test actually test?
It tests the null hypothesis that the survival curves are identical at every time point. A significant result means the curves differ somewhere; it says nothing about where or by how much.
What if my survival curves cross?
Crossing curves violate the proportional hazards assumption, and the early and late differences can cancel out in the log-rank statistic. Use RMST, landmark analysis, or a weighted log-rank test instead, and report survival at specific time points.
How do I report log-rank test results?
Report the chi-square statistic, degrees of freedom, and p-value together with effect sizes: median survival per group, a hazard ratio with its confidence interval, or an RMST difference.