Contents
Cox Proportional Hazards: What 'Proportional' Actually Means
A practical guide to Cox regression for product analysts. Learn what the proportional hazards assumption means, how to check it, what to do when it fails, and how to interpret hazard ratios correctly.
Quick Hits
- •Proportional hazards means hazard ratios are CONSTANT over time
- •If treatment cuts hazard by 30% at day 7, it cuts by 30% at day 90 too
- •Violation: effect changes over time (wears off, increases, reverses)
- •Check with Schoenfeld residuals or log-log plots
- •Violations don't invalidate Cox - they change interpretation
TL;DR
Cox proportional hazards regression models how covariates affect the hazard (risk of event). The key assumption—proportional hazards—means hazard ratios stay constant over time. When premium users have HR=0.6 at day 7, they should have HR=0.6 at day 90 too. Violations mean effects change over time. Check with Schoenfeld residuals or log-log plots. When violations occur, consider time-varying coefficients, stratification, or reporting period-specific effects.
The Cox Model
The Formula
$$h(t|X) = h_0(t) \exp(\beta_1 X_1 + \beta_2 X_2 + ...)$$
Where:
- $h(t|X)$ = hazard at time t for someone with covariates X
- $h_0(t)$ = baseline hazard (left unspecified - "semi-parametric")
- $\beta_j$ = log hazard ratio for covariate j
- $\exp(\beta_j)$ = hazard ratio
What Makes It "Proportional"
The hazard ratio between two individuals is:
$$\frac{h(t|X_1)}{h(t|X_2)} = \frac{h_0(t) \exp(\beta X_1)}{h_0(t) \exp(\beta X_2)} = \exp(\beta(X_1 - X_2))$$
The $h_0(t)$ terms cancel out! The ratio depends only on covariate differences, not time.
This is the proportional hazards assumption: The hazard ratio is constant over time.
What Proportional Hazards Means Visually
Proportional Hazards (Assumption Holds)
Hazard over time:
Group A: ----___----____---- Higher hazard
Group B: ---__---___---___ Lower hazard (consistently ~0.6× Group A)
The curves have the same SHAPE (rise and fall together)
but one is always a constant multiple of the other.
On the log(-log(survival)) plot, curves should be parallel.
Non-Proportional Hazards (Assumption Violated)
Hazard over time:
Early: A much higher than B (HR = 0.5)
Later: A and B similar (HR ≈ 1.0)
Late: B higher than A (HR = 1.5)
The ratio changes over time - can't summarize with one number.
Checking the Assumption
Method 1: Schoenfeld Residuals
Schoenfeld residuals should show no trend over time if PH holds.
from lifelines import CoxPHFitter
cph = CoxPHFitter()
cph.fit(data, duration_col='time', event_col='event')
# Check PH assumption
cph.check_assumptions(data, show_plots=True)
library(survival)
cox_model <- coxph(Surv(time, event) ~ treatment + age, data = data)
# Test and plot
ph_test <- cox.zph(cox_model)
print(ph_test)
plot(ph_test)
Interpretation:
- Flat line: PH holds
- Trending line: Violation (effect changes over time)
- Statistical test p < 0.05: Evidence against PH
Method 2: Log-Log Survival Plot
Plot log(-log(S(t))) vs. log(t) for each group:
import numpy as np
import matplotlib.pyplot as plt
from lifelines import KaplanMeierFitter
fig, ax = plt.subplots()
for group in ['A', 'B']:
subset = data[data['group'] == group]
kmf = KaplanMeierFitter()
kmf.fit(subset['time'], subset['event'])
# Log-log transform
t = kmf.survival_function_.index
S = kmf.survival_function_.values.flatten()
# Avoid log(0) and log(1)
mask = (S > 0.001) & (S < 0.999) & (t > 0)
log_log_S = np.log(-np.log(S[mask]))
log_t = np.log(t[mask])
ax.plot(log_t, log_log_S, label=group)
ax.set_xlabel('log(Time)')
ax.set_ylabel('log(-log(S(t)))')
ax.legend()
ax.set_title('Log-Log Plot (Parallel = PH holds)')
Interpretation:
- Parallel curves: PH holds
- Curves that converge, diverge, or cross: PH violated
Method 3: Plotting Hazard Ratios Over Time
Estimate the hazard ratio in different time periods:
# Split data into time windows and estimate HR in each
windows = [(0, 30), (30, 60), (60, 90), (90, 180)]
for start, end in windows:
subset = data[(data['time'] >= start) | (data['time'] <= end)]
cph = CoxPHFitter()
cph.fit(subset, 'time', 'event')
hr = np.exp(cph.params_['treatment'])
print(f"Days {start}-{end}: HR = {hr:.2f}")
What To Do When PH Fails
Option 1: Time-Varying Coefficients
Allow the effect to change over time:
$$h(t|X) = h_0(t) \exp(\beta(t) X)$$
# In lifelines, use time-transform
from lifelines import CoxPHFitter
cph = CoxPHFitter()
cph.fit(data, 'time', 'event', formula='treatment + treatment:np.log(time)')
This allows treatment effect to change with log(time).
Option 2: Stratified Cox Model
Different baseline hazard for each level of the violating variable:
$$h(t|X, Z=z) = h_{0z}(t) \exp(\beta X)$$
# Stratify by a variable
cph = CoxPHFitter()
cph.fit(data, 'time', 'event', strata=['segment'])
Use when: You don't need the HR for the stratifying variable—you just want to account for different baseline hazards.
Option 3: Piecewise Models
Fit separate models for different time periods:
# Early effect (0-30 days)
early_data = data[data['time'] <= 30].copy()
early_data.loc[early_data['time'] > 30, 'event'] = 0
# Late effect (30+ days)
late_data = data[data['time'] > 30].copy()
# Adjust times to start from 0
late_data['time'] = late_data['time'] - 30
Option 4: Report Time-Specific Survival
Instead of a single HR, report survival at key time points:
| Time | Treatment | Control | Difference |
|---|---|---|---|
| D30 | 85% | 78% | +7pp |
| D60 | 72% | 70% | +2pp |
| D90 | 68% | 69% | -1pp |
This honestly shows that the treatment effect diminishes.
Code: Complete Cox Analysis
Python
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter, KaplanMeierFitter
import matplotlib.pyplot as plt
def cox_analysis(data, time_col, event_col, covariates, check_ph=True):
"""
Complete Cox proportional hazards analysis.
Parameters:
-----------
data : pd.DataFrame
Dataset
time_col : str
Time variable
event_col : str
Event indicator
covariates : list
Covariate column names
check_ph : bool
Whether to check proportional hazards assumption
Returns:
--------
dict with model and diagnostics
"""
results = {}
# Prepare data
model_data = data[[time_col, event_col] + covariates].dropna()
# Fit Cox model
cph = CoxPHFitter()
cph.fit(model_data, duration_col=time_col, event_col=event_col)
results['model'] = cph
results['summary'] = cph.summary
# Hazard ratios with CI
hr_table = pd.DataFrame({
'Variable': cph.params_.index,
'Coefficient': cph.params_.values,
'Hazard Ratio': np.exp(cph.params_.values),
'HR CI Lower': np.exp(cph.confidence_intervals_['95% lower-bound'].values),
'HR CI Upper': np.exp(cph.confidence_intervals_['95% upper-bound'].values),
'P-value': cph.summary['p'].values
})
results['hazard_ratios'] = hr_table
# Concordance (discrimination)
results['concordance'] = cph.concordance_index_
# Check proportional hazards
if check_ph:
try:
# Returns test results and prints violations
ph_results = cph.check_assumptions(model_data, show_plots=False)
results['ph_test'] = ph_results
results['ph_violations'] = []
# Check each covariate
for var in covariates:
# This is simplified - real check uses Schoenfeld residuals
pass
except Exception as e:
results['ph_test'] = f"Could not check: {e}"
return results
def plot_schoenfeld(cph, data, time_col, event_col, covariate, figsize=(10, 6)):
"""
Plot Schoenfeld residuals for a covariate to check PH.
"""
model_data = data[[time_col, event_col, covariate]].dropna()
# Refit to get residuals
cph_temp = CoxPHFitter()
cph_temp.fit(model_data, duration_col=time_col, event_col=event_col)
# Get Schoenfeld residuals
residuals = cph_temp.compute_residuals(model_data, kind='schoenfeld')
fig, ax = plt.subplots(figsize=figsize)
event_times = model_data[model_data[event_col] == 1][time_col].values
ax.scatter(event_times, residuals[covariate].values, alpha=0.5)
# Add lowess smooth
from statsmodels.nonparametric.smoothers_lowess import lowess
smooth = lowess(residuals[covariate].values, event_times, frac=0.3)
ax.plot(smooth[:, 0], smooth[:, 1], 'r-', linewidth=2)
ax.axhline(0, color='grey', linestyle='--')
ax.set_xlabel('Time')
ax.set_ylabel(f'Schoenfeld Residual ({covariate})')
ax.set_title(f'PH Check: {covariate}\n(Flat = PH holds, Trend = Violation)')
return fig
def interpret_hazard_ratio(hr, var_name, var_type='binary'):
"""
Generate plain-English interpretation of hazard ratio.
"""
if hr < 1:
direction = "lower"
pct = (1 - hr) * 100
quality = "protective" if var_type == 'binary' else "associated with slower events"
else:
direction = "higher"
pct = (hr - 1) * 100
quality = "risk factor" if var_type == 'binary' else "associated with faster events"
if var_type == 'binary':
interp = f"{var_name} is associated with {pct:.0f}% {direction} hazard of the event. "
interp += f"This is a {quality}."
else:
interp = f"Each unit increase in {var_name} is associated with {pct:.1f}% {direction} hazard. "
return interp
# Example usage
if __name__ == "__main__":
np.random.seed(42)
n = 500
# Generate data with proportional hazards
data = pd.DataFrame({
'treatment': np.random.binomial(1, 0.5, n),
'age': np.random.normal(40, 10, n),
'premium': np.random.binomial(1, 0.3, n)
})
# Survival times (proportional hazards)
log_hazard = -4 + 0.5 * data['treatment'] + 0.02 * data['age'] - 0.3 * data['premium']
hazard = np.exp(log_hazard)
data['time'] = np.random.exponential(1/hazard)
data['time'] = np.minimum(data['time'], 365)
data['event'] = np.random.binomial(1, 0.8, n)
# Run analysis
results = cox_analysis(data, 'time', 'event', ['treatment', 'age', 'premium'])
print("Cox Proportional Hazards Results")
print("=" * 60)
print("\nHazard Ratios:")
print(results['hazard_ratios'].to_string(index=False))
print(f"\nConcordance Index: {results['concordance']:.3f}")
# Interpretations
print("\nInterpretations:")
for _, row in results['hazard_ratios'].iterrows():
var = row['Variable']
hr = row['Hazard Ratio']
var_type = 'binary' if var in ['treatment', 'premium'] else 'continuous'
print(f" {interpret_hazard_ratio(hr, var, var_type)}")
R
library(tidyverse)
library(survival)
library(survminer)
cox_analysis <- function(data, time_col, event_col, covariates) {
#' Complete Cox analysis with PH checking
# Create formula
formula <- as.formula(sprintf(
"Surv(%s, %s) ~ %s",
time_col, event_col,
paste(covariates, collapse = " + ")
))
# Fit model
model <- coxph(formula, data = data)
# Hazard ratios
hr_table <- tibble(
Variable = names(coef(model)),
Coefficient = coef(model),
`Hazard Ratio` = exp(coef(model)),
`HR CI Lower` = exp(confint(model))[, 1],
`HR CI Upper` = exp(confint(model))[, 2],
`P-value` = summary(model)$coefficients[, "Pr(>|z|)"]
)
# Concordance
concordance <- summary(model)$concordance[1]
# PH test
ph_test <- cox.zph(model)
list(
model = model,
hazard_ratios = hr_table,
concordance = concordance,
ph_test = ph_test
)
}
# Example
set.seed(42)
n <- 500
data <- tibble(
treatment = rbinom(n, 1, 0.5),
age = rnorm(n, 40, 10),
premium = rbinom(n, 1, 0.3)
) %>%
mutate(
log_hazard = -4 + 0.5*treatment + 0.02*age - 0.3*premium,
time = rexp(n, exp(log_hazard)),
time = pmin(time, 365),
event = rbinom(n, 1, 0.8)
)
# Run analysis
results <- cox_analysis(data, "time", "event", c("treatment", "age", "premium"))
cat("Cox Proportional Hazards Results\n")
cat(strrep("=", 60), "\n")
cat("\nHazard Ratios:\n")
print(results$hazard_ratios)
cat(sprintf("\nConcordance Index: %.3f\n", results$concordance))
# PH test
cat("\nProportional Hazards Test:\n")
print(results$ph_test)
# Plot Schoenfeld residuals
plot(results$ph_test)
Interpreting When PH Fails
What Your HR Means Under Non-Proportionality
When PH fails, the single HR is essentially an average across time, weighted by the number of events at each time. This can be misleading:
Example: Treatment helps early (HR = 0.3) but hurts late (HR = 1.5)
- Overall HR might be ~0.7 (averaging)
- But this doesn't represent the effect at any specific time
Honest Reporting
Instead of one HR, report:
- "Treatment effect varies over time"
- "HR = 0.3 for days 1-30, HR = 0.9 for days 30-90, HR = 1.2 for days 90+"
- Or: survival at key time points with CIs
Related Methods
- Time-to-Event and Retention Analysis (Pillar) - Full survival framework
- Hazard Ratio Interpretation - Understanding HRs
- Log-Rank Test - Simpler comparison
- Alternatives to Cox - When Cox doesn't fit
Key Takeaway
The proportional hazards assumption means hazard ratios are constant over time. This is what lets you summarize an effect with a single number. When PH holds, Cox regression is powerful and interpretable. When it fails, your HR is an average that may not represent the effect at any specific time. Always check with Schoenfeld residuals or log-log plots. When PH fails, consider time-varying coefficients, stratification, or honest reporting of effects at different time periods.
References
- https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6015946/
- https://cran.r-project.org/web/packages/survival/vignettes/survival.pdf
- https://doi.org/10.1080/00031305.2020.1790553
- Persson, I., & Khamis, H. (2005). Bias of the Cox model hazard ratio. *Journal of Modern Applied Statistical Methods*, 4(1), 90-99.
- Therneau, T. M., & Grambsch, P. M. (2000). Modeling survival data: extending the Cox model. Springer.
- Syriopoulou, E., Mozumder, S. I., Rutherford, M. J., & Lambert, P. C. (2022). Robustness of individual and marginal model-based estimates: A sensitivity analysis of flexible parametric models. *PLoS ONE*, 17(4), e0267016.
Frequently Asked Questions
What does 'proportional hazards' mean in plain English?
What happens if the proportional hazards assumption is violated?
How do I check the proportional hazards assumption?
Key Takeaway
The proportional hazards assumption means hazard ratios are constant over time. When it holds, a single HR summarizes the effect. When it fails, the effect varies over time—you need time-varying coefficients, stratification, or period-specific analyses. Always check the assumption; violations change what your results mean.