
Granger Causality: Does Feature Usage Actually Drive Retention?

How to use Granger causality to test whether feature usage predicts retention, and why correlation over time is not causation.


Quick Hits

  • Granger causality tests whether X predicts Y beyond Y's own history -- not true causation
  • Two metrics can be correlated over time simply because both have trends -- always difference first
  • The test requires stationary time series -- apply differencing or detrending before running it
  • Reverse causality is common: retention might drive feature usage, not the other way around
  • Use VAR models to test bidirectional Granger causality simultaneously

TL;DR

Your product team believes that feature usage drives retention. But does past feature usage actually predict future retention, or are both just trending up together? Granger causality provides a statistical framework for testing this. It asks: do past values of feature usage improve predictions of retention beyond what retention's own past values provide? This guide covers the method, implementation, interpretation, and the crucial caveats about what Granger causality does and does not prove.


The Question Every Product Team Asks

"Users who use Feature X have higher retention." This observation is ubiquitous in product analytics. But it raises a critical question: does Feature X drive retention, or do retained users simply use more features?

Correlation between two metrics at the same point in time cannot answer this. Even looking at the time series -- feature usage going up while retention goes up -- does not help, because both might be responding to a third factor (a marketing push, a seasonal cycle, organic growth).

Granger causality offers a structured approach: does knowing the past of feature usage help predict the future of retention, even after accounting for retention's own past? If yes, there is a temporal predictive relationship that is at least consistent with (though not proof of) a causal mechanism.


What Granger Causality Means

The Formal Definition

X Granger-causes Y if past values of X contain information that helps predict Y beyond what past values of Y alone provide.

Formally, X Granger-causes Y if:

\text{MSE}[E(Y_t | Y_{t-1}, \ldots, Y_{t-p})] > \text{MSE}[E(Y_t | Y_{t-1}, \ldots, Y_{t-p}, X_{t-1}, \ldots, X_{t-p})]

In other words, the forecast error for Y is smaller when you include past X than when you use past Y alone.
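
In regression terms, the test compares a restricted model (lags of Y only) against an unrestricted model (lags of Y plus lags of X) and asks whether the X lags jointly improve the fit. A minimal sketch of that comparison, assuming retention and feature_usage are already-stationary pandas Series aligned on the same index:

import pandas as pd
import statsmodels.api as sm

def granger_f_test(y, x, p=7):
    # Compare Y ~ lags(Y) against Y ~ lags(Y) + lags(X) with a joint F-test
    df = pd.DataFrame({'y': y, 'x': x})
    for lag in range(1, p + 1):
        df[f'y_lag{lag}'] = df['y'].shift(lag)
        df[f'x_lag{lag}'] = df['x'].shift(lag)
    df = df.dropna()

    y_lags = [f'y_lag{lag}' for lag in range(1, p + 1)]
    x_lags = [f'x_lag{lag}' for lag in range(1, p + 1)]
    restricted = sm.OLS(df['y'], sm.add_constant(df[y_lags])).fit()
    unrestricted = sm.OLS(df['y'], sm.add_constant(df[y_lags + x_lags])).fit()

    # Small p-value => the X lags add predictive information about Y
    f_stat, p_value, _ = unrestricted.compare_f_test(restricted)
    return f_stat, p_value

The grangercausalitytests function used later in this guide performs this comparison for you at every lag up to maxlag, so you rarely need to build it by hand.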

What It Is

  • A test for temporal precedence: Does X lead Y in time?
  • A test for incremental predictive power: Does X add information beyond Y's own history?
  • A structured way to assess lead-lag relationships between product metrics

What It Is Not

  • Not proof of causation: A third variable might drive both X and Y with different lags
  • Not a mechanism: It does not tell you how X affects Y
  • Not immune to confounding: Unmeasured variables can create spurious Granger causality

Prerequisites: Making Data Stationary

Why Stationarity Matters

Granger causality tests assume stationary time series. Non-stationary data (data with trends, unit roots, or changing variance) produces spurious results. Two unrelated metrics that both trend upward will appear to Granger-cause each other.

Testing for Stationarity

Use the Augmented Dickey-Fuller (ADF) test:

from statsmodels.tsa.stattools import adfuller

# Test feature_usage for stationarity
result = adfuller(feature_usage, maxlag=14, regression='ct')
print(f"ADF Statistic: {result[0]:.4f}")
print(f"p-value: {result[1]:.4f}")
# p > 0.05 => non-stationary, needs differencing

Achieving Stationarity

If either series is non-stationary (p > 0.05 on the ADF test), apply differencing:

# First differencing: compute day-over-day changes
feature_usage_diff = feature_usage.diff().dropna()
retention_diff = retention.diff().dropna()

Re-test with ADF after differencing. Most product metrics become stationary after first differencing. If not, apply second differencing.
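
A small helper can automate that loop -- a sketch, assuming a pandas Series input and the adfuller defaults:

from statsmodels.tsa.stattools import adfuller

def make_stationary(series, max_diffs=2, alpha=0.05):
    # Difference until the ADF test rejects a unit root (p < alpha)
    for d in range(max_diffs + 1):
        if adfuller(series.dropna())[1] < alpha:
            return series.dropna(), d
        series = series.diff()
    raise ValueError("still non-stationary after max_diffs differences")

feature_usage_diff, d = make_stationary(feature_usage)
print(f"Differenced {d} time(s)")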

Important: After differencing, you are testing whether changes in feature usage predict changes in retention, rather than whether levels predict levels. This is actually the more meaningful question for most product contexts.


Running the Granger Causality Test

Implementation

from statsmodels.tsa.stattools import grangercausalitytests
import pandas as pd

# Combine into a DataFrame (Y first, X second)
data = pd.DataFrame({
    'retention': retention_diff,
    'feature_usage': feature_usage_diff
}).dropna()

# Test: Does feature_usage Granger-cause retention?
print("Does feature usage Granger-cause retention?")
result = grangercausalitytests(
    data[['retention', 'feature_usage']],
    maxlag=14
)

The function tests at each lag from 1 to maxlag and reports four test statistics. The F-test is the standard one to report.
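
The return value is a dictionary keyed by lag, where each entry contains the test statistics and the fitted regressions. A quick way to pull out just the F-test p-values:

# Collect the ssr-based F-test p-value at each lag
f_pvalues = {lag: res[0]['ssr_ftest'][1] for lag, res in result.items()}
print(f_pvalues)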

Interpreting Results

The output shows p-values for each lag:

Lag 1: F-test p-value = 0.0023  (significant)
Lag 2: F-test p-value = 0.0041  (significant)
...
Lag 7: F-test p-value = 0.0156  (significant)
Lag 8: F-test p-value = 0.1234  (not significant)

This would suggest that feature usage from 1-7 days ago helps predict today's retention. The lag structure is itself informative: a 1-7 day lag aligns with the idea that weekly feature usage affects weekly retention.

Testing Both Directions

Always test bidirectional causality:

# Does retention Granger-cause feature_usage?
print("Does retention Granger-cause feature usage?")
result_reverse = grangercausalitytests(
    data[['feature_usage', 'retention']],
    maxlag=14
)

Four possible outcomes:

| Feature -> Retention | Retention -> Feature | Interpretation |
| --- | --- | --- |
| Significant | Not significant | Feature usage predicts retention (consistent with feature driving retention) |
| Not significant | Significant | Retention predicts feature usage (retained users use features more) |
| Significant | Significant | Feedback loop (both predict each other) |
| Not significant | Not significant | No temporal predictive relationship |
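
To keep the two runs organized, a small wrapper (a sketch, assuming a stationary two-column DataFrame like the one built above) can report the strongest result in each direction:

def granger_both_directions(data, x_col, y_col, maxlag=7):
    # Minimum ssr F-test p-value across lags -- for screening only;
    # correct for multiple lags and pairs before drawing conclusions
    def min_p(cols):
        res = grangercausalitytests(data[cols], maxlag=maxlag)
        return min(r[0]['ssr_ftest'][1] for r in res.values())
    return {
        f'{x_col} -> {y_col}': min_p([y_col, x_col]),  # caused column goes first
        f'{y_col} -> {x_col}': min_p([x_col, y_col]),
    }

print(granger_both_directions(data, 'feature_usage', 'retention'))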

VAR Models: The Multivariate Framework

Beyond Pairwise Tests

Granger causality between two variables can be confounded by a third. Vector Autoregression (VAR) models extend the framework to multiple variables simultaneously.

from statsmodels.tsa.api import VAR

# Include multiple metrics
data = pd.DataFrame({
    'retention': retention_diff,
    'feature_usage': feature_usage_diff,
    'sessions': sessions_diff,
    'support_tickets': support_diff
}).dropna()

# Fit VAR model
model = VAR(data)
lag_order = model.select_order(maxlags=14)
print(lag_order.summary())  # Select optimal lag via AIC/BIC

results = model.fit(lag_order.aic)
print(results.summary())

Granger Causality in VAR

After fitting the VAR, test Granger causality while controlling for other variables:

# Test: Does feature_usage Granger-cause retention,
# controlling for sessions and support_tickets?
gc_result = results.test_causality(
    'retention',
    ['feature_usage'],
    kind='f'
)
print(gc_result.summary())

This is more reliable than pairwise testing because it controls for confounders that you have measured.


Lag Selection

Why It Matters

The number of lags determines the time horizon over which you test the predictive relationship. Too few lags: you might miss the true lead-lag relationship. Too many lags: you reduce statistical power and risk overfitting.

Choosing Lags with Information Criteria

model = VAR(data)
lag_order = model.select_order(maxlags=21)

# AIC and BIC often disagree -- BIC penalizes complexity more
print(f"AIC suggests: {lag_order.aic} lags")
print(f"BIC suggests: {lag_order.bic} lags")

For daily product data:

  • 7 lags: Captures one week of predictive horizon (good default)
  • 14 lags: Captures two weeks (use when weekly cycles are strong)
  • Let AIC/BIC decide: Most principled approach

Product Analytics Examples

Does Push Notification Usage Predict DAU?

Hypothesis: Users who interact with push notifications are more likely to return.

data = pd.DataFrame({
    'dau_change': dau.diff().dropna(),
    'push_opens_change': push_opens.diff().dropna()
})

# Test: push_opens -> DAU
grangercausalitytests(data[['dau_change', 'push_opens_change']], maxlag=7)

If significant at lags 1-3, push notification engagement predicts DAU increases one to three days later, consistent with push notifications driving return visits.

Does Content Creation Drive Consumption?

Hypothesis: More content creation leads to more content consumption (supply drives demand).

Test both directions -- in many platforms, consumption also drives creation (users who consume are inspired to create). A bidirectional finding reveals a flywheel effect.
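
A sketch of that two-way test, using hypothetical daily Series named content_created and content_consumed:

creation = pd.DataFrame({
    'consumed': content_consumed.diff(),
    'created': content_created.diff()
}).dropna()

# Does creation Granger-cause consumption? (caused column first)
grangercausalitytests(creation[['consumed', 'created']], maxlag=7)

# Does consumption Granger-cause creation?
grangercausalitytests(creation[['created', 'consumed']], maxlag=7)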

Does Onboarding Completion Predict Week 2 Activity?

Hypothesis: Users who complete onboarding in week 1 are more active in week 2.

Use weekly aggregated data (not daily) since the relationship operates at a weekly cadence. Difference the weekly series and test with 1-4 week lags.
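
A sketch of that setup, assuming onboarding_completions and activity are daily Series with a DatetimeIndex:

# Aggregate to weekly totals, difference, and test with up to 4 weekly lags
weekly = pd.DataFrame({
    'activity': activity.resample('W').sum().diff(),
    'onboarding': onboarding_completions.resample('W').sum().diff()
}).dropna()

grangercausalitytests(weekly[['activity', 'onboarding']], maxlag=4)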


Avoiding Spurious Granger Causality

Common Traps

Shared trends: Two trending metrics will appear to Granger-cause each other. Always difference first.

Shared seasonality: Metrics with the same weekly cycle will show spurious Granger causality at lag 7. Deseasonalize before testing.
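
Two common ways to remove a weekly cycle before testing, sketched for a daily Series:

from statsmodels.tsa.seasonal import STL

# Option 1: seasonal (lag-7) differencing removes a stable weekly pattern
feature_usage_deseason = feature_usage.diff(7).dropna()

# Option 2: keep the STL remainder after removing trend and weekly seasonality
feature_usage_deseason = STL(feature_usage, period=7).fit().resid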

External shocks: A marketing campaign boosts both feature usage and retention. The temporal overlap creates apparent Granger causality. Include the confounding variable in a VAR model to control for it.

Multiple testing: Testing many metric pairs inflates false positives. Apply Bonferroni or Benjamini-Hochberg correction when testing multiple pairs.
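
For example, a Benjamini-Hochberg adjustment over a set of pairwise results -- a sketch, where pair_pvalues is a hypothetical dict mapping each metric pair to its minimum F-test p-value:

from statsmodels.stats.multitest import multipletests

pairs = list(pair_pvalues.keys())
reject, p_adj, _, _ = multipletests([pair_pvalues[p] for p in pairs],
                                    alpha=0.05, method='fdr_bh')
for pair, p, sig in zip(pairs, p_adj, reject):
    print(f"{pair}: adjusted p = {p:.4f}{' (significant)' if sig else ''}")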

Robustness Checks

  1. Test both directions: One-directional Granger causality is more convincing than bidirectional (which may indicate confounding).
  2. Vary the lag order: If the result is only significant at one specific lag and not nearby lags, be skeptical.
  3. Check different time periods: Does the relationship hold in different months or quarters?
  4. Include controls: Add relevant metrics to the VAR to control for confounders.
  5. Combine with experiments: Granger causality generates hypotheses; A/B tests or interrupted time series test them.

From Granger Causality to Action

Granger causality is a starting point, not an endpoint. The workflow is:

  1. Observe a correlation between feature usage and retention
  2. Test with Granger causality to establish temporal precedence
  3. Control for confounders using VAR
  4. Validate with an experiment (A/B test or quasi-experiment)
  5. Act based on the accumulated evidence

A Granger causality finding that feature usage predicts retention (but not vice versa), controlling for other metrics, combined with a successful A/B test of a feature promotion, provides strong evidence for allocating resources to that feature. Either piece alone is weaker. Together, they tell a compelling story.


Frequently Asked Questions

Does Granger causality prove actual causation?
No. Granger causality is predictive, not mechanistic. If X Granger-causes Y, it means past values of X help predict future Y beyond what past Y alone can predict. But this could be because X truly causes Y, because both are driven by an unmeasured third variable, or because of other confounding. It is evidence for a temporal relationship, not proof of causation.
How do I choose the number of lags for the test?
Use information criteria (AIC or BIC) to select the optimal lag order. Fit VAR models with different lag orders and choose the one that minimizes AIC or BIC. In practice, for daily product data, start with 7-14 lags (1-2 weeks) and let the information criterion decide. Too few lags miss the relationship; too many lags reduce power.
What if both directions are significant?
Bidirectional Granger causality is common and informative. If feature usage Granger-causes retention AND retention Granger-causes feature usage, there is a feedback loop: engaged users use features more, and using features keeps them engaged. This is valuable for understanding product dynamics, even though it does not identify the original cause.

Key Takeaway

Granger causality tests whether one metric's past values help predict another metric's future values. It is a rigorous way to assess lead-lag relationships between product metrics, but it is not proof of true causation. Always make data stationary first, test both directions, and use the results as evidence to be combined with domain knowledge, not as standalone causal claims.
