Model Evaluation

Drift Detection: KS Test, PSI, and Interpreting Signals

How to detect when your model's inputs or outputs have shifted. Learn about KS tests, Population Stability Index, and when drift actually matters.


Quick Hits

  • Data drift: input distribution changes (features look different)
  • Concept drift: relationship between inputs and outputs changes (same features, different outcomes)
  • KS test: detects any distribution change, but becomes oversensitive with large samples
  • PSI: practical threshold-based metric (0.10-0.25 = investigate, ≥0.25 = take action)
  • Statistical significance ≠ practical significance—small drift with large n is significant but harmless

TL;DR

Drift detection identifies when your model's environment has changed. Data drift means inputs look different; concept drift means the input-output relationship has changed. The KS test detects distribution changes but is oversensitive with large samples. PSI (Population Stability Index) provides practical thresholds for action. The key insight: statistically significant drift doesn't always mean problematic drift—focus on whether model performance actually degrades.


Types of Drift

Data Drift (Covariate Shift)

Input distribution changes, but the relationship between inputs and outputs stays the same.

Example: Your user base shifts younger

  • Feature age distribution changes
  • But P(purchase | age) relationship stays the same
  • Model trained on older users may extrapolate poorly

Concept Drift

Input-output relationship changes, even if input distribution is stable.

Example: Economic conditions change

  • Same features (income, employment)
  • But purchase behavior changes dramatically
  • Model predicts poorly despite seeing familiar inputs (the simulation sketch below contrasts this with data drift)
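
To make the distinction concrete, here is a minimal simulation sketch (synthetic data; the purchase_prob helper and all numbers are illustrative, not from a real system). Under data drift the age distribution shifts but P(purchase | age) stays fixed; under concept drift the ages look the same but the outcome rule has moved.

import numpy as np

np.random.seed(42)


def purchase_prob(age):
    """Fixed reference-period relationship P(purchase | age)."""
    return 1 / (1 + np.exp(-(age - 45) / 10))


# Reference period
age_ref = np.random.normal(50, 10, 10000)
buy_ref = np.random.binomial(1, purchase_prob(age_ref))

# Data drift: users skew younger, but P(purchase | age) is unchanged
age_dd = np.random.normal(40, 10, 10000)
buy_dd = np.random.binomial(1, purchase_prob(age_dd))

# Concept drift: same age distribution, but the outcome rule has moved
age_cd = np.random.normal(50, 10, 10000)
buy_cd = np.random.binomial(1, 1 / (1 + np.exp(-(age_cd - 55) / 10)))

print(f"Reference:     mean age {age_ref.mean():.1f}, purchase rate {buy_ref.mean():.2f}")
print(f"Data drift:    mean age {age_dd.mean():.1f}, purchase rate {buy_dd.mean():.2f}")
print(f"Concept drift: mean age {age_cd.mean():.1f}, purchase rate {buy_cd.mean():.2f}")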

Prediction Drift

Model output distribution changes (regardless of cause).

Example: Model confidence shifts

  • Model starts predicting more confidently
  • Could be data drift, could be a bug
  • Worth investigating either way

Kolmogorov-Smirnov (KS) Test

The Method

Compares two samples by the maximum difference between their empirical CDFs:

$$D = \max_x |F_1(x) - F_2(x)|$$
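
To see exactly what D measures, here is a minimal sketch that computes it directly from the two empirical CDFs (the helper name is illustrative); it matches scipy.stats.ks_2samp up to floating point. In practice you would let scipy do this, as in the function below.

import numpy as np


def ks_statistic_manual(sample1, sample2):
    """D = max over t of |F1(t) - F2(t)|, with F1, F2 the empirical CDFs."""
    x1, x2 = np.sort(sample1), np.sort(sample2)
    grid = np.concatenate([x1, x2])  # the maximum is attained at an observed value
    f1 = np.searchsorted(x1, grid, side='right') / len(x1)  # fraction of sample1 <= t
    f2 = np.searchsorted(x2, grid, side='right') / len(x2)  # fraction of sample2 <= t
    return np.max(np.abs(f1 - f2))


np.random.seed(42)
a = np.random.normal(50, 10, 2000)
b = np.random.normal(48, 12, 2000)
print(ks_statistic_manual(a, b))  # same value as stats.ks_2samp(a, b).statistic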

import numpy as np
from scipy import stats


def ks_drift_test(reference, current, alpha=0.05):
    """
    Two-sample KS test for drift detection.

    Parameters:
    -----------
    reference : array
        Baseline distribution (training or reference period)
    current : array
        Current distribution to compare
    alpha : float
        Significance level used to flag statistically significant drift

    Returns:
    --------
    dict with statistic, p-value, and interpretation
    """
    statistic, p_value = stats.ks_2samp(reference, current)

    # Effect size interpretation
    if statistic < 0.05:
        effect = "Negligible"
    elif statistic < 0.10:
        effect = "Small"
    elif statistic < 0.20:
        effect = "Medium"
    else:
        effect = "Large"

    return {
        'ks_statistic': statistic,
        'p_value': p_value,
        'statistically_significant': p_value < alpha,
        'effect_size': effect,
        'interpretation': f"{'Significant' if p_value < alpha else 'No significant'} drift detected (D={statistic:.3f})"
    }


# Example: Feature drift
np.random.seed(42)

# Reference period
reference = np.random.normal(50, 10, 10000)  # Age ~N(50, 10)

# Current period with slight drift
current = np.random.normal(48, 12, 10000)  # Younger, more variable

result = ks_drift_test(reference, current)

print("KS Drift Test")
print("=" * 50)
print(f"Reference: n={len(reference)}, mean={reference.mean():.1f}, std={reference.std():.1f}")
print(f"Current: n={len(current)}, mean={current.mean():.1f}, std={current.std():.1f}")
print(f"\nKS Statistic: {result['ks_statistic']:.4f}")
print(f"p-value: {result['p_value']:.4e}")
print(f"Effect size: {result['effect_size']}")
print(f"Interpretation: {result['interpretation']}")

The Problem with Large Samples

def demonstrate_ks_sensitivity():
    """
    KS test is overly sensitive with large samples.
    """
    np.random.seed(42)

    # Tiny drift: mean differs by 0.5 (5% of std)
    results = []

    for n in [100, 1000, 10000, 100000]:
        ref = np.random.normal(50, 10, n)
        cur = np.random.normal(50.5, 10, n)  # Tiny shift

        result = ks_drift_test(ref, cur)
        results.append({
            'n': n,
            'ks_stat': result['ks_statistic'],
            'p_value': result['p_value'],
            'significant': result['statistically_significant']
        })

    print("KS Sensitivity to Sample Size")
    print("(Same tiny drift: 0.5 unit mean shift)")
    print("=" * 60)
    print(f"{'n':>10} {'KS Stat':>12} {'p-value':>15} {'Significant':>12}")
    print("-" * 60)
    for r in results:
        print(f"{r['n']:>10,} {r['ks_stat']:>12.4f} {r['p_value']:>15.2e} {str(r['significant']):>12}")


demonstrate_ks_sensitivity()
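
One practical mitigation, consistent with the alert logic used later for multi-feature checks, is to require both statistical significance and a minimum effect size before flagging drift. A minimal sketch building on ks_drift_test above (the 0.1 cutoff is a convention, not a universal rule):

def ks_drift_alert(reference, current, alpha=0.05, min_effect=0.1):
    """Alert only when drift is both statistically and practically significant."""
    result = ks_drift_test(reference, current, alpha=alpha)
    result['alert'] = (
        result['statistically_significant']
        and result['ks_statistic'] >= min_effect
    )
    return result


# The tiny 0.5-unit shift from above: significant at n=100,000, but no alert
ref = np.random.normal(50, 10, 100000)
cur = np.random.normal(50.5, 10, 100000)
print(ks_drift_alert(ref, cur)['alert'])  # False: D is roughly 0.02, below min_effect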

Population Stability Index (PSI)

The Method

Compares distributions bin-by-bin:

$$\text{PSI} = \sum_{i=1}^{k} (A_i - E_i) \cdot \ln\left(\frac{A_i}{E_i}\right)$$

Where:

  • $A_i$ = actual (current) proportion in bin i
  • $E_i$ = expected (reference) proportion in bin i
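
As a quick sanity check on the formula: a bin that holds 20% of reference traffic but 30% of current traffic contributes $(0.30 - 0.20) \cdot \ln(0.30 / 0.20) \approx 0.10 \times 0.405 \approx 0.04$ to the total, and every bin's contribution is non-negative because the difference and the log-ratio always share the same sign.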

Implementation

def psi(reference, current, n_bins=10):
    """
    Population Stability Index for drift detection.

    Thresholds:
    - PSI < 0.10: No significant change
    - 0.10 ≤ PSI < 0.25: Moderate change, investigate
    - PSI ≥ 0.25: Significant change, action needed
    """
    # Define bins from reference
    _, bin_edges = np.histogram(reference, bins=n_bins)
    bin_edges[0] = -np.inf
    bin_edges[-1] = np.inf

    # Count proportions
    ref_counts, _ = np.histogram(reference, bins=bin_edges)
    cur_counts, _ = np.histogram(current, bins=bin_edges)

    ref_props = ref_counts / len(reference)
    cur_props = cur_counts / len(current)

    # Avoid division by zero
    ref_props = np.maximum(ref_props, 0.001)
    cur_props = np.maximum(cur_props, 0.001)

    # PSI calculation
    psi_values = (cur_props - ref_props) * np.log(cur_props / ref_props)
    psi_total = np.sum(psi_values)

    # Interpretation
    if psi_total < 0.10:
        interpretation = "No significant change"
        action = "None required"
    elif psi_total < 0.25:
        interpretation = "Moderate change"
        action = "Investigate"
    else:
        interpretation = "Significant change"
        action = "Action needed"

    return {
        'psi': psi_total,
        'bin_psi': psi_values.tolist(),
        'interpretation': interpretation,
        'action': action,
        'ref_props': ref_props.tolist(),
        'cur_props': cur_props.tolist()
    }


# Example
np.random.seed(42)
reference = np.random.normal(50, 10, 5000)
current_mild = np.random.normal(52, 10, 5000)  # Mild drift
current_severe = np.random.normal(60, 15, 5000)  # Severe drift

psi_mild = psi(reference, current_mild)
psi_severe = psi(reference, current_severe)

print("PSI Drift Detection")
print("=" * 50)
print("\nMild drift (mean: 50 → 52):")
print(f"  PSI: {psi_mild['psi']:.4f}")
print(f"  {psi_mild['interpretation']} - {psi_mild['action']}")

print("\nSevere drift (mean: 50 → 60, std: 10 → 15):")
print(f"  PSI: {psi_severe['psi']:.4f}")
print(f"  {psi_severe['interpretation']} - {psi_severe['action']}")

PSI Thresholds

PSI Value     Interpretation          Action
< 0.10        No significant change   Continue monitoring
0.10 - 0.25   Moderate change         Investigate causes
0.25 - 0.50   Significant change      Review model performance
> 0.50        Severe change           Consider retraining
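
For convenience, here is a small helper that maps a PSI value onto this table (a sketch; note the psi() function above collapses the top two rows into a single "action needed" level):

def psi_action(psi_value):
    """Map a PSI value to the threshold table above."""
    if psi_value < 0.10:
        return "No significant change", "Continue monitoring"
    elif psi_value < 0.25:
        return "Moderate change", "Investigate causes"
    elif psi_value < 0.50:
        return "Significant change", "Review model performance"
    return "Severe change", "Consider retraining"


print(psi_action(0.31))  # ('Significant change', 'Review model performance')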

Multi-Feature Drift Detection

def multi_feature_drift(reference_df, current_df, method='psi'):
    """
    Check drift across multiple features.
    """
    results = []

    for column in reference_df.columns:
        ref = reference_df[column].values
        cur = current_df[column].values

        if method == 'psi':
            result = psi(ref, cur)
            metric = result['psi']
            alert = result['psi'] >= 0.25
        else:  # ks
            result = ks_drift_test(ref, cur)
            metric = result['ks_statistic']
            alert = result['statistically_significant'] and result['ks_statistic'] > 0.1

        results.append({
            'feature': column,
            'metric': metric,
            'alert': alert
        })

    return sorted(results, key=lambda x: -x['metric'])


# Example: Multiple features
import pandas as pd

np.random.seed(42)
n = 5000

reference_df = pd.DataFrame({
    'age': np.random.normal(45, 15, n),
    'income': np.random.lognormal(10.5, 0.5, n),
    'tenure': np.random.exponential(5, n),
    'clicks': np.random.poisson(10, n)
})

# Current with drift in some features
current_df = pd.DataFrame({
    'age': np.random.normal(42, 15, n),  # Shifted
    'income': np.random.lognormal(10.5, 0.5, n),  # Same
    'tenure': np.random.exponential(4, n),  # Shifted
    'clicks': np.random.poisson(10, n)  # Same
})

drift_results = multi_feature_drift(reference_df, current_df, method='psi')

print("Multi-Feature Drift Report")
print("=" * 50)
print(f"{'Feature':<15} {'PSI':>10} {'Alert':>10}")
print("-" * 50)
for r in drift_results:
    alert_str = "⚠️ YES" if r['alert'] else "No"
    print(f"{r['feature']:<15} {r['metric']:>10.4f} {alert_str:>10}")

Prediction Drift Monitoring

def monitor_predictions(reference_preds, current_preds, thresholds=None):
    """
    Monitor drift in model predictions.
    """
    if thresholds is None:
        thresholds = {'ks': 0.1, 'psi': 0.25, 'mean_shift': 0.1}

    # KS test
    ks_result = ks_drift_test(reference_preds, current_preds)

    # PSI
    psi_result = psi(reference_preds, current_preds)

    # Mean shift
    mean_ref = np.mean(reference_preds)
    mean_cur = np.mean(current_preds)
    mean_shift = abs(mean_cur - mean_ref) / (np.std(reference_preds) + 1e-10)

    alerts = []
    if ks_result['ks_statistic'] > thresholds['ks']:
        alerts.append(f"KS statistic {ks_result['ks_statistic']:.3f} > {thresholds['ks']}")
    if psi_result['psi'] > thresholds['psi']:
        alerts.append(f"PSI {psi_result['psi']:.3f} > {thresholds['psi']}")
    if mean_shift > thresholds['mean_shift']:
        alerts.append(f"Mean shift {mean_shift:.3f} std > {thresholds['mean_shift']}")

    return {
        'ks': ks_result,
        'psi': psi_result,
        'mean_shift': mean_shift,
        'mean_reference': mean_ref,
        'mean_current': mean_cur,
        'alerts': alerts,
        'status': 'ALERT' if alerts else 'OK'
    }


# Example: Model prediction monitoring
np.random.seed(42)
ref_preds = np.random.beta(5, 2, 5000)  # Reference predictions
cur_preds = np.random.beta(4, 2, 5000)  # Slightly shifted

monitor_result = monitor_predictions(ref_preds, cur_preds)

print("Prediction Drift Monitor")
print("=" * 50)
print(f"Reference: mean={monitor_result['mean_reference']:.3f}")
print(f"Current: mean={monitor_result['mean_current']:.3f}")
print(f"\nKS Statistic: {monitor_result['ks']['ks_statistic']:.4f}")
print(f"PSI: {monitor_result['psi']['psi']:.4f}")
print(f"Mean Shift: {monitor_result['mean_shift']:.3f} std")
print(f"\nStatus: {monitor_result['status']}")
if monitor_result['alerts']:
    print("Alerts:")
    for alert in monitor_result['alerts']:
        print(f"  - {alert}")

When Drift Matters

Decision Framework

DRIFT DETECTED
       ↓
QUESTION: Is model performance degrading?
├── Yes → Investigate and potentially retrain
└── No → Continue monitoring
       ↓
QUESTION: Is the drift expected?
├── Yes (seasonal, known changes) → Document and monitor
└── No (unexpected) → Investigate root cause
       ↓
QUESTION: Is drift in important features?
├── Yes (top predictors) → Higher priority
└── No (low-importance features) → Lower priority
       ↓
ACTION:
- Minor drift, stable performance → Continue monitoring
- Moderate drift, slight degradation → Investigate
- Severe drift, degraded performance → Retrain or alert
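
Here is one way this framework could be encoded as a triage function. It is a sketch under illustrative assumptions: per-feature PSI values, a performance delta (current metric minus reference), a set of important features, and an optional set of features where drift is expected.

def triage_drift(feature_psi, perf_delta, important_features, expected_drift=()):
    """Turn drift measurements into a monitoring decision (illustrative thresholds)."""
    drifted = {f: v for f, v in feature_psi.items() if v >= 0.25}
    unexpected = {f: v for f, v in drifted.items() if f not in expected_drift}
    high_priority = {f: v for f, v in unexpected.items() if f in important_features}

    if perf_delta < -0.02 and drifted:
        return "Retrain or alert", drifted
    if high_priority:
        return "Investigate", high_priority
    if unexpected:
        return "Lower priority: continue monitoring", unexpected
    return "Continue monitoring", {}


decision, details = triage_drift(
    feature_psi={'age': 0.31, 'income': 0.04},
    perf_delta=-0.001,
    important_features={'age'},
)
print(decision, details)  # Investigate {'age': 0.31}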

Correlation with Performance

def drift_performance_analysis(ref_X, ref_y, cur_X, cur_y, model, feature_names):
    """
    Analyze relationship between feature drift and performance.
    """
    from sklearn.metrics import roc_auc_score

    # Feature drift
    drift_scores = []
    for i, name in enumerate(feature_names):
        result = psi(ref_X[:, i], cur_X[:, i])
        drift_scores.append({
            'feature': name,
            'psi': result['psi']
        })

    # Performance
    ref_auc = roc_auc_score(ref_y, model.predict_proba(ref_X)[:, 1])
    cur_auc = roc_auc_score(cur_y, model.predict_proba(cur_X)[:, 1])

    return {
        'drift_scores': sorted(drift_scores, key=lambda x: -x['psi']),
        'ref_auc': ref_auc,
        'cur_auc': cur_auc,
        'auc_change': cur_auc - ref_auc,
        'performance_degraded': cur_auc < ref_auc - 0.02
    }
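
A usage sketch with synthetic data and a logistic regression (the data-generating process, feature names, and the 0.02 AUC cutoff inside the function are illustrative):

from sklearn.linear_model import LogisticRegression

np.random.seed(42)
n = 5000
feature_names = ['age', 'engagement']


def make_data(age_mean):
    """Generate features and outcomes; the outcome rule itself never changes."""
    X = np.column_stack([
        np.random.normal(age_mean, 15, n),
        np.random.normal(0, 1, n)
    ])
    p = 1 / (1 + np.exp(-(0.05 * (X[:, 0] - 45) + X[:, 1])))
    return X, np.random.binomial(1, p)


ref_X, ref_y = make_data(age_mean=45)   # reference period
cur_X, cur_y = make_data(age_mean=38)   # age has drifted younger

model = LogisticRegression(max_iter=1000).fit(ref_X, ref_y)
analysis = drift_performance_analysis(ref_X, ref_y, cur_X, cur_y, model, feature_names)

print(f"AUC: {analysis['ref_auc']:.3f} -> {analysis['cur_auc']:.3f}")
print(f"Performance degraded: {analysis['performance_degraded']}")
for d in analysis['drift_scores']:
    print(f"  {d['feature']}: PSI={d['psi']:.3f}")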

R Implementation

# KS test
ks.test(reference, current)

# PSI function
psi <- function(ref, cur, n_bins = 10) {
    breaks <- quantile(ref, probs = seq(0, 1, length.out = n_bins + 1))
    breaks[1] <- -Inf               # open-ended outer bins so current values
    breaks[n_bins + 1] <- Inf       # outside the reference range still count
    ref_props <- table(cut(ref, breaks, include.lowest = TRUE)) / length(ref)
    cur_props <- table(cut(cur, breaks, include.lowest = TRUE)) / length(cur)

    # Avoid zeros
    ref_props <- pmax(ref_props, 0.001)
    cur_props <- pmax(cur_props, 0.001)

    sum((cur_props - ref_props) * log(cur_props / ref_props))
}


Key Takeaway

Drift detection tells you when the world has changed—input distributions shifting (data drift) or relationships changing (concept drift). Use KS tests for statistical detection but beware of oversensitivity with large samples. PSI provides practical thresholds that focus on actionable drift levels. The key question isn't "has the distribution changed?" (it probably has), but "does this change affect my model's usefulness?" Monitor features, predictions, and outcomes together; drift that doesn't hurt performance may not need action.



Frequently Asked Questions

What's the difference between data drift and concept drift?
Data drift = input features have changed distribution (users are older, requests are longer). Concept drift = the relationship between features and target has changed (same users behave differently). Data drift can cause problems without concept drift (model hasn't seen these inputs), and concept drift can occur without data drift (same inputs, different outcomes).
My KS test is always significant—is that a problem?
Not necessarily. With large samples, KS detects tiny differences that don't matter. Use PSI with practical thresholds instead, or interpret KS p-values alongside effect size (the KS statistic itself). A p < 0.001 with KS statistic of 0.02 is statistically but not practically significant.
How often should I check for drift?
Depends on your domain. High-frequency (daily/weekly) for user behavior models, lower frequency (monthly) for stable domains. Set up automated monitoring with PSI thresholds that trigger alerts only for meaningful drift levels.

