
Independence: The Silent Killer of Statistical Validity

The independence assumption is the most critical and most commonly violated. Learn to detect non-independence from repeated measures, clustering, and time series—and what to do about it.

Quick Hits

  • Independence violations can't be fixed with robust methods—they require structural solutions
  • Multiple observations per user are the most common violation in product analytics
  • Ignoring clustering can overstate your effective sample size by 10x or more
  • Solutions include mixed models, clustered standard errors, or aggregation

TL;DR

Independence violations are the most dangerous assumption problems because they can't be fixed with robust methods. When observations are correlated (repeated measures, clustering, time series), standard errors shrink artificially, making everything look more significant than it is. The solutions require restructuring your analysis: aggregate to independent units, use clustered standard errors, or fit models that account for the correlation structure.


What Independence Means

The Formal Definition

Observations are independent if knowing the value of one tells you nothing about another:

$$P(Y_i | Y_j) = P(Y_i) \text{ for all } i \neq j$$

Practical Translation

Ask yourself: "If I know user A's value, does that help me predict user B's value?"

  • Independent: Different users, each randomly sampled once from the population
  • Not independent: Multiple purchases from the same user, students in the same classroom, time-series data

The checklist below summarizes the most common sources of non-independence in product analytics.

import numpy as np
from scipy import stats
import pandas as pd

def independence_checklist():
    """
    Common sources of non-independence in product analytics.
    """
    violations = {
        'Repeated measures': {
            'example': 'Multiple sessions/purchases per user',
            'why_problematic': 'Users are consistent with themselves',
            'effective_n': 'Much smaller than observation count'
        },
        'Clustering': {
            'example': 'Users in same company, classroom, region',
            'why_problematic': 'Within-cluster similarity',
            'effective_n': 'Closer to number of clusters than observations'
        },
        'Time series': {
            'example': 'Daily/weekly metrics',
            'why_problematic': 'Autocorrelation - today predicts tomorrow',
            'effective_n': 'Depends on correlation structure'
        },
        'Network effects': {
            'example': 'Users who interact with each other',
            'why_problematic': 'Treatment spreads through network',
            'effective_n': 'May need network randomization'
        },
        'Spatial correlation': {
            'example': 'Users in nearby locations',
            'why_problematic': 'Geographic similarity',
            'effective_n': 'Depends on spatial structure'
        }
    }
    return violations

The Damage: Inflated Type I Error

Simulation: How Bad Can It Get?

def simulate_clustering_damage(n_sims=5000):
    """
    Show how clustering inflates Type I error.
    """
    np.random.seed(42)

    # Scenario: 20 clusters of 10 observations each = 200 total
    n_clusters = 20
    cluster_size = 10
    n_total = n_clusters * cluster_size

    # Vary intraclass correlation (ICC)
    iccs = [0, 0.1, 0.3, 0.5, 0.7]

    results = {}
    for icc in iccs:
        # ICC determines how much variance is between vs within clusters
        between_var = icc
        within_var = 1 - icc

        rejections_naive = 0
        rejections_clustered = 0

        for _ in range(n_sims):
            # Generate null data (no treatment effect)
            data = []
            for cluster in range(n_clusters):
                # Half clusters are "treatment", half are "control"
                treatment = cluster >= n_clusters // 2
                cluster_effect = np.random.normal(0, np.sqrt(between_var))

                for _ in range(cluster_size):
                    obs = cluster_effect + np.random.normal(0, np.sqrt(within_var))
                    data.append({
                        'cluster': cluster,
                        'treatment': treatment,
                        'value': obs
                    })

            df = pd.DataFrame(data)

            # Naive analysis (ignores clustering)
            control = df[~df['treatment']]['value']
            treat = df[df['treatment']]['value']
            _, p_naive = stats.ttest_ind(control, treat)
            if p_naive < 0.05:
                rejections_naive += 1

            # Correct analysis (cluster means)
            cluster_means = df.groupby(['treatment', 'cluster'])['value'].mean()
            control_means = cluster_means[False].values
            treat_means = cluster_means[True].values
            _, p_correct = stats.ttest_ind(control_means, treat_means)
            if p_correct < 0.05:
                rejections_clustered += 1

        results[icc] = {
            'naive': rejections_naive / n_sims,
            'correct': rejections_clustered / n_sims
        }

    print("Type I Error with Clustered Data (α = 0.05)")
    print("=" * 50)
    print(f"{'ICC':>6} {'Naive':>12} {'Clustered':>12} {'Inflation':>12}")
    print("-" * 50)
    for icc, res in results.items():
        inflation = res['naive'] / 0.05
        print(f"{icc:>6.1f} {res['naive']:>12.3f} {res['correct']:>12.3f} {inflation:>11.1f}x")

    return results


simulate_clustering_damage()

What's Actually Happening

def explain_effective_sample_size():
    """
    Explain the effective sample size concept.
    """
    print("EFFECTIVE SAMPLE SIZE")
    print("=" * 50)
    print()
    print("With clustering, your effective n is:")
    print()
    print("  n_eff = n / (1 + (m - 1) * ICC)")
    print()
    print("Where:")
    print("  n = total observations")
    print("  m = average cluster size")
    print("  ICC = intraclass correlation")
    print()
    print("Example: 1000 purchases from 100 users (m = 10 each)")
    print("-" * 50)

    n_total = 1000
    m = 10  # 10 observations per user

    for icc in [0.1, 0.3, 0.5, 0.7]:
        n_eff = n_total / (1 + (m - 1) * icc)
        print(f"ICC = {icc}: n_eff = {n_eff:.0f} (not {n_total})")


explain_effective_sample_size()

Detecting Non-Independence

Check 1: Data Structure

def check_data_structure(df, unit_col):
    """
    Identify potential independence violations from data structure.
    """
    # Count observations per unit
    obs_per_unit = df.groupby(unit_col).size()

    results = {
        'total_observations': len(df),
        'unique_units': df[unit_col].nunique(),
        'obs_per_unit': {
            'mean': obs_per_unit.mean(),
            'median': obs_per_unit.median(),
            'max': obs_per_unit.max(),
            'min': obs_per_unit.min()
        },
        'units_with_multiple': (obs_per_unit > 1).sum(),
        'pct_from_repeat_units': (obs_per_unit[obs_per_unit > 1].sum() /
                                   len(df) * 100 if (obs_per_unit > 1).any() else 0)
    }

    # Diagnosis
    if results['obs_per_unit']['mean'] > 1.5:
        results['warning'] = (
            f"⚠️  Average of {results['obs_per_unit']['mean']:.1f} observations per unit. "
            f"Independence is likely violated."
        )

    return results


# Example
np.random.seed(42)
n_users = 50
data = []
for user in range(n_users):
    n_sessions = np.random.poisson(5) + 1
    for session in range(n_sessions):
        data.append({
            'user_id': user,
            'session_value': np.random.normal(100, 20)
        })

df = pd.DataFrame(data)
structure = check_data_structure(df, 'user_id')

print("Data Structure Analysis:")
print("-" * 40)
print(f"Total observations: {structure['total_observations']}")
print(f"Unique users: {structure['unique_units']}")
print(f"Mean obs per user: {structure['obs_per_unit']['mean']:.1f}")
print(f"Max obs per user: {structure['obs_per_unit']['max']}")
if 'warning' in structure:
    print(f"\n{structure['warning']}")

Check 2: Intraclass Correlation (ICC)

def calculate_icc(df, group_col, value_col):
    """
    Calculate intraclass correlation coefficient.
    """
    # Get group means and sizes
    groups = df.groupby(group_col)[value_col]
    group_means = groups.mean()
    group_sizes = groups.size()
    grand_mean = df[value_col].mean()

    n_groups = len(group_means)
    n_total = len(df)
    avg_size = n_total / n_groups

    # Between-group variance
    ss_between = sum(group_sizes * (group_means - grand_mean)**2)
    ms_between = ss_between / (n_groups - 1)

    # Within-group variance
    ss_within = sum(groups.apply(lambda x: sum((x - x.mean())**2)))
    ms_within = ss_within / (n_total - n_groups)

    # ICC (one-way random effects)
    icc = (ms_between - ms_within) / (ms_between + (avg_size - 1) * ms_within)
    icc = max(0, icc)  # Can't be negative

    # Effective sample size
    n_eff = n_total / (1 + (avg_size - 1) * icc)

    return {
        'icc': icc,
        'n_total': n_total,
        'n_groups': n_groups,
        'avg_group_size': avg_size,
        'effective_n': n_eff,
        'design_effect': 1 + (avg_size - 1) * icc
    }


# Calculate ICC for our example
icc_result = calculate_icc(df, 'user_id', 'session_value')

print("Intraclass Correlation Analysis:")
print("-" * 40)
print(f"ICC: {icc_result['icc']:.3f}")
print(f"Total observations: {icc_result['n_total']}")
print(f"Effective sample size: {icc_result['effective_n']:.0f}")
print(f"Design effect: {icc_result['design_effect']:.1f}")

Check 3: Autocorrelation (Time Series)

from scipy.stats import pearsonr

def check_autocorrelation(series, max_lag=5):
    """
    Check for autocorrelation in time series data.
    """
    series = np.array(series)
    n = len(series)

    results = {'lags': []}
    for lag in range(1, min(max_lag + 1, n // 4)):
        corr, p = pearsonr(series[:-lag], series[lag:])
        results['lags'].append({
            'lag': lag,
            'correlation': corr,
            'p_value': p
        })

    # Durbin-Watson statistic
    residuals = series - np.mean(series)
    dw = np.sum(np.diff(residuals)**2) / np.sum(residuals**2)
    results['durbin_watson'] = dw
    # DW ≈ 2 means no autocorrelation
    # DW < 2 means positive autocorrelation
    # DW > 2 means negative autocorrelation

    return results


# Example: Daily metrics
np.random.seed(42)
daily_metric = np.cumsum(np.random.normal(0.1, 1, 100)) + 100  # Random walk

auto_result = check_autocorrelation(daily_metric)
print("Autocorrelation Check:")
print("-" * 40)
print(f"Durbin-Watson: {auto_result['durbin_watson']:.2f} (2 = no autocorrelation)")
print("\nLag correlations:")
for lag_info in auto_result['lags'][:3]:
    print(f"  Lag {lag_info['lag']}: r = {lag_info['correlation']:.3f}, "
          f"p = {lag_info['p_value']:.4f}")

Solutions

Solution 1: Aggregate to Independent Units

The simplest solution: collapse data to one observation per independent unit.

def aggregate_solution(df, unit_col, value_col, group_col=None):
    """
    Aggregate to unit level to restore independence.
    """
    if group_col:
        # Aggregate within groups
        aggregated = df.groupby([group_col, unit_col]).agg({
            value_col: ['mean', 'std', 'count']
        }).reset_index()
        aggregated.columns = [group_col, unit_col, 'mean', 'std', 'n_obs']
    else:
        aggregated = df.groupby(unit_col).agg({
            value_col: ['mean', 'std', 'count']
        }).reset_index()
        aggregated.columns = [unit_col, 'mean', 'std', 'n_obs']

    return aggregated


# Example: Compare treatment vs control
np.random.seed(42)
data = []
for user in range(100):
    treatment = user >= 50
    user_effect = np.random.normal(0, 5)
    treatment_effect = 3 if treatment else 0

    for session in range(np.random.poisson(5) + 1):
        data.append({
            'user_id': user,
            'treatment': treatment,
            'value': 50 + user_effect + treatment_effect + np.random.normal(0, 10)
        })

df = pd.DataFrame(data)

# Wrong: observation-level analysis
control_obs = df[~df['treatment']]['value']
treat_obs = df[df['treatment']]['value']
_, p_wrong = stats.ttest_ind(control_obs, treat_obs)

# Right: user-level analysis
user_means = df.groupby(['treatment', 'user_id'])['value'].mean().reset_index()
control_users = user_means[~user_means['treatment']]['value']
treat_users = user_means[user_means['treatment']]['value']
_, p_right = stats.ttest_ind(control_users, treat_users)

print("Impact of Aggregation:")
print("-" * 40)
print(f"Observation-level (WRONG): n = {len(df)}, p = {p_wrong:.4f}")
print(f"User-level (RIGHT): n = {len(user_means)}, p = {p_right:.4f}")

Solution 2: Clustered Standard Errors

For regression, adjust standard errors to account for clustering.

import statsmodels.api as sm

def clustered_regression(df, outcome_col, predictor_cols, cluster_col):
    """
    Regression with clustered standard errors.
    """
    y = df[outcome_col]
    X = sm.add_constant(df[predictor_cols])

    # Standard OLS
    model_standard = sm.OLS(y, X).fit()

    # With clustered standard errors
    model_clustered = sm.OLS(y, X).fit(cov_type='cluster',
                                        cov_kwds={'groups': df[cluster_col]})

    return model_standard, model_clustered


# Example
np.random.seed(42)

# Create clustered data with treatment effect
data = []
for cluster in range(50):
    treatment = cluster >= 25
    cluster_effect = np.random.normal(0, 10)

    for _ in range(np.random.poisson(10) + 1):
        value = 50 + cluster_effect + (5 if treatment else 0) + np.random.normal(0, 5)
        data.append({
            'cluster': cluster,
            'treatment': int(treatment),
            'value': value
        })

df = pd.DataFrame(data)

model_std, model_clust = clustered_regression(df, 'value', ['treatment'], 'cluster')

print("Regression Results Comparison:")
print("-" * 50)
print(f"{'':20} {'Standard SE':>15} {'Clustered SE':>15}")
print("-" * 50)

for var in ['const', 'treatment']:
    se_std = model_std.bse[var]
    se_clust = model_clust.bse[var]
    ratio = se_clust / se_std
    print(f"{var:20} {se_std:>15.3f} {se_clust:>15.3f} ({ratio:.1f}x)")

print("\nP-values for treatment:")
print(f"  Standard: {model_std.pvalues['treatment']:.4f}")
print(f"  Clustered: {model_clust.pvalues['treatment']:.4f}")

Solution 3: Mixed Effects Models

For complex structures, use multilevel/mixed models.

import statsmodels.formula.api as smf

def mixed_model_analysis(df, outcome, fixed_effects, random_effects):
    """
    Fit mixed effects model.
    """
    formula = f"{outcome} ~ {fixed_effects}"
    model = smf.mixedlm(formula, df, groups=random_effects).fit()
    return model


# Example: Users nested in companies
np.random.seed(42)
data = []
for company in range(20):
    company_effect = np.random.normal(0, 10)
    treatment = company >= 10

    for user in range(np.random.randint(5, 15)):
        user_effect = np.random.normal(0, 5)

        for session in range(np.random.poisson(3) + 1):
            value = (50 + company_effect + user_effect +
                    (5 if treatment else 0) + np.random.normal(0, 3))
            data.append({
                'company': company,
                'user': f"{company}_{user}",
                'treatment': int(treatment),
                'value': value
            })

df = pd.DataFrame(data)

# Fit mixed model (random intercept for company)
model = smf.mixedlm("value ~ treatment", df, groups="company").fit()

print("Mixed Model Results:")
print("-" * 50)
print(model.summary().tables[1])
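
A useful by-product of the mixed model is a model-based ICC computed from its fitted variance components. A minimal sketch using the model object above (statsmodels exposes the random-intercept variance as cov_re and the residual variance as scale):

# Model-based ICC: share of total variance attributable to the company level
var_between = model.cov_re.iloc[0, 0]  # random intercept (company) variance
var_within = model.scale               # residual variance
icc_model = var_between / (var_between + var_within)
print(f"\nModel-based ICC (company level): {icc_model:.3f}")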

Decision Framework

Do observations repeat within units (users, clusters, time)?
│
├── NO → Standard methods OK
│
└── YES → How to handle?
          │
          ├── Simple structure (one observation per unit possible)?
          │   └── Aggregate to unit means
          │
          ├── Need observation-level analysis?
          │   ├── Regression → Clustered standard errors
          │   └── Complex nesting → Mixed effects models
          │
          └── Time series?
              └── Time series methods (ARIMA, etc.)
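
If it helps to make the branching explicit, the same logic can be written as a small helper. This is an illustrative sketch only; recommend_approach and its arguments are invented for this post, not a library API.

def recommend_approach(obs_per_unit_mean, has_time_ordering=False,
                       needs_observation_level=False, complex_nesting=False):
    """
    Rough mapping from data structure to an analysis strategy,
    following the decision tree above.
    """
    if has_time_ordering:
        return "Time series methods (ARIMA, etc.)"
    if obs_per_unit_mean <= 1:
        return "Standard methods OK"
    if not needs_observation_level:
        return "Aggregate to unit means"
    if complex_nesting:
        return "Mixed effects model"
    return "Regression with clustered standard errors"


print(recommend_approach(obs_per_unit_mean=6.2, needs_observation_level=True))
# -> Regression with clustered standard errors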

R Implementation

# Check for clustering
check_clustering <- function(df, unit_col, value_col) {
  library(ICC)

  # ICC calculation
  icc_result <- ICCbare(unit_col, value_col, data = df)

  # Effective sample size
  n_total <- nrow(df)
  n_units <- length(unique(df[[unit_col]]))
  avg_size <- n_total / n_units
  n_eff <- n_total / (1 + (avg_size - 1) * icc_result)

  cat("Independence Check\n")
  cat(rep("=", 40), "\n")
  cat(sprintf("Total observations: %d\n", n_total))
  cat(sprintf("Unique units: %d\n", n_units))
  cat(sprintf("ICC: %.3f\n", icc_result))
  cat(sprintf("Effective n: %.0f\n", n_eff))
}

# Clustered standard errors
library(sandwich)
library(lmtest)

model <- lm(value ~ treatment, data = df)
coeftest(model, vcov = vcovCL(model, cluster = df$cluster))

# Mixed effects model
library(lme4)
model <- lmer(value ~ treatment + (1|cluster), data = df)
summary(model)


Key Takeaway

Independence is the most critical assumption because violations cannot be fixed with robust methods. When observations are clustered (repeated measures, grouped data, time series), your effective sample size is smaller than your observation count—sometimes dramatically so. The solutions require structural changes: aggregate to independent units, use clustered standard errors, or fit mixed models. Always ask: "Are my observations truly independent?"



Frequently Asked Questions

What happens if I ignore non-independence?
Your standard errors are too small, p-values too small, and confidence intervals too narrow. You'll claim significance that doesn't exist and make decisions based on noise.

Can I just use a robust test to fix this?
No. Robust methods like Welch's t-test or bootstrap don't fix independence violations. You need structural solutions: aggregate data, use clustered standard errors, or fit mixed models.
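
This is easy to verify with a small simulation in the spirit of the one earlier in the post. The sketch below generates null clustered data (no real effect, ICC = 0.5) and applies Welch's t-test to the raw observations; it reuses the numpy and scipy.stats imports from above.

# Welch's t-test does not rescue clustered data:
# 20 clusters of 10 observations, no treatment effect, ICC = 0.5
np.random.seed(0)
n_sims = 2000
rejections = 0
for _ in range(n_sims):
    cluster_effects = np.random.normal(0, np.sqrt(0.5), 20)
    values = np.repeat(cluster_effects, 10) + np.random.normal(0, np.sqrt(0.5), 200)
    treatment = np.repeat(np.arange(20) >= 10, 10)
    _, p = stats.ttest_ind(values[treatment], values[~treatment], equal_var=False)
    rejections += p < 0.05

print(f"Welch's t-test Type I error at ICC = 0.5: {rejections / n_sims:.2f} (nominal 0.05)")
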
How do I know if my data is independent?
Ask: Can knowing one observation tell me anything about another? If users appear multiple times, if observations are clustered (same classroom, company, region), or if data is ordered in time—independence is likely violated.

