Assumption Checks and What To Do When They Fail

A comprehensive guide to statistical assumptions in hypothesis testing. Learn which assumptions matter most, how to diagnose violations, and what to do when your data doesn't fit the textbook requirements.


Quick Hits

  • Independence is the most critical assumption—violations can't be fixed with robust methods
  • Normality matters least with large samples due to the Central Limit Theorem
  • Equal variances matter more than normality for t-tests and ANOVA
  • Most assumption tests are underpowered when it matters and overpowered when it doesn't
  • When in doubt, use robust methods or bootstrap—they rarely hurt and often help

TL;DR

Every statistical test has assumptions, but not all assumptions matter equally. Independence is critical—violations require structural solutions. Equal variance matters for group comparisons, especially with unequal sample sizes. Normality matters least, especially for large samples. When assumptions fail, don't panic: you have robust alternatives, transformations, and different tests. The key is matching the severity of your response to the severity of the violation.


The Assumption Hierarchy

Not all assumptions are created equal. Here's the hierarchy of concern:

Critical: Independence

What it means: Observations don't influence each other.

Why it's critical: Violations fundamentally change what your test estimates. A "sample of 1000" might effectively be a sample of 10 if observations are clustered.

Can you fix it?: Not with standard robust methods. Requires structural solutions (mixed models, clustered standard errors, or different experimental design).

Important: Homogeneity of Variance

What it means: Groups have similar spread.

Why it matters: Affects standard error estimation and inference, especially with unequal sample sizes.

Can you fix it?: Yes—Welch's t-test, Games-Howell post-hoc, robust standard errors.

Least Critical: Normality

What it means: Data (or residuals) follow a bell curve.

Why it matters less: Central Limit Theorem means sampling distributions of means become normal with sufficient sample size.

Can you fix it?: Yes—bootstrap, rank-based tests, or often just ignore with large samples.

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

def demonstrate_clt():
    """
    Show how the Central Limit Theorem makes normality less critical.
    """
    np.random.seed(42)

    # Highly non-normal distribution (exponential)
    population = stats.expon(scale=10)

    sample_sizes = [5, 15, 30, 100]
    n_simulations = 10000

    fig, axes = plt.subplots(2, 4, figsize=(16, 8))

    for i, n in enumerate(sample_sizes):
        # Simulate sample means
        sample_means = [population.rvs(n).mean() for _ in range(n_simulations)]

        # Top row: distribution of sample means
        axes[0, i].hist(sample_means, bins=50, density=True, alpha=0.7)
        axes[0, i].set_title(f'n = {n}')
        axes[0, i].set_xlabel('Sample Mean')

        # Bottom row: Q-Q plot
        stats.probplot(sample_means, dist="norm", plot=axes[1, i])
        axes[1, i].set_title(f'Q-Q Plot (n = {n})')

    axes[0, 0].set_ylabel('Density')
    plt.suptitle('CLT: Sample means become normal as n increases\n(from highly skewed exponential population)')
    plt.tight_layout()
    return fig


# Type I error rates under non-normality
def simulate_type1_nonnormal(sample_sizes, n_sims=10000):
    """
    Show that t-test maintains Type I error even with non-normal data.
    """
    results = {}

    for n in sample_sizes:
        rejections = 0
        for _ in range(n_sims):
            # Two samples from same exponential distribution (null true)
            group1 = stats.expon(scale=10).rvs(n)
            group2 = stats.expon(scale=10).rvs(n)

            _, p = stats.ttest_ind(group1, group2)
            if p < 0.05:
                rejections += 1

        results[n] = rejections / n_sims

    return results

type1_rates = simulate_type1_nonnormal([10, 20, 30, 50, 100])
print("Type I error rates with highly skewed (exponential) data:")
for n, rate in type1_rates.items():
    print(f"  n = {n}: {rate:.3f} (nominal: 0.050)")

Independence: The Silent Killer

Independence violations are the most dangerous because they're often invisible and can't be fixed with standard robust methods.

Common Independence Violations

def identify_independence_issues():
    """
    Common sources of non-independence in product analytics.
    """
    violations = {
        'Repeated measures': {
            'Example': 'Multiple purchases per user',
            'Problem': 'Users are correlated with themselves',
            'Solution': 'Mixed models, or aggregate to user level'
        },
        'Clustering': {
            'Example': 'Users in same company/classroom/region',
            'Problem': 'Within-cluster similarity inflates sample size',
            'Solution': 'Clustered standard errors, mixed models'
        },
        'Time series': {
            'Example': 'Daily metrics, autocorrelated errors',
            'Problem': 'Today predicts tomorrow',
            'Solution': 'Time series models, Newey-West errors'
        },
        'Network effects': {
            'Example': 'Users who interact with each other',
            'Problem': 'Treatment spills over between users',
            'Solution': 'Network randomization, cluster by network'
        },
        'Device/session': {
            'Example': 'Multiple sessions per user',
            'Problem': 'Sessions from same user aren\'t independent',
            'Solution': 'User-level randomization and analysis'
        }
    }
    return violations
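
Structural fixes mean changing the model, not just the test. As a concrete illustration, here is a minimal sketch of cluster-robust standard errors with statsmodels (assumed installed); the 'value', 'group', and 'cluster_id' column names are illustrative, not from a specific dataset.

import pandas as pd
import statsmodels.formula.api as smf

def clustered_se_sketch(df):
    """
    Sketch: cluster-robust standard errors via statsmodels.
    Expects illustrative columns 'value', 'group', 'cluster_id'.
    """
    # Naive OLS treats every row as independent
    naive = smf.ols('value ~ group', data=df).fit()

    # Cluster-robust SEs account for within-cluster correlation
    clustered = smf.ols('value ~ group', data=df).fit(
        cov_type='cluster', cov_kwds={'groups': df['cluster_id']}
    )

    # For autocorrelated time series, HAC (Newey-West) errors are analogous:
    # .fit(cov_type='HAC', cov_kwds={'maxlags': 5})

    print("Naive SEs:\n", naive.bse)
    print("Cluster-robust SEs:\n", clustered.bse)
    return naive, clustered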

Detecting Independence Violations

def check_independence(data, unit_col, observation_col):
    """
    Diagnose potential independence violations.
    """
    # Count observations per unit
    obs_per_unit = data.groupby(unit_col)[observation_col].count()

    results = {
        'total_observations': len(data),
        'unique_units': data[unit_col].nunique(),
        'effective_sample_ratio': data[unit_col].nunique() / len(data),
        'obs_per_unit_mean': obs_per_unit.mean(),
        'obs_per_unit_max': obs_per_unit.max(),
        'units_with_multiple': (obs_per_unit > 1).sum()
    }

    # Diagnosis
    if results['effective_sample_ratio'] < 0.5:
        results['warning'] = (
            f"High clustering: {results['total_observations']} observations "
            f"from only {results['unique_units']} units. "
            f"Effective sample size may be much smaller than n={results['total_observations']}."
        )

    return results


import pandas as pd

# Example: e-commerce data
np.random.seed(42)
n_users = 100
orders = []
for user_id in range(n_users):
    n_orders = np.random.poisson(3) + 1  # at least 1 order per user (Poisson + 1)
    for _ in range(n_orders):
        orders.append({
            'user_id': user_id,
            'order_value': np.random.exponential(50)
        })

order_df = pd.DataFrame(orders)
independence_check = check_independence(order_df, 'user_id', 'order_value')
print(f"Total orders: {independence_check['total_observations']}")
print(f"Unique users: {independence_check['unique_units']}")
print(f"Effective sample ratio: {independence_check['effective_sample_ratio']:.2f}")
print(f"Mean orders per user: {independence_check['obs_per_unit_mean']:.1f}")

What Independence Violations Do to Your Analysis

def demonstrate_clustering_effect(n_sims=5000):
    """
    Show how clustering inflates Type I error.
    """
    np.random.seed(42)

    # Scenario: 10 clusters of 10 observations each
    n_clusters = 10
    cluster_size = 10

    rejections_naive = 0
    rejections_correct = 0

    for _ in range(n_sims):
        # Generate clustered data (null is true - no group difference)
        # Treatment and control each get some clusters
        data = []
        for cluster in range(n_clusters * 2):
            group = 'treatment' if cluster >= n_clusters else 'control'
            cluster_effect = np.random.normal(0, 5)  # Cluster-level variation

            for _ in range(cluster_size):
                value = cluster_effect + np.random.normal(0, 1)  # Within-cluster
                data.append({'group': group, 'value': value, 'cluster': cluster})

        df = pd.DataFrame(data)

        # Naive analysis (ignores clustering)
        control = df[df['group'] == 'control']['value']
        treatment = df[df['group'] == 'treatment']['value']
        _, p_naive = stats.ttest_ind(control, treatment)
        if p_naive < 0.05:
            rejections_naive += 1

        # Correct analysis (cluster means)
        cluster_means = df.groupby(['group', 'cluster'])['value'].mean().reset_index()
        control_means = cluster_means[cluster_means['group'] == 'control']['value']
        treatment_means = cluster_means[cluster_means['group'] == 'treatment']['value']
        _, p_correct = stats.ttest_ind(control_means, treatment_means)
        if p_correct < 0.05:
            rejections_correct += 1

    print("Type I Error Rates with Clustered Data:")
    print(f"  Naive (ignores clustering): {rejections_naive / n_sims:.3f}")
    print(f"  Correct (uses cluster means): {rejections_correct / n_sims:.3f}")
    print(f"  Nominal rate: 0.050")


demonstrate_clustering_effect()
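
Aggregating to cluster means is simple but discards within-cluster information. A mixed model keeps every observation while modeling the cluster structure explicitly. Here is a minimal sketch with statsmodels (assumed installed), reusing the 'value', 'group', and 'cluster' columns from the simulation above.

import statsmodels.formula.api as smf

def mixed_model_sketch(df):
    """
    Sketch: random-intercept mixed model as an alternative
    to aggregating to cluster means.
    """
    # The random intercept absorbs cluster-level variation, so the
    # group effect is tested against an honest amount of uncertainty
    model = smf.mixedlm('value ~ group', data=df, groups=df['cluster'])
    result = model.fit()
    print(result.summary())
    return result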

Equal Variance (Homoscedasticity)

Why It Matters

When groups have unequal variances, the pooled variance estimate is wrong, affecting standard errors and p-values.

def demonstrate_variance_problem():
    """
    Show how unequal variance affects Type I error.
    """
    np.random.seed(42)
    n_sims = 10000

    scenarios = [
        ('Equal var, equal n', [30, 30], [1, 1]),
        ('Unequal var, equal n', [30, 30], [1, 4]),
        ('Larger var in smaller group', [15, 45], [4, 1]),  # Inflates Type I
        ('Larger var in larger group', [45, 15], [4, 1]),   # Conservative
    ]

    results = {}
    for name, ns, vars_ in scenarios:
        rejections_standard = 0
        rejections_welch = 0

        for _ in range(n_sims):
            g1 = np.random.normal(0, np.sqrt(vars_[0]), ns[0])
            g2 = np.random.normal(0, np.sqrt(vars_[1]), ns[1])

            _, p_standard = stats.ttest_ind(g1, g2, equal_var=True)
            _, p_welch = stats.ttest_ind(g1, g2, equal_var=False)

            if p_standard < 0.05:
                rejections_standard += 1
            if p_welch < 0.05:
                rejections_welch += 1

        results[name] = {
            'standard': rejections_standard / n_sims,
            'welch': rejections_welch / n_sims
        }

    print("Type I Error Rates (null true, alpha=0.05):")
    print(f"{'Scenario':<40} {'Standard':>10} {'Welch':>10}")
    print("-" * 60)
    for name, rates in results.items():
        print(f"{name:<40} {rates['standard']:>10.3f} {rates['welch']:>10.3f}")


demonstrate_variance_problem()

Testing for Equal Variance

from scipy.stats import levene, bartlett

def test_homogeneity(*groups):
    """
    Test for equal variances across groups.
    """
    # Levene's test (robust to non-normality)
    levene_stat, levene_p = levene(*groups, center='median')

    # Bartlett's test (assumes normality)
    bartlett_stat, bartlett_p = bartlett(*groups)

    # Variance ratio (rule of thumb)
    variances = [np.var(g, ddof=1) for g in groups]
    var_ratio = max(variances) / min(variances)

    return {
        'levene_p': levene_p,
        'bartlett_p': bartlett_p,
        'variance_ratio': var_ratio,
        'variances': variances,
        'recommendation': 'Use Welch' if var_ratio > 2 else 'Standard OK'
    }


# Example
g1 = np.random.normal(50, 5, 30)   # SD = 5
g2 = np.random.normal(50, 15, 30)  # SD = 15

result = test_homogeneity(g1, g2)
print(f"Levene's test p-value: {result['levene_p']:.4f}")
print(f"Variance ratio: {result['variance_ratio']:.1f}")
print(f"Recommendation: {result['recommendation']}")

Solutions for Unequal Variance

Problem                  Solution
---------------------    ----------------------
Two groups               Welch's t-test
Multiple groups          Welch's ANOVA
Post-hoc comparisons     Games-Howell
Regression               Robust standard errors

from scipy.stats import alexandergovern

def robust_group_comparison(*groups, group_names=None):
    """
    Compare groups without assuming equal variance.
    """
    if group_names is None:
        group_names = [f'Group {i+1}' for i in range(len(groups))]

    # Two groups: Welch's t-test
    if len(groups) == 2:
        stat, p = stats.ttest_ind(groups[0], groups[1], equal_var=False)
        test_name = "Welch's t-test"
    # More groups: Alexander-Govern test, a Welch-type one-way test
    # that does not assume equal variances (SciPy has no direct
    # Welch's ANOVA implementation)
    else:
        result = alexandergovern(*groups)
        stat, p = result.statistic, result.pvalue
        test_name = "Alexander-Govern (Welch-type) test"

    return {
        'test': test_name,
        'statistic': stat,
        'p_value': p,
        'means': {name: np.mean(g) for name, g in zip(group_names, groups)},
        'sds': {name: np.std(g, ddof=1) for name, g in zip(group_names, groups)}
    }
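
SciPy has no Games-Howell implementation. One option, sketched below, is pingouin's pairwise_gameshowell (assuming pingouin is installed); unlike Tukey's HSD, it assumes neither equal variances nor equal sample sizes.

import numpy as np
import pandas as pd
import pingouin as pg

def games_howell_sketch(groups, group_names):
    """
    Sketch: Games-Howell post-hoc comparisons via pingouin.
    """
    # Stack the groups into long format, as pingouin expects
    df = pd.DataFrame({
        'value': np.concatenate(groups),
        'group': np.repeat(group_names, [len(g) for g in groups]),
    })
    return pg.pairwise_gameshowell(data=df, dv='value', between='group')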

Normality

The Most Overrated Assumption

Normality is the assumption analysts worry about most but matters least.

def why_normality_overrated():
    """
    Demonstrate why normality is often not critical.
    """
    print("Why Normality Is Overrated:")
    print("=" * 50)
    print()
    print("1. CENTRAL LIMIT THEOREM")
    print("   Sampling distribution of means → Normal")
    print("   regardless of population distribution")
    print("   (with sufficient n)")
    print()
    print("2. ROBUSTNESS OF T-TEST")
    print("   Two-sample t-test is remarkably robust")
    print("   to non-normality, especially with equal n")
    print()
    print("3. NORMALITY TESTS ARE PROBLEMATIC")
    print("   Small n: Not enough power to detect violations")
    print("   Large n: Rejects trivial deviations")
    print()
    print("4. WHAT ACTUALLY MATTERS")
    print("   - Severe outliers (affect mean)")
    print("   - Extreme skewness with small n")
    print("   - Heavy tails (affect variance estimates)")


why_normality_overrated()

When Normality Actually Matters

def when_normality_matters():
    """
    Cases where normality genuinely matters.
    """
    cases = {
        'Small samples (n < 15)': {
            'why': 'CLT hasn\'t kicked in yet',
            'solution': 'Use exact tests or bootstrap'
        },
        'Prediction intervals': {
            'why': 'Individual predictions need distribution assumption',
            'solution': 'Use quantile regression or bootstrap'
        },
        'Variance/dispersion tests': {
            'why': 'These are sensitive to distributional form',
            'solution': 'Use robust alternatives (Levene with median)'
        },
        'Very heavy tails': {
            'why': 'Sample mean may not converge quickly',
            'solution': 'Trim, Winsorize, or use median-based methods'
        },
        'Maximum likelihood estimation': {
            'why': 'Efficiency depends on correct distribution',
            'solution': 'Use robust or quasi-maximum likelihood'
        }
    }
    return cases
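
For the heavy-tails case, trimming and Winsorizing are one-liners in SciPy. A quick illustration on simulated heavy-tailed data:

import numpy as np
from scipy import stats
from scipy.stats import mstats

np.random.seed(42)
heavy = stats.t(df=2).rvs(size=200) * 10 + 100  # heavy-tailed sample

print(f"Raw mean:         {np.mean(heavy):.2f}")
print(f"20% trimmed mean: {stats.trim_mean(heavy, 0.2):.2f}")
print(f"Median:           {np.median(heavy):.2f}")

# Winsorizing caps the extreme 5% on each side instead of dropping it
winsorized = mstats.winsorize(heavy, limits=[0.05, 0.05])
print(f"Winsorized mean:  {np.mean(winsorized):.2f}")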

Assessing Normality (When You Need To)

from scipy.stats import shapiro, normaltest, probplot

def assess_normality(data, alpha=0.05):
    """
    Comprehensive normality assessment.
    Emphasizes visual diagnostics over tests.
    """
    n = len(data)

    results = {
        'n': n,
        'mean': np.mean(data),
        'median': np.median(data),
        'std': np.std(data, ddof=1),
        'skewness': stats.skew(data),
        'kurtosis': stats.kurtosis(data)  # Excess kurtosis
    }

    # Tests (with caveats)
    if n >= 3:
        if n <= 5000:
            shapiro_stat, shapiro_p = shapiro(data)
            results['shapiro_p'] = shapiro_p

        if n >= 20:
            dagostino_stat, dagostino_p = normaltest(data)
            results['dagostino_p'] = dagostino_p

    # Practical assessment
    results['assessment'] = []

    if abs(results['skewness']) > 2:
        results['assessment'].append('Severely skewed')
    elif abs(results['skewness']) > 1:
        results['assessment'].append('Moderately skewed')

    if results['kurtosis'] > 7:
        results['assessment'].append('Very heavy tails')
    elif results['kurtosis'] > 3:
        results['assessment'].append('Somewhat heavy tails')
    elif results['kurtosis'] < -1:
        results['assessment'].append('Light tails (platykurtic)')

    # CLT guidance
    if n >= 100:
        results['clt_guidance'] = 'Large sample—CLT applies, normality usually fine'
    elif n >= 30:
        if abs(results['skewness']) < 1:
            results['clt_guidance'] = 'Moderate sample, mild skew—probably OK'
        else:
            results['clt_guidance'] = 'Moderate sample, notable skew—consider robust methods'
    else:
        results['clt_guidance'] = 'Small sample—verify normality or use alternatives'

    return results


def plot_normality_diagnostics(data, title=''):
    """
    Visual diagnostics for normality.
    """
    fig, axes = plt.subplots(1, 3, figsize=(15, 4))

    # Histogram with normal overlay
    axes[0].hist(data, bins='auto', density=True, alpha=0.7)
    x = np.linspace(min(data), max(data), 100)
    axes[0].plot(x, stats.norm.pdf(x, np.mean(data), np.std(data)), 'r-', lw=2)
    axes[0].set_title('Histogram with Normal Overlay')
    axes[0].set_xlabel('Value')
    axes[0].set_ylabel('Density')

    # Q-Q plot
    probplot(data, dist="norm", plot=axes[1])
    axes[1].set_title('Q-Q Plot')

    # Box plot
    axes[2].boxplot(data, vert=True)
    axes[2].set_title('Box Plot')
    axes[2].set_ylabel('Value')

    if title:
        fig.suptitle(title, fontsize=14)

    plt.tight_layout()
    return fig


# Example with different distributions
np.random.seed(42)
normal_data = np.random.normal(100, 15, 50)
skewed_data = np.random.exponential(20, 50) + 50
heavy_tailed = np.random.standard_t(3, 50) * 15 + 100

for name, data in [('Normal', normal_data),
                   ('Skewed', skewed_data),
                   ('Heavy-tailed', heavy_tailed)]:
    result = assess_normality(data)
    print(f"\n{name} data (n={result['n']}):")
    print(f"  Skewness: {result['skewness']:.2f}")
    print(f"  Kurtosis: {result['kurtosis']:.2f}")
    print(f"  Assessment: {', '.join(result['assessment']) or 'Reasonably normal'}")
    print(f"  CLT guidance: {result['clt_guidance']}")

The Decision Framework

def assumption_decision_framework():
    """
    Systematic approach to checking assumptions and choosing methods.
    """
    framework = {
        'step1_independence': {
            'question': 'Are observations independent?',
            'checks': [
                'Multiple observations per unit?',
                'Clustering structure?',
                'Time series/autocorrelation?',
                'Network effects?'
            ],
            'if_violated': 'Use mixed models, clustered SEs, or aggregate data'
        },
        'step2_outliers': {
            'question': 'Are there extreme outliers?',
            'checks': [
                'Values > 3 SDs from mean?',
                'Points clearly separated from bulk?',
                'Data entry errors?'
            ],
            'if_violated': 'Investigate source; consider robust methods or trimming'
        },
        'step3_variance': {
            'question': 'Are variances approximately equal?',
            'checks': [
                'Variance ratio > 2-3?',
                'Levene test significant?',
                'Visual spread differs?'
            ],
            'if_violated': 'Use Welch variants or robust standard errors'
        },
        'step4_normality': {
            'question': 'Is normality adequate?',
            'checks': [
                'Q-Q plot reasonably linear?',
                'Skewness < 1-2?',
                'No severe outliers?',
                'Sample size adequate for CLT?'
            ],
            'if_violated': 'Consider bootstrap, rank tests, or transformations'
        }
    }

    return framework


def quick_diagnostic_report(data, groups=None, group_labels=None):
    """
    Generate a quick diagnostic report for assumption checking.
    """
    report = []
    report.append("=" * 60)
    report.append("QUICK DIAGNOSTIC REPORT")
    report.append("=" * 60)

    if groups is None:
        # Single sample
        n = len(data)
        report.append(f"\nSample size: {n}")
        report.append(f"Mean: {np.mean(data):.3f}")
        report.append(f"SD: {np.std(data, ddof=1):.3f}")
        report.append(f"Skewness: {stats.skew(data):.3f}")
        report.append(f"Kurtosis: {stats.kurtosis(data):.3f}")

        # Outlier check
        z_scores = np.abs(stats.zscore(data))
        n_outliers = np.sum(z_scores > 3)
        report.append(f"Potential outliers (|z| > 3): {n_outliers}")

    else:
        # Multiple groups
        if group_labels is None:
            group_labels = [f'Group {i+1}' for i in range(len(groups))]

        report.append("\nGROUP SUMMARY:")
        report.append("-" * 40)

        for label, g in zip(group_labels, groups):
            report.append(f"\n{label}:")
            report.append(f"  n = {len(g)}")
            report.append(f"  Mean = {np.mean(g):.3f}")
            report.append(f"  SD = {np.std(g, ddof=1):.3f}")
            report.append(f"  Skewness = {stats.skew(g):.3f}")

        # Variance check
        variances = [np.var(g, ddof=1) for g in groups]
        var_ratio = max(variances) / min(variances)
        levene_stat, levene_p = levene(*groups, center='median')

        report.append("\nVARIANCE CHECK:")
        report.append("-" * 40)
        report.append(f"Variance ratio: {var_ratio:.2f}")
        report.append(f"Levene's test p: {levene_p:.4f}")

        if var_ratio > 3:
            report.append("⚠️  Substantial variance inequality—use Welch methods")
        elif var_ratio > 2:
            report.append("⚠️  Moderate variance inequality—consider Welch methods")
        else:
            report.append("✓  Variances reasonably similar")

        # Sample size check
        ns = [len(g) for g in groups]
        n_ratio = max(ns) / min(ns)

        report.append("\nSAMPLE SIZE CHECK:")
        report.append("-" * 40)
        report.append(f"Sample sizes: {ns}")
        report.append(f"Size ratio: {n_ratio:.2f}")

        if n_ratio > 2 and var_ratio > 2:
            report.append("⚠️  Unequal n with unequal variance—definitely use Welch")

    report.append("\n" + "=" * 60)

    return "\n".join(report)


# Example
np.random.seed(42)
group_a = np.random.normal(50, 5, 30)
group_b = np.random.normal(55, 15, 20)  # Different mean, variance, and n

print(quick_diagnostic_report(None, [group_a, group_b], ['Control', 'Treatment']))

What To Do When Assumptions Fail

Decision Tree

Is independence violated?
├── YES → Structural solution needed
│         - Mixed models for repeated measures
│         - Clustered standard errors
│         - Aggregate to independent units
│
└── NO → Continue

         Are variances very unequal (ratio > 3)?
         ├── YES → Use robust variance methods
         │         - Welch's t-test / ANOVA
         │         - Games-Howell post-hoc
         │         - Robust standard errors
         │
         └── NO → Continue

                  Are there severe outliers?
                  ├── YES → Investigate and decide
                  │         - If errors: fix or remove
                  │         - If real: use robust methods
                  │         - Trim, Winsorize, or rank-based
                  │
                  └── NO → Continue

                           Is distribution very non-normal?
                           ├── Small n + severe skew →
                           │   Bootstrap or rank-based tests
                           │
                           └── Large n OR mild skew →
                               Standard methods usually OK
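
The bottom of the tree points to bootstrap and rank-based tests, which haven't appeared in code yet. Below is a minimal sketch of both, assuming independent observations: a percentile bootstrap confidence interval for the difference in means, plus Mann-Whitney U from SciPy.

import numpy as np
from scipy import stats

def bootstrap_mean_diff(g1, g2, n_boot=10000, seed=42):
    """
    Percentile bootstrap CI for the difference in means.
    No normality assumption; requires independence.
    """
    rng = np.random.default_rng(seed)
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        # Resample each group with replacement
        b1 = rng.choice(g1, size=len(g1), replace=True)
        b2 = rng.choice(g2, size=len(g2), replace=True)
        diffs[i] = b1.mean() - b2.mean()
    lo, hi = np.percentile(diffs, [2.5, 97.5])
    return {'diff': np.mean(g1) - np.mean(g2), 'ci_95': (lo, hi)}


# Example: skewed data, modest samples
np.random.seed(42)
g1 = np.random.exponential(10, 25)
g2 = np.random.exponential(14, 25)

print(bootstrap_mean_diff(g1, g2))

# Rank-based alternative
u_stat, p = stats.mannwhitneyu(g1, g2, alternative='two-sided')
print(f"Mann-Whitney U: U = {u_stat:.1f}, p = {p:.4f}")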

Method Selection Guide

def recommend_method(n_groups, n_per_group, variance_ratio,
                     skewness, has_outliers, independence_ok):
    """
    Recommend analysis method based on data characteristics.
    """
    if not independence_ok:
        return {
            'method': 'Address independence first',
            'options': [
                'Mixed models (for repeated measures)',
                'Clustered standard errors',
                'Aggregate to independent unit level',
                'Consult statistician'
            ],
            'severity': 'CRITICAL'
        }

    # Two groups
    if n_groups == 2:
        if variance_ratio > 2 or has_outliers:
            return {
                'method': "Welch's t-test",
                'backup': 'Mann-Whitney U (if outliers severe)',
                'severity': 'MODERATE'
            }
        elif skewness > 2 and n_per_group < 30:
            return {
                'method': 'Mann-Whitney U or bootstrap',
                'severity': 'MODERATE'
            }
        else:
            return {
                'method': "Standard or Welch's t-test",
                'note': "Welch's is safe default",
                'severity': 'LOW'
            }

    # Multiple groups
    else:
        if variance_ratio > 2:
            return {
                'method': "Welch's ANOVA",
                'posthoc': 'Games-Howell',
                'severity': 'MODERATE'
            }
        elif skewness > 2 and n_per_group < 30:
            return {
                'method': 'Kruskal-Wallis',
                'posthoc': "Dunn's test",
                'severity': 'MODERATE'
            }
        else:
            return {
                'method': 'Standard ANOVA or Welch\'s ANOVA',
                'posthoc': "Tukey's HSD or Games-Howell",
                'severity': 'LOW'
            }
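
Kruskal-Wallis is in SciPy, but Dunn's post-hoc test is not; the sketch below uses scikit-posthocs (assumed installed) for the pairwise comparisons.

import numpy as np
import pandas as pd
from scipy import stats
import scikit_posthocs as sp

def kruskal_with_dunn(groups, group_names):
    """
    Sketch: Kruskal-Wallis omnibus test followed by Dunn's
    post-hoc comparisons with Holm correction.
    """
    h_stat, p = stats.kruskal(*groups)
    print(f"Kruskal-Wallis: H = {h_stat:.3f}, p = {p:.4f}")

    # Long format for scikit-posthocs
    df = pd.DataFrame({
        'value': np.concatenate(groups),
        'group': np.repeat(group_names, [len(g) for g in groups]),
    })
    return sp.posthoc_dunn(df, val_col='value', group_col='group',
                           p_adjust='holm')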

Common Mistakes

Mistake 1: Testing Assumptions on Small Samples

Small samples lack power to detect meaningful violations:

def assumption_test_power_problem():
    """
    Show that assumption tests are problematic.
    """
    np.random.seed(42)
    n_sims = 1000

    # Population is clearly non-normal (exponential)
    results = []

    for n in [10, 20, 50, 100, 500]:
        rejections = 0
        for _ in range(n_sims):
            sample = np.random.exponential(10, n)
            _, p = shapiro(sample) if n <= 5000 else normaltest(sample)
            if p < 0.05:
                rejections += 1
        results.append({'n': n, 'power': rejections / n_sims})

    print("Shapiro-Wilk power to detect exponential distribution:")
    print("(Population is clearly non-normal)")
    for r in results:
        print(f"  n = {r['n']:3d}: {r['power']:.1%} rejection rate")


assumption_test_power_problem()

Mistake 2: Transforming Without Understanding

def transformation_pitfalls():
    """
    Common problems with data transformations.
    """
    pitfalls = {
        'Changing the hypothesis': {
            'problem': 'Log transform tests geometric means, not arithmetic',
            'guidance': 'Ask: Is this what I want to estimate?'
        },
        'Zeros': {
            'problem': 'log(0) is undefined',
            'guidance': 'Adding small constant is arbitrary; consider two-part models'
        },
        'Negative values': {
            'problem': "Can't log negative numbers",
            'guidance': 'Shift data or use different transformation'
        },
        'Back-transformation': {
            'problem': 'Mean of log-transformed data ≠ log of mean',
            'guidance': 'Report in original scale with correct back-transformation'
        }
    }
    return pitfalls
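
To make the back-transformation pitfall concrete, here is a small worked sketch on simulated log-normal data: the t-test on logs compares geometric means, and exponentiating the interval keeps the result on that scale (normal-approximation CI for simplicity).

import numpy as np
from scipy import stats

np.random.seed(42)
# Log-normal, revenue-like data: all positive, right-skewed
g1 = np.random.lognormal(mean=3.0, sigma=1.0, size=100)
g2 = np.random.lognormal(mean=3.2, sigma=1.0, size=100)

log1, log2 = np.log(g1), np.log(g2)
t_stat, p = stats.ttest_ind(log1, log2, equal_var=False)

# Difference of log means = log of the ratio of geometric means
diff = log1.mean() - log2.mean()
se = np.sqrt(log1.var(ddof=1)/len(log1) + log2.var(ddof=1)/len(log2))
ratio = np.exp(diff)                                 # geometric mean ratio
ci = np.exp(diff - 1.96 * se), np.exp(diff + 1.96 * se)

print(f"Arithmetic means: {g1.mean():.1f} vs {g2.mean():.1f}")
print(f"Geometric mean ratio: {ratio:.3f}, 95% CI ({ci[0]:.3f}, {ci[1]:.3f})")
print(f"Welch t-test on logs: p = {p:.4f}")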

Mistake 3: Ignoring Practical Significance of Violations

def violation_severity_guide():
    """
    When violations are and aren't practically important.
    """
    guide = {
        'Normality': {
            'trivial': 'Slight skew with n > 30',
            'moderate': 'Moderate skew with n > 50',
            'serious': 'Severe skew with n < 30, heavy outliers'
        },
        'Equal variance': {
            'trivial': 'Ratio < 2 with equal n',
            'moderate': 'Ratio 2-4 with similar n',
            'serious': 'Ratio > 4, especially with unequal n'
        },
        'Independence': {
            'trivial': 'Never—always address',
            'moderate': 'Never—always address',
            'serious': 'Any violation is serious'
        }
    }
    return guide

Complete Diagnostic Workflow

def full_diagnostic_workflow(groups, group_names, alpha=0.05):
    """
    Complete assumption checking and method recommendation.
    """
    print("=" * 70)
    print("COMPREHENSIVE ASSUMPTION CHECK")
    print("=" * 70)

    # 1. Basic descriptives
    print("\n1. DESCRIPTIVE STATISTICS")
    print("-" * 50)
    for name, g in zip(group_names, groups):
        print(f"\n{name} (n={len(g)}):")
        print(f"  Mean: {np.mean(g):.3f}")
        print(f"  Median: {np.median(g):.3f}")
        print(f"  SD: {np.std(g, ddof=1):.3f}")
        print(f"  Skewness: {stats.skew(g):.3f}")
        print(f"  Kurtosis: {stats.kurtosis(g):.3f}")

        # Outliers
        z = np.abs(stats.zscore(g))
        outliers = np.sum(z > 3)
        if outliers > 0:
            print(f"  ⚠️  Potential outliers: {outliers}")

    # 2. Variance homogeneity
    print("\n\n2. VARIANCE HOMOGENEITY")
    print("-" * 50)
    variances = [np.var(g, ddof=1) for g in groups]
    var_ratio = max(variances) / min(variances)
    lev_stat, lev_p = levene(*groups, center='median')

    print(f"Variances: {[f'{v:.2f}' for v in variances]}")
    print(f"Variance ratio: {var_ratio:.2f}")
    print(f"Levene's test: p = {lev_p:.4f}")

    if var_ratio > 3:
        print("❌ FAIL: Substantial variance inequality")
        variance_ok = False
    elif var_ratio > 2:
        print("⚠️  WARNING: Moderate variance inequality")
        variance_ok = False
    else:
        print("✓ PASS: Variances reasonably similar")
        variance_ok = True

    # 3. Normality (per group)
    print("\n\n3. NORMALITY CHECK")
    print("-" * 50)
    normality_ok = True

    for name, g in zip(group_names, groups):
        n = len(g)
        skew = stats.skew(g)
        kurt = stats.kurtosis(g)

        issues = []
        if abs(skew) > 2:
            issues.append(f"severe skew ({skew:.2f})")
        elif abs(skew) > 1:
            issues.append(f"moderate skew ({skew:.2f})")

        if kurt > 7:
            issues.append(f"very heavy tails (kurtosis={kurt:.2f})")

        if n < 30 and issues:
            print(f"\n{name}: ⚠️  Small n with {', '.join(issues)}")
            normality_ok = False
        elif issues:
            print(f"\n{name}: Note {', '.join(issues)} but n={n} (CLT helps)")
        else:
            print(f"\n{name}: ✓ Reasonably normal")

    # 4. Recommendation
    print("\n\n4. RECOMMENDATION")
    print("-" * 50)

    if len(groups) == 2:
        if not variance_ok:
            print("➤ Use Welch's t-test (handles unequal variance)")
        elif not normality_ok:
            print("➤ Consider Mann-Whitney U or bootstrap")
        else:
            print("➤ Standard t-test OK, but Welch's is safe default")
    else:
        if not variance_ok:
            print("➤ Use Welch's ANOVA with Games-Howell post-hoc")
        elif not normality_ok:
            print("➤ Consider Kruskal-Wallis with Dunn's test")
        else:
            print("➤ Standard ANOVA OK, but Welch's is safe default")

    # 5. Run recommended analysis
    print("\n\n5. RESULTS")
    print("-" * 50)

    if len(groups) == 2:
        t_stat, p_welch = stats.ttest_ind(groups[0], groups[1], equal_var=False)
        print(f"Welch's t-test: t = {t_stat:.3f}, p = {p_welch:.4f}")

        # Effect size
        pooled_std = np.sqrt(((len(groups[0])-1)*np.var(groups[0], ddof=1) +
                             (len(groups[1])-1)*np.var(groups[1], ddof=1)) /
                            (len(groups[0]) + len(groups[1]) - 2))
        cohens_d = (np.mean(groups[0]) - np.mean(groups[1])) / pooled_std
        print(f"Cohen's d: {cohens_d:.3f}")
    else:
        result = alexandergovern(*groups)
        print(f"Alexander-Govern (Welch-type ANOVA): "
              f"stat = {result.statistic:.3f}, p = {result.pvalue:.4f}")

    print("\n" + "=" * 70)


# Example
np.random.seed(42)
control = np.random.normal(50, 5, 40)
treatment_a = np.random.exponential(8, 25) + 48  # Skewed, different variance
treatment_b = np.random.normal(55, 12, 35)

full_diagnostic_workflow(
    [control, treatment_a, treatment_b],
    ['Control', 'Treatment A', 'Treatment B']
)

R Implementation

# Complete assumption checking in R

check_assumptions <- function(data, value_col, group_col) {
  # Load required packages
  library(car)
  library(moments)

  groups <- split(data[[value_col]], data[[group_col]])

  cat("ASSUMPTION CHECKS\n")
  cat(rep("=", 50), "\n\n")

  # 1. Descriptives
  cat("1. DESCRIPTIVES\n")
  cat(rep("-", 30), "\n")
  for (name in names(groups)) {
    g <- groups[[name]]
    cat(sprintf("\n%s (n=%d):\n", name, length(g)))
    cat(sprintf("  Mean: %.3f\n", mean(g)))
    cat(sprintf("  SD: %.3f\n", sd(g)))
    cat(sprintf("  Skewness: %.3f\n", skewness(g)))
  }

  # 2. Variance homogeneity
  cat("\n\n2. VARIANCE HOMOGENEITY\n")
  cat(rep("-", 30), "\n")
  levene_result <- leveneTest(data[[value_col]] ~ data[[group_col]],
                               center = median)
  print(levene_result)

  # 3. Normality by group
  cat("\n3. NORMALITY (Shapiro-Wilk by group)\n")
  cat(rep("-", 30), "\n")
  for (name in names(groups)) {
    g <- groups[[name]]
    if (length(g) >= 3 && length(g) <= 5000) {
      sw <- shapiro.test(g)
      cat(sprintf("%s: W = %.4f, p = %.4f\n", name, sw$statistic, sw$p.value))
    }
  }

  # 4. Run both standard and Welch
  cat("\n4. ANALYSIS COMPARISON\n")
  cat(rep("-", 30), "\n")

  if (length(groups) == 2) {
    cat("\nStandard t-test:\n")
    print(t.test(data[[value_col]] ~ data[[group_col]], var.equal = TRUE))

    cat("\nWelch's t-test:\n")
    print(t.test(data[[value_col]] ~ data[[group_col]], var.equal = FALSE))
  } else {
    cat("\nStandard ANOVA:\n")
    print(summary(aov(data[[value_col]] ~ data[[group_col]])))

    cat("\nWelch's ANOVA:\n")
    print(oneway.test(data[[value_col]] ~ data[[group_col]], var.equal = FALSE))
  }
}

# Example usage
# df <- data.frame(
#   value = c(rnorm(30, 50, 5), rnorm(30, 55, 15)),
#   group = rep(c("Control", "Treatment"), each = 30)
# )
# check_assumptions(df, "value", "group")

Key Takeaway

Statistical assumptions exist for good reasons, but their importance varies dramatically. Independence is critical and must be addressed structurally—no robust method saves you from violated independence. Equal variance matters, especially with unequal sample sizes, but is easily handled with Welch variants. Normality matters least due to the Central Limit Theorem. When in doubt, use robust methods: they perform almost as well when assumptions hold and much better when they don't.



Frequently Asked Questions

Which assumption is most important?
Independence. Violations of independence (repeated measures, clustering, time-series) fundamentally break standard tests. Normality and equal variance violations are often manageable; independence violations require different methods entirely.

Should I run normality tests before every analysis?
No. With large samples, normality tests reject trivial deviations that don't affect your results. With small samples, they lack power to detect meaningful violations. Use visual diagnostics (Q-Q plots) instead.

What sample size makes normality irrelevant?
For means, the Central Limit Theorem kicks in around n=30 per group for moderate skewness. With severe skew or outliers, you may need n>100 or should use robust methods regardless.

Can I transform data to fix assumption violations?
Sometimes. Log transforms can fix right-skew and stabilize variance simultaneously. But transformations change what you're estimating—you're now testing geometric means, not arithmetic means. Consider whether that's what you want.
