Multi-Group Comparisons

One-Way ANOVA: Assumptions, Effect Sizes, and Proper Reporting

A practical guide to one-way ANOVA covering assumptions, diagnostics, effect size measures (eta-squared, omega-squared), and how to report results properly.

Quick Hits

  • ANOVA is robust to moderate normality violations with equal sample sizes
  • Unequal variances are more problematic than non-normality—use Welch's ANOVA
  • Eta-squared overestimates population effect size; omega-squared is less biased
  • Always report effect sizes and confidence intervals, not just F and p

TL;DR

One-way ANOVA compares means across groups by partitioning variance. It's robust to moderate normality violations but sensitive to unequal variances—use Welch's ANOVA when variances differ. Always report effect sizes (omega-squared preferred over eta-squared) alongside F-statistics. A complete report includes group means, F-statistic, degrees of freedom, p-value, effect size with interpretation, and post-hoc results.


The ANOVA Framework

ANOVA partitions total variance into between-group and within-group components:

$$SS_{total} = SS_{between} + SS_{within}$$

The F-statistic compares these:

$$F = \frac{MS_{between}}{MS_{within}} = \frac{SS_{between}/(k-1)}{SS_{within}/(N-k)}$$

Where k = number of groups, N = total sample size.

Large F means between-group variance exceeds within-group variance more than expected by chance.
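
To make the partition concrete, here's a minimal sketch (toy numbers, purely illustrative) that computes both sums of squares and F by hand:

import numpy as np
from scipy import stats

# Three small groups (illustrative values)
groups = [np.array([4., 5., 6.]), np.array([6., 7., 8.]), np.array([5., 6., 7.])]
k = len(groups)
N = sum(len(g) for g in groups)
grand_mean = np.mean(np.concatenate(groups))

# Partition the total variability
ss_between = sum(len(g) * (g.mean() - grand_mean)**2 for g in groups)
ss_within = sum(((g - g.mean())**2).sum() for g in groups)

f_stat = (ss_between / (k - 1)) / (ss_within / (N - k))
p_value = stats.f.sf(f_stat, k - 1, N - k)
print(f"F({k - 1}, {N - k}) = {f_stat:.2f}, p = {p_value:.3f}")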


Assumptions

1. Independence

Observations must be independent—one person's score doesn't affect another's.

Violations: Repeated measures on same subjects, clustered data (students in classrooms), time series.

Consequence: Standard errors are wrong; p-values are unreliable.

Solution: Use repeated-measures ANOVA, mixed models, or cluster-robust standard errors.
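
For clustered data, one route is a mixed model with a random intercept per cluster; here's a sketch with statsmodels on synthetic data (all column names are hypothetical):

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic clustered data: students nested in classrooms
rng = np.random.default_rng(0)
n_classrooms, n_students = 10, 20
classroom = np.repeat(np.arange(n_classrooms), n_students)
treatment = np.repeat(rng.choice(['A', 'B'], size=n_classrooms), n_students)
room_effect = np.repeat(rng.normal(0, 3, n_classrooms), n_students)
score = 50 + 2 * (treatment == 'B') + room_effect + rng.normal(0, 10, len(classroom))

df = pd.DataFrame({'score': score, 'treatment': treatment, 'classroom': classroom})

# The random intercept absorbs the within-classroom correlation
model = smf.mixedlm("score ~ C(treatment)", data=df, groups=df["classroom"])
print(model.fit().summary())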

2. Normality

Data within each group should be approximately normally distributed.

How to check:

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

def check_normality(groups, group_names=None):
    """Visual and statistical normality checks."""
    k = len(groups)
    if group_names is None:
        group_names = [f'Group {i+1}' for i in range(k)]

    fig, axes = plt.subplots(2, k, figsize=(4*k, 8))

    for i, (g, name) in enumerate(zip(groups, group_names)):
        # Histogram
        axes[0, i].hist(g, bins='auto', edgecolor='black', alpha=0.7)
        axes[0, i].set_title(f'{name} Histogram')

        # Q-Q plot
        stats.probplot(g, dist="norm", plot=axes[1, i])
        axes[1, i].set_title(f'{name} Q-Q Plot')

        # Shapiro-Wilk test (interpret with care: over-sensitive at large n)
        w_stat, p = stats.shapiro(g)
        print(f'{name}: Shapiro-Wilk W = {w_stat:.3f}, p = {p:.3f}')

    plt.tight_layout()
    return fig

Robustness: ANOVA is robust to non-normality with:

  • Equal or near-equal sample sizes
  • n > 15-20 per group
  • Moderate skewness (|skew| < 2)

When it matters: Small samples, severe skewness, unequal group sizes.

3. Homogeneity of Variance (Homoscedasticity)

Groups should have similar variances.

How to check:

import numpy as np
from scipy.stats import levene

def check_variance_homogeneity(groups):
    """Test for equal variances."""
    # Levene's test (robust to non-normality)
    levene_stat, levene_p = levene(*groups, center='median')

    # Variance ratio (largest/smallest)
    variances = [np.var(g, ddof=1) for g in groups]
    variance_ratio = max(variances) / min(variances)

    return {
        'levene_statistic': levene_stat,
        'levene_p': levene_p,
        'variance_ratio': variance_ratio,
        'rule_of_thumb': 'OK' if variance_ratio < 3 else 'Concern'
    }

Rule of thumb: Variance ratio < 3 is usually acceptable. Larger ratios warrant Welch's ANOVA.

Consequence of violation: Type I error inflation when smaller groups have larger variance; conservatism when larger groups have larger variance.
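
A quick simulation makes that direction visible; this sketch (arbitrary parameters) pairs a small high-variance group with a large low-variance group under a true null and tracks how often standard ANOVA rejects:

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_sims = 5000
rejections = 0

for _ in range(n_sims):
    # True null: both means are 0, but the smaller group is noisier
    small_noisy = rng.normal(0, 3, 10)
    large_quiet = rng.normal(0, 1, 50)
    _, p = stats.f_oneway(small_noisy, large_quiet)
    rejections += p < 0.05

print(f"Empirical Type I rate: {rejections / n_sims:.3f} (nominal: 0.05)")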


Effect Sizes

P-values tell you whether an effect is statistically detectable; effect sizes tell you how large it is.

Eta-Squared (η²)

Proportion of variance explained by group membership:

$$\eta^2 = \frac{SS_{between}}{SS_{total}}$$

def eta_squared(groups):
    """Calculate eta-squared."""
    grand_mean = np.mean(np.concatenate(groups))

    ss_between = sum(len(g) * (np.mean(g) - grand_mean)**2 for g in groups)
    ss_total = sum(np.sum((g - grand_mean)**2) for g in groups)

    return ss_between / ss_total

Problem: Eta-squared is positively biased—it overestimates the population effect size, especially with small samples.

Omega-Squared (ω²)

Less biased estimate of population effect size:

$$\omega^2 = \frac{SS_{between} - (k-1) \cdot MS_{within}}{SS_{total} + MS_{within}}$$

def omega_squared(groups):
    """Calculate omega-squared (less biased than eta-squared)."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = np.mean(np.concatenate(groups))

    ss_between = sum(len(g) * (np.mean(g) - grand_mean)**2 for g in groups)
    ss_within = sum(np.sum((g - np.mean(g))**2) for g in groups)
    ss_total = ss_between + ss_within

    ms_within = ss_within / (n_total - k)

    omega_sq = (ss_between - (k - 1) * ms_within) / (ss_total + ms_within)

    return max(0, omega_sq)  # negative estimates are conventionally truncated to 0

Interpreting Effect Sizes

Effect Size | η² / ω² | Interpretation
----------- | ------- | -------------------------
Small       | 0.01    | 1% of variance explained
Medium      | 0.06    | 6% of variance explained
Large       | 0.14    | 14% of variance explained

Context matters: A "small" effect in psychology might be huge in medicine. Interpret relative to your field and practical significance.
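
For power analysis, these variance-explained measures convert directly to Cohen's f via $f = \sqrt{\eta^2 / (1 - \eta^2)}$; a one-line helper (the function name is my own):

import numpy as np

def cohens_f(variance_explained):
    """Convert η² (or ω²) to Cohen's f for power calculations."""
    return np.sqrt(variance_explained / (1 - variance_explained))

print(cohens_f(0.06))  # medium effect: f ≈ 0.25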

Partial Eta-Squared

In factorial designs, partial η² isolates the effect of one factor:

$$\eta_p^2 = \frac{SS_{effect}}{SS_{effect} + SS_{error}}$$

This is what most software reports by default. In a one-way design, partial η² equals plain η², since SS_effect + SS_error = SS_total; the two diverge only in factorial designs, where other factors absorb part of the total variance.


Complete ANOVA Analysis

import numpy as np
from scipy import stats
import pandas as pd

def complete_anova(groups, group_names=None, alpha=0.05):
    """
    Complete one-way ANOVA analysis with all components.
    """
    if group_names is None:
        group_names = [f'Group {i+1}' for i in range(len(groups))]

    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = np.mean(np.concatenate(groups))

    # Sums of squares
    ss_between = sum(len(g) * (np.mean(g) - grand_mean)**2 for g in groups)
    ss_within = sum(np.sum((g - np.mean(g))**2) for g in groups)
    ss_total = ss_between + ss_within

    # Degrees of freedom
    df_between = k - 1
    df_within = n_total - k
    df_total = n_total - 1

    # Mean squares
    ms_between = ss_between / df_between
    ms_within = ss_within / df_within

    # F-statistic and p-value
    f_stat = ms_between / ms_within
    p_value = stats.f.sf(f_stat, df_between, df_within)  # survival function avoids precision loss vs. 1 - cdf

    # Effect sizes
    eta_sq = ss_between / ss_total
    omega_sq = max(0, (ss_between - df_between * ms_within) / (ss_total + ms_within))

    # (A confidence interval for omega-squared would require the
    # non-central F distribution; omitted here.)

    # Group statistics
    group_stats = pd.DataFrame({
        'Group': group_names,
        'n': [len(g) for g in groups],
        'Mean': [np.mean(g) for g in groups],
        'SD': [np.std(g, ddof=1) for g in groups],
        'SE': [np.std(g, ddof=1) / np.sqrt(len(g)) for g in groups]
    })

    # ANOVA table
    anova_table = pd.DataFrame({
        'Source': ['Between Groups', 'Within Groups', 'Total'],
        'SS': [ss_between, ss_within, ss_total],
        'df': [df_between, df_within, df_total],
        'MS': [ms_between, ms_within, np.nan],
        'F': [f_stat, np.nan, np.nan],
        'p': [p_value, np.nan, np.nan]
    })

    return {
        'group_stats': group_stats,
        'anova_table': anova_table,
        'f_statistic': f_stat,
        'p_value': p_value,
        'df_between': df_between,
        'df_within': df_within,
        'eta_squared': eta_sq,
        'omega_squared': omega_sq,
        'significant': p_value < alpha
    }


# Example
np.random.seed(42)
control = np.random.normal(50, 10, 25)
treatment_a = np.random.normal(55, 10, 25)
treatment_b = np.random.normal(52, 10, 25)

result = complete_anova(
    [control, treatment_a, treatment_b],
    ['Control', 'Treatment A', 'Treatment B']
)

print("Group Statistics:")
print(result['group_stats'].to_string(index=False))
print("\nANOVA Table:")
print(result['anova_table'].to_string(index=False))
print(f"\nEffect Sizes:")
print(f"  η² = {result['eta_squared']:.3f}")
print(f"  ω² = {result['omega_squared']:.3f}")

Reporting Results

APA Style Format

A one-way ANOVA was conducted to compare the effect of treatment condition on test scores. There was a significant effect of treatment at the p < .05 level for the three conditions, F(2, 72) = 4.52, p = .014, ω² = .086 [95% CI: .01, .19]. Post-hoc comparisons using Tukey's HSD indicated that Treatment A (M = 55.2, SD = 9.8) was significantly higher than Control (M = 50.1, SD = 10.2), p = .012. Treatment B (M = 52.3, SD = 10.1) did not differ significantly from either Control or Treatment A.
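
The Tukey comparisons in a write-up like this can come straight from scipy.stats.tukey_hsd (SciPy 1.8+); a minimal sketch reusing the example groups from the analysis above:

from scipy.stats import tukey_hsd

# Pairwise comparisons among the three example groups
res = tukey_hsd(control, treatment_a, treatment_b)
print(res)  # pairwise mean differences with confidence intervals and p-values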

Elements to Include

  1. Test used: One-way ANOVA (or Welch's ANOVA)
  2. Purpose: What was compared
  3. F-statistic: F(df_between, df_within) = value
  4. P-value: Exact value or inequality
  5. Effect size: ω² or η² with interpretation
  6. Group means and SDs: For each group
  7. Post-hoc results: Which groups differ
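
Most of these elements can be assembled mechanically from the complete_anova output; here's a small helper as a sketch (the formatting choices are mine):

def format_apa(result):
    """Build a one-line APA-style summary from complete_anova output."""
    p = result['p_value']
    p_text = f"p = {p:.3f}" if p >= 0.001 else "p < .001"
    return (f"F({result['df_between']}, {result['df_within']}) = "
            f"{result['f_statistic']:.2f}, {p_text}, "
            f"ω² = {result['omega_squared']:.3f}")

print(format_apa(result))  # reuses the result dict from the example above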

Common Mistakes

  • Reporting only F and p, no effect size
  • Using η² but calling it ω²
  • Not specifying which post-hoc test was used
  • Reporting post-hoc without significant omnibus test

When to Use Welch's ANOVA

Use Welch's ANOVA when:

  • Levene's test is significant (p < .05)
  • Variance ratio exceeds 3
  • You're uncertain about equal variances
  • As a default (with equal variances it costs very little power)

SciPy's alexandergovern implements the Alexander-Govern test, a close relative of Welch's ANOVA that likewise drops the equal-variance assumption:

from scipy.stats import alexandergovern

def alexander_govern_anova(groups):
    """Alexander-Govern test: compares means without assuming equal variances."""
    result = alexandergovern(*groups)
    return {
        'statistic': result.statistic,
        'p_value': result.pvalue
    }
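
Welch's ANOVA itself is short enough to write out from its defining formulas; here is a sketch (the function name and return format are my own choices):

import numpy as np
from scipy import stats

def welch_anova(groups):
    """Welch's one-way ANOVA: weight each group by n / variance."""
    k = len(groups)
    n = np.array([len(g) for g in groups])
    means = np.array([np.mean(g) for g in groups])
    variances = np.array([np.var(g, ddof=1) for g in groups])

    w = n / variances                      # precision weights
    grand_mean = np.sum(w * means) / np.sum(w)

    numerator = np.sum(w * (means - grand_mean)**2) / (k - 1)
    lam = np.sum((1 - w / np.sum(w))**2 / (n - 1))
    denominator = 1 + 2 * (k - 2) / (k**2 - 1) * lam

    f_stat = numerator / denominator
    df1, df2 = k - 1, (k**2 - 1) / (3 * lam)
    p_value = stats.f.sf(f_stat, df1, df2)
    return {'F': f_stat, 'df1': df1, 'df2': df2, 'p_value': p_value}

On the example data above, welch_anova([control, treatment_a, treatment_b]) should agree closely with the standard ANOVA, since those groups were generated with equal variances.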


Key Takeaway

ANOVA is robust to moderate assumption violations, especially with balanced designs. Focus on effect sizes (omega-squared) and confidence intervals rather than just p-values. When variances differ, use Welch's ANOVA. A complete report includes group means and SDs, F-statistic with degrees of freedom, p-value, effect size with interpretation, and post-hoc results identifying which groups differ.


Frequently Asked Questions

How robust is ANOVA to normality violations?

With equal or near-equal sample sizes and n > 15 per group, ANOVA is quite robust to non-normality. Skewness is more problematic than kurtosis. With unequal sample sizes, violations matter more.

What's the difference between eta-squared and omega-squared?

Eta-squared is the sample effect size (SS_between / SS_total). Omega-squared estimates the population effect size with a correction for bias, so it is preferred for inference.

Should I test assumptions before running ANOVA?

Check assumptions visually (histograms, Q-Q plots) rather than relying on formal tests alone. Formal tests are unreliable: with large samples they flag trivial violations, and with small samples they miss serious ones.
