Two-Group Comparisons

Welch's T-Test vs. Student's T-Test: Why You Should Always Use Welch's

A definitive comparison of Welch's and Student's t-tests. Learn why the equal variance assumption fails in practice and why Welch's should be your default.


Quick Hits

  • Welch's t-test is always safer—it matches Student's almost exactly when variances are equal and corrects for the difference when they're not
  • Student's t-test inflates Type I errors when the smaller group has larger variance
  • Testing for equal variance first doesn't help—it adds complexity without improving decisions
  • Modern statistical practice recommends Welch's as the default two-sample test

TL;DR

Use Welch's t-test, always. Student's t-test assumes equal variances between groups—an assumption that almost never holds exactly and that you can't reliably pre-test. When variances differ, Student's t-test can have dramatically inflated Type I error or reduced power. Welch's t-test handles both equal and unequal variances correctly, with negligible cost when variances happen to be equal.


The Core Difference

Both tests compare means between two independent groups. The difference is in how they estimate variance.

Student's T-Test

Pools variance across both groups:

$$s_p^2 = \frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1 + n_2 - 2}$$

Uses this pooled variance to compute the standard error, with $n_1 + n_2 - 2$ degrees of freedom:

$$SE = s_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$$

Assumption: Both groups have the same true variance ($\sigma_1^2 = \sigma_2^2$)
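To make the pooled formula concrete, here is a minimal sketch that computes the pooled variance and standard error by hand and checks the resulting t statistic against scipy (the `pooled_se` helper is illustrative, not a library function):

```python
import numpy as np
from scipy import stats

def pooled_se(x, y):
    """Pooled variance and standard error, as in Student's t-test."""
    n1, n2 = len(x), len(y)
    v1, v2 = np.var(x, ddof=1), np.var(y, ddof=1)          # sample variances
    sp2 = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)  # pooled variance
    se = np.sqrt(sp2) * np.sqrt(1 / n1 + 1 / n2)           # pooled standard error
    return sp2, se

rng = np.random.default_rng(1)
x, y = rng.normal(0, 1, 25), rng.normal(0, 1, 40)
_, se = pooled_se(x, y)
t_manual = (np.mean(x) - np.mean(y)) / se
t_scipy, _ = stats.ttest_ind(x, y, equal_var=True)
print(t_manual, t_scipy)  # identical up to floating-point error
```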

Welch's T-Test

Estimates variance separately for each group:

$$SE = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$$

Uses Satterthwaite approximation for degrees of freedom:

$$df = \frac{\left(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\right)^2}{\frac{(s_1^2/n_1)^2}{n_1-1} + \frac{(s_2^2/n_2)^2}{n_2-1}}$$

Assumption: None about variance equality
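Here is a corresponding sketch for Welch's formulas: the t statistic, the Satterthwaite degrees of freedom, and the two-sided p-value computed by hand, checked against scipy (`welch_t_manual` is an illustrative helper):

```python
import numpy as np
from scipy import stats

def welch_t_manual(x, y):
    """Welch's t statistic and Satterthwaite df from the formulas above."""
    n1, n2 = len(x), len(y)
    v1, v2 = np.var(x, ddof=1), np.var(y, ddof=1)
    se = np.sqrt(v1 / n1 + v2 / n2)  # separate-variance standard error
    t = (np.mean(x) - np.mean(y)) / se
    df = (v1 / n1 + v2 / n2) ** 2 / (
        (v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1)
    )
    return t, df

rng = np.random.default_rng(2)
x, y = rng.normal(0, 2, 20), rng.normal(0, 1, 50)
t, df = welch_t_manual(x, y)
p_manual = 2 * stats.t.sf(abs(t), df)
t_scipy, p_scipy = stats.ttest_ind(x, y, equal_var=False)
print(t, df, p_manual)  # t and p match scipy's Welch results
```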


What Goes Wrong with Unequal Variances

When variances differ and you use Student's t-test:

Case 1: Larger Variance in Smaller Group

Type I error rate inflates—you get more false positives than your α level. The pooled variance is weighted toward the larger group, so when that group has the smaller variance, the pooled standard error understates the true sampling variability.

import numpy as np
from scipy import stats

def simulate_type1_error(n1, n2, var1, var2, n_sims=10000):
    """Simulate Type I error rate under the null."""
    student_rejections = 0
    welch_rejections = 0

    for _ in range(n_sims):
        # Same mean (null is true)
        group1 = np.random.normal(0, np.sqrt(var1), n1)
        group2 = np.random.normal(0, np.sqrt(var2), n2)

        # Student's t-test
        _, p_student = stats.ttest_ind(group1, group2, equal_var=True)
        if p_student < 0.05:
            student_rejections += 1

        # Welch's t-test
        _, p_welch = stats.ttest_ind(group1, group2, equal_var=False)
        if p_welch < 0.05:
            welch_rejections += 1

    return student_rejections / n_sims, welch_rejections / n_sims


# Larger variance in smaller group
student_rate, welch_rate = simulate_type1_error(n1=20, n2=50, var1=4, var2=1)
print(f"Student's Type I error: {student_rate:.3f} (should be 0.05)")
print(f"Welch's Type I error: {welch_rate:.3f} (should be 0.05)")
# Student's will be ~0.08-0.09, Welch's will be ~0.05

Case 2: Larger Variance in Larger Group

Type I error is conservative—the test rejects less often than the nominal α. Here the pooled standard error is dominated by the larger, higher-variance group and is overstated, which also costs power under the alternative.

# Larger variance in larger group
student_rate, welch_rate = simulate_type1_error(n1=50, n2=20, var1=4, var2=1)
print(f"Student's Type I error: {student_rate:.3f}")
print(f"Welch's Type I error: {welch_rate:.3f}")
# Student's will be ~0.03, Welch's will be ~0.05

Why Not Test Variances First?

A common but flawed approach: test for equal variances (using Levene's test), then choose Student's or Welch's based on the result.

Problems with Two-Stage Testing

Low power for variance tests: With the sample sizes needed for t-tests, variance tests have poor power. You'll often fail to reject equal variances when they're meaningfully different.

Inflated error rates: The conditional procedure (if p > 0.05 use Student's, else use Welch's) has worse Type I error properties than always using Welch's.

Adds complexity for no benefit: You're introducing a decision point that can only hurt, never help.

from scipy.stats import levene

def simulate_twostage(n1, n2, var1, var2, n_sims=10000):
    """Simulate the two-stage testing procedure."""

    twostage_rejections = 0
    welch_only_rejections = 0

    for _ in range(n_sims):
        group1 = np.random.normal(0, np.sqrt(var1), n1)
        group2 = np.random.normal(0, np.sqrt(var2), n2)

        # Two-stage: test variance first
        _, p_levene = levene(group1, group2)
        if p_levene > 0.05:
            # "Equal" variances - use Student's
            _, p_ttest = stats.ttest_ind(group1, group2, equal_var=True)
        else:
            # Unequal - use Welch's
            _, p_ttest = stats.ttest_ind(group1, group2, equal_var=False)

        if p_ttest < 0.05:
            twostage_rejections += 1

        # Always Welch's
        _, p_welch = stats.ttest_ind(group1, group2, equal_var=False)
        if p_welch < 0.05:
            welch_only_rejections += 1

    return twostage_rejections / n_sims, welch_only_rejections / n_sims


two_stage, welch_only = simulate_twostage(n1=30, n2=30, var1=2, var2=1)
print(f"Two-stage Type I: {two_stage:.3f}")
print(f"Welch only Type I: {welch_only:.3f}")

Power Comparison

Student's t-test has slightly more power when variances are truly equal. How much?

def simulate_power(n1, n2, var1, var2, effect_size, n_sims=10000):
    """Compare power between Student's and Welch's."""
    student_power = 0
    welch_power = 0

    for _ in range(n_sims):
        group1 = np.random.normal(0, np.sqrt(var1), n1)
        group2 = np.random.normal(effect_size, np.sqrt(var2), n2)

        _, p_student = stats.ttest_ind(group1, group2, equal_var=True)
        _, p_welch = stats.ttest_ind(group1, group2, equal_var=False)

        if p_student < 0.05:
            student_power += 1
        if p_welch < 0.05:
            welch_power += 1

    return student_power / n_sims, welch_power / n_sims


# Equal variances - Student's should have slight edge
student_pwr, welch_pwr = simulate_power(30, 30, 1, 1, 0.5)
print(f"Equal variances - Student's power: {student_pwr:.3f}, Welch's: {welch_pwr:.3f}")

# Unequal variances - Welch's should be better
student_pwr, welch_pwr = simulate_power(30, 30, 2, 1, 0.5)
print(f"Unequal variances - Student's power: {student_pwr:.3f}, Welch's: {welch_pwr:.3f}")

The power advantage of Student's under equal variances is typically 1-2%—negligible compared to the risk of bias under unequal variances.


Implementation

Python

from scipy import stats

# Always use equal_var=False for Welch's
result = stats.ttest_ind(group1, group2, equal_var=False)

# Or use the more explicit version
from scipy.stats import ttest_ind
t_stat, p_value = ttest_ind(group1, group2, equal_var=False)

R

# R uses Welch's by default (good!)
t.test(group1, group2)

# To get Student's (don't do this)
t.test(group1, group2, var.equal = TRUE)

Reporting

Report results the same way regardless of which test you use:

We compared mean revenue between treatment (M = $42.50, SD = $28.30, n = 1,500) and control (M = $38.20, SD = $31.10, n = 1,520) groups using Welch's t-test. Treatment showed significantly higher revenue, t(2985.4) = 3.94, p < .001, d = 0.14, 95% CI for difference [$2.16, $6.44].
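Every number in that sentence can be computed directly from the two samples. A minimal sketch (the `welch_report` helper is illustrative, not a library function; Cohen's d here uses the pooled SD, which is one common convention among several):

```python
import numpy as np
from scipy import stats

def welch_report(x, y, alpha=0.05):
    """t, df, p, Cohen's d, and a CI for the mean difference (Welch)."""
    n1, n2 = len(x), len(y)
    m1, m2 = np.mean(x), np.mean(y)
    v1, v2 = np.var(x, ddof=1), np.var(y, ddof=1)
    se = np.sqrt(v1 / n1 + v2 / n2)
    df = (v1 / n1 + v2 / n2) ** 2 / (
        (v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1)
    )
    t = (m1 - m2) / se
    p = 2 * stats.t.sf(abs(t), df)
    # Cohen's d with pooled SD -- one common convention, not the only one
    sd_pooled = np.sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2))
    d = (m1 - m2) / sd_pooled
    crit = stats.t.ppf(1 - alpha / 2, df)
    ci = ((m1 - m2) - crit * se, (m1 - m2) + crit * se)
    return t, df, p, d, ci

rng = np.random.default_rng(3)
treat = rng.normal(42.5, 28.3, 1500)
ctrl = rng.normal(38.2, 31.1, 1520)
t, df, p, d, ci = welch_report(treat, ctrl)
print(f"t({df:.1f}) = {t:.2f}, p = {p:.4f}, d = {d:.2f}, 95% CI {ci}")
```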


Common Objections

"But I tested and variances are equal!"

Variance tests have low power. Failing to reject doesn't mean variances are equal—it means you can't prove they're different. And when variances are approximately equal, Welch's performs essentially identically to Student's anyway.
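That last point is easy to check empirically. A quick sketch, assuming truly equal variances and equal group sizes:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
a = rng.normal(0, 1, 40)
b = rng.normal(0, 1, 40)

# Same data, both tests: with similar variances the results barely differ
_, p_student = stats.ttest_ind(a, b, equal_var=True)
_, p_welch = stats.ttest_ind(a, b, equal_var=False)
print(p_student, p_welch)
```

With equal group sizes the two t statistics coincide exactly; only the degrees of freedom (and hence the p-value) differ, and only slightly.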

"Student's is more powerful"

By 1-2% when variances are exactly equal. Is that worth the risk of inflated Type I error when they're not? No.

"We've always used Student's"

Historical momentum isn't a statistical argument. Modern guidance from statisticians is clear: use Welch's.

"My software defaults to Student's"

Change the default or specify Welch's explicitly. Don't let software design dictate statistical practice.



Key Takeaway

Always use Welch's t-test for comparing two independent group means. Student's t-test requires equal variances—an assumption that rarely holds and that you can't reliably test. Welch's performs identically when variances are equal and correctly when they're not. There's no good reason to use Student's t-test anymore.



Frequently Asked Questions

When should I use Student's t-test over Welch's?
Almost never. The only theoretical advantage of Student's t-test is slightly more power when variances are truly equal, but this gain is tiny. The risk of bias when variances differ outweighs any benefit.
Should I test for equal variance before choosing?
No. This two-stage procedure (test variances, then choose t-test) has worse properties than simply using Welch's always. Variance tests have low power and the conditional decision inflates error rates.
Does the choice matter with large samples?
Less so, but Welch's is still preferred. With large samples, both tests converge, but Welch's is never worse and the habit of using it consistently prevents mistakes.
