Welch's T-Test vs. Student's T-Test: Why You Should Always Use Welch's
A definitive comparison of Welch's and Student's t-tests. Learn why the equal variance assumption fails in practice and why Welch's should be your default.
Quick Hits
- Welch's t-test is always safer—it equals Student's when variances are equal and corrects when they're not
- Student's t-test inflates Type I errors when the smaller group has larger variance
- Testing for equal variance first doesn't help—it adds complexity without improving decisions
- Modern statistical practice recommends Welch's as the default two-sample test
TL;DR
Use Welch's t-test, always. Student's t-test assumes equal variances between groups—an assumption that almost never holds exactly and that you can't reliably pre-test. When variances differ, Student's t-test can have dramatically inflated Type I error or reduced power. Welch's t-test handles both equal and unequal variances correctly, with negligible cost when variances happen to be equal.
The Core Difference
Both tests compare means between two independent groups. The difference is in how they estimate variance.
Student's T-Test
Pools variance across both groups:
$$s_p^2 = \frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1 + n_2 - 2}$$
Uses this pooled variance to compute standard error:
$$SE = s_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$$
Assumption: Both groups have the same true variance ($\sigma_1^2 = \sigma_2^2$)
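Translating the two formulas above into code is direct. A minimal sketch (the `pooled_se` helper name is mine, not a library function):

```python
import math

def pooled_se(s1, s2, n1, n2):
    """Standard error of the mean difference under Student's pooled-variance model."""
    # Pooled variance: weighted average of the two sample variances
    sp2 = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)
    return math.sqrt(sp2) * math.sqrt(1 / n1 + 1 / n2)

print(pooled_se(2, 2, 10, 10))  # two groups with SD 2, n = 10 each
```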
Welch's T-Test
Estimates variance separately for each group:
$$SE = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$$
Uses Satterthwaite approximation for degrees of freedom:
$$df = \frac{\left(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\right)^2}{\frac{(s_1^2/n_1)^2}{n_1-1} + \frac{(s_2^2/n_2)^2}{n_2-1}}$$
Assumption: None about variance equality
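The same sketch for Welch's standard error and the Satterthwaite degrees of freedom (again, `welch_se_df` is an illustrative helper):

```python
import math

def welch_se_df(s1, s2, n1, n2):
    """Welch standard error and Welch-Satterthwaite degrees of freedom."""
    v1, v2 = s1**2 / n1, s2**2 / n2   # per-group variance of the mean
    se = math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))
    return se, df

print(welch_se_df(2, 2, 10, 10))
```

With equal variances and equal sample sizes the formula collapses to Student's: the SE matches the pooled version and the df comes out to n1 + n2 - 2 (here, 18). That is the sense in which Welch's "equals Student's when variances are equal."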
What Goes Wrong with Unequal Variances
When variances differ and you use Student's t-test:
Case 1: Larger Variance in Smaller Group
Type I error rate inflates—you get more false positives than your α level.
```python
import numpy as np
from scipy import stats

def simulate_type1_error(n1, n2, var1, var2, n_sims=10000):
    """Simulate Type I error rate under the null."""
    student_rejections = 0
    welch_rejections = 0
    for _ in range(n_sims):
        # Same mean (null is true)
        group1 = np.random.normal(0, np.sqrt(var1), n1)
        group2 = np.random.normal(0, np.sqrt(var2), n2)
        # Student's t-test
        _, p_student = stats.ttest_ind(group1, group2, equal_var=True)
        if p_student < 0.05:
            student_rejections += 1
        # Welch's t-test
        _, p_welch = stats.ttest_ind(group1, group2, equal_var=False)
        if p_welch < 0.05:
            welch_rejections += 1
    return student_rejections / n_sims, welch_rejections / n_sims

# Larger variance in the smaller group
student_rate, welch_rate = simulate_type1_error(n1=20, n2=50, var1=4, var2=1)
print(f"Student's Type I error: {student_rate:.3f} (should be 0.05)")
print(f"Welch's Type I error: {welch_rate:.3f} (should be 0.05)")
# Student's will be ~0.08-0.09, Welch's will be ~0.05
```
Case 2: Larger Variance in Larger Group
Type I error is conservative—you get fewer rejections than you should, reducing power.
```python
# Larger variance in the larger group
student_rate, welch_rate = simulate_type1_error(n1=50, n2=20, var1=4, var2=1)
print(f"Student's Type I error: {student_rate:.3f}")
print(f"Welch's Type I error: {welch_rate:.3f}")
# Student's will be ~0.03, Welch's will be ~0.05
```
Why Not Test Variances First?
A common but flawed approach: test for equal variances (using Levene's test), then choose Student's or Welch's based on the result.
Problems with Two-Stage Testing
Low power for variance tests: With the sample sizes needed for t-tests, variance tests have poor power. You'll often fail to reject equal variances when they're meaningfully different.
Inflated error rates: The conditional procedure (if p > 0.05 use Student's, else use Welch's) has worse Type I error properties than always using Welch's.
Adds complexity for no benefit: You're introducing a decision point that can only hurt, never help.
```python
import numpy as np
from scipy import stats
from scipy.stats import levene

def simulate_twostage(n1, n2, var1, var2, n_sims=10000):
    """Simulate the two-stage testing procedure."""
    twostage_rejections = 0
    welch_only_rejections = 0
    for _ in range(n_sims):
        group1 = np.random.normal(0, np.sqrt(var1), n1)
        group2 = np.random.normal(0, np.sqrt(var2), n2)
        # Two-stage: test variances first
        _, p_levene = levene(group1, group2)
        if p_levene > 0.05:
            # "Equal" variances - use Student's
            _, p_ttest = stats.ttest_ind(group1, group2, equal_var=True)
        else:
            # Unequal - use Welch's
            _, p_ttest = stats.ttest_ind(group1, group2, equal_var=False)
        if p_ttest < 0.05:
            twostage_rejections += 1
        # Always Welch's
        _, p_welch = stats.ttest_ind(group1, group2, equal_var=False)
        if p_welch < 0.05:
            welch_only_rejections += 1
    return twostage_rejections / n_sims, welch_only_rejections / n_sims

two_stage, welch_only = simulate_twostage(n1=30, n2=30, var1=2, var2=1)
print(f"Two-stage Type I: {two_stage:.3f}")
print(f"Welch only Type I: {welch_only:.3f}")
```
Power Comparison
Student's t-test has slightly more power when variances are truly equal. How much?
```python
import numpy as np
from scipy import stats

def simulate_power(n1, n2, var1, var2, effect_size, n_sims=10000):
    """Compare power between Student's and Welch's."""
    student_power = 0
    welch_power = 0
    for _ in range(n_sims):
        group1 = np.random.normal(0, np.sqrt(var1), n1)
        group2 = np.random.normal(effect_size, np.sqrt(var2), n2)
        _, p_student = stats.ttest_ind(group1, group2, equal_var=True)
        _, p_welch = stats.ttest_ind(group1, group2, equal_var=False)
        if p_student < 0.05:
            student_power += 1
        if p_welch < 0.05:
            welch_power += 1
    return student_power / n_sims, welch_power / n_sims

# Equal variances - Student's should have a slight edge
student_pwr, welch_pwr = simulate_power(30, 30, 1, 1, 0.5)
print(f"Equal variances - Student's power: {student_pwr:.3f}, Welch's: {welch_pwr:.3f}")

# Unequal variances with unequal n (larger variance in the larger group) -
# Student's is conservative here, so Welch's retains more power
student_pwr, welch_pwr = simulate_power(50, 20, 4, 1, 0.5)
print(f"Unequal variances - Student's power: {student_pwr:.3f}, Welch's: {welch_pwr:.3f}")
```
The power advantage of Student's under equal variances is typically one to two percentage points, which is negligible compared to the risk of bias under unequal variances. (Note that with equal sample sizes the two tests behave almost identically even under unequal variances; the differences emerge when both sample sizes and variances differ.)
Implementation
Python
```python
from scipy import stats

# equal_var=False selects Welch's t-test
result = stats.ttest_ind(group1, group2, equal_var=False)

# Or unpack the statistic and p-value directly
from scipy.stats import ttest_ind
t_stat, p_value = ttest_ind(group1, group2, equal_var=False)
```
R
```r
# R uses Welch's by default (good!)
t.test(group1, group2)

# To get Student's (don't do this)
t.test(group1, group2, var.equal = TRUE)
```
Reporting
Report results the same way regardless of which test you use:
We compared mean revenue between treatment (M = $42.50, SD = $28.30, n = 1,500) and control (M = $38.20, SD = $31.10, n = 1,520) groups using Welch's t-test. Treatment showed significantly higher revenue, t(2985.4) = 3.94, p < .001, d = 0.14, 95% CI for difference [$2.16, $6.44].
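Numbers like these can be recomputed from the summary statistics alone. A sketch (`welch_from_summary` is a hypothetical helper; the results differ slightly from the reported t and df because the published means and SDs are rounded):

```python
import math

def welch_from_summary(m1, s1, n1, m2, s2, n2):
    """Welch t, Satterthwaite df, and Cohen's d from summary statistics alone."""
    v1, v2 = s1**2 / n1, s2**2 / n2       # per-group variance of the mean
    se = math.sqrt(v1 + v2)               # Welch standard error
    t = (m1 - m2) / se
    df = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))
    # Cohen's d with a pooled SD, the usual reporting convention
    sp = math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
    d = (m1 - m2) / sp
    return t, df, d

t, df, d = welch_from_summary(42.50, 28.30, 1500, 38.20, 31.10, 1520)
print(f"t({df:.1f}) = {t:.2f}, d = {d:.2f}")
```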
Common Objections
"But I tested and variances are equal!"
Variance tests have low power. Failing to reject doesn't mean variances are equal; it means you couldn't prove they're different. And when variances are approximately equal, Welch's performs nearly identically to Student's anyway.
"Student's is more powerful"
By about one to two percentage points, and only when variances are exactly equal. That small gain isn't worth the risk of inflated Type I error when they're not.
"We've always used Student's"
Historical momentum isn't a statistical argument. Modern guidance from statisticians is clear: use Welch's.
"My software defaults to Student's"
Change the default or specify Welch's explicitly. Don't let software design dictate statistical practice.
Related Methods
- Picking the Right Test to Compare Two Groups — The complete decision framework
- Comparing Variances: Levene, Bartlett, and F-Test — If you do need to test variances
- Equal Variance and Welch's T-Test: When It Matters — Deeper dive on the assumption
Key Takeaway
Always use Welch's t-test for comparing two independent group means. Student's t-test requires equal variances—an assumption that rarely holds and that you can't reliably test. Welch's performs identically when variances are equal and correctly when they're not. There's no good reason to use Student's t-test anymore.
References
- https://www.jstor.org/stable/2682923
- https://link.springer.com/article/10.3758/BF03193146
- Ruxton, G. D. (2006). The unequal variance t-test is an underused alternative to Student's t-test and the Mann-Whitney U test. *Behavioral Ecology*, 17(4), 688-690.
- Delacre, M., Lakens, D., & Leys, C. (2017). Why psychologists should by default use Welch's t-test instead of Student's t-test. *International Review of Social Psychology*, 30(1), 92-101.
- Zimmerman, D. W. (2004). A note on preliminary tests of equality of variances. *British Journal of Mathematical and Statistical Psychology*, 57(1), 173-181.