Two-Group Comparisons

Picking the Right Test to Compare Two Groups: A Decision Framework

A comprehensive guide to choosing between t-tests, Mann-Whitney, bootstrap, and other methods when comparing two groups. Covers continuous, binary, and count data with practical decision trees.


Quick Hits

  • Welch's t-test is your default for continuous data—always use it over Student's t
  • Mann-Whitney tests 'tends to be larger' not 'has larger mean'—know the difference
  • Paired data requires paired tests—the wrong choice inflates variance dramatically
  • Bootstrap handles anything, but start simple and add complexity only when needed

TL;DR

Comparing two groups is the bread and butter of statistical analysis. Your default should be Welch's t-test for continuous data and two-proportion z-test for binary outcomes. Reach for alternatives when data is paired, heavily skewed, or when you need to test something other than means. This guide provides decision frameworks for every common scenario.


The Decision Framework

Before choosing a test, answer three questions:

  1. What type of data? Continuous, binary, count, ordinal?
  2. Are observations paired or independent? Same subjects, or different subjects?
  3. What are you testing? Means, medians, "tends to be larger," proportions?
What type of data?
├── Continuous (revenue, time, scores)
│   ├── Paired? → Paired t-test or Wilcoxon signed-rank
│   └── Independent? → Welch's t-test or Mann-Whitney
├── Binary (converted/didn't, clicked/didn't)
│   ├── Paired? → McNemar's test
│   └── Independent? → Z-test or Chi-square
├── Count (sessions, purchases, events)
│   └── Poisson or Negative Binomial methods
└── Ordinal (rankings, Likert scales)
    └── Mann-Whitney or ordinal regression

Continuous Data: The T-Test Family

Welch's T-Test: Your Default

For comparing means between two independent groups, Welch's t-test should be your default. It doesn't assume equal variances between groups—and variances are almost never equal.

from scipy import stats
import numpy as np

def welch_ttest(group1, group2, alternative='two-sided'):
    """
    Welch's t-test for comparing two group means.
    """
    stat, p_value = stats.ttest_ind(group1, group2, equal_var=False, alternative=alternative)

    # Effect size (Cohen's d with pooled SD)
    n1, n2 = len(group1), len(group2)
    pooled_std = np.sqrt(((n1-1)*np.var(group1, ddof=1) + (n2-1)*np.var(group2, ddof=1)) / (n1+n2-2))
    cohens_d = (np.mean(group2) - np.mean(group1)) / pooled_std

    # Confidence interval for difference
    diff = np.mean(group2) - np.mean(group1)
    se = np.sqrt(np.var(group1, ddof=1)/n1 + np.var(group2, ddof=1)/n2)
    # Welch-Satterthwaite degrees of freedom
    df = (np.var(group1, ddof=1)/n1 + np.var(group2, ddof=1)/n2)**2 / \
         ((np.var(group1, ddof=1)/n1)**2/(n1-1) + (np.var(group2, ddof=1)/n2)**2/(n2-1))
    t_crit = stats.t.ppf(0.975, df)
    ci = (diff - t_crit * se, diff + t_crit * se)

    return {
        't_statistic': stat,
        'p_value': p_value,
        'mean_difference': diff,
        'ci_95': ci,
        'cohens_d': cohens_d
    }
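
A quick usage sketch with made-up data (the group sizes and effect are hypothetical):

# Hypothetical example: session duration (minutes) for control vs. treatment
rng = np.random.default_rng(1)
group1 = rng.normal(loc=10.0, scale=2.0, size=200)   # control
group2 = rng.normal(loc=10.5, scale=3.0, size=180)   # treatment, larger variance

result = welch_ttest(group1, group2)
print(result['mean_difference'], result['ci_95'], result['p_value'])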

R Implementation

# Welch's t-test (default in R)
t.test(group2, group1)

# With equal variance assumption (don't use this)
t.test(group2, group1, var.equal = TRUE)

Why Not Student's T-Test?

Student's t-test assumes equal variances. When variances differ:

  • Type I error inflates when the larger variance is in the smaller group
  • Type I error is conservative when the larger variance is in the larger group

Welch's t-test handles both cases correctly and performs essentially identically to Student's when variances are equal. There's no reason to use Student's anymore.
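
If you want to see those error rates for yourself, here's a minimal simulation sketch (mine, not from the linked comparison): the null is true in both groups, but the smaller group has the larger variance.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_sims = 5000
alpha = 0.05
student_rejects = 0
welch_rejects = 0

for _ in range(n_sims):
    # Null is true: both groups have mean 0, but the small group has the big variance
    small_group = rng.normal(0, 10, size=20)
    large_group = rng.normal(0, 2, size=100)

    _, p_student = stats.ttest_ind(small_group, large_group, equal_var=True)
    _, p_welch = stats.ttest_ind(small_group, large_group, equal_var=False)

    student_rejects += p_student < alpha
    welch_rejects += p_welch < alpha

# Expect Student's rejection rate well above 0.05 and Welch's close to 0.05
print(f"Student's Type I error rate: {student_rejects / n_sims:.3f}")
print(f"Welch's Type I error rate:   {welch_rejects / n_sims:.3f}")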

For more on this comparison, see Welch's vs. Student's T-Test.


Paired vs. Independent Data

The distinction between paired and independent data is crucial—using the wrong test can dramatically inflate your variance estimate.

When Data Is Paired

Data is paired when:

  • Same subjects measured twice: Before/after, pre/post treatment
  • Matched subjects: Each treatment subject matched to a similar control
  • Natural pairs: Left/right eye, twins, matched geographic regions

The Paired T-Test

def paired_ttest(before, after):
    """
    Paired t-test using within-subject differences.
    """
    differences = np.array(after) - np.array(before)

    stat, p_value = stats.ttest_1samp(differences, 0)

    mean_diff = np.mean(differences)
    se = np.std(differences, ddof=1) / np.sqrt(len(differences))
    t_crit = stats.t.ppf(0.975, df=len(differences) - 1)
    ci = (mean_diff - t_crit * se, mean_diff + t_crit * se)

    return {
        't_statistic': stat,
        'p_value': p_value,
        'mean_difference': mean_diff,
        'ci_95': ci
    }

Why Pairing Matters

Consider comparing a weight loss intervention:

Without pairing (independent samples t-test):

  • Compares average weight in treatment vs. control
  • Variance includes between-person differences (some people weigh 150 lbs, others 250 lbs)

With pairing (paired t-test):

  • Compares weight change within each person
  • Variance only includes variation in weight changes
  • Often 5-10x more powerful

# Simulation showing power difference
np.random.seed(42)
n = 30

# People have different baseline weights (high between-person variance)
baseline_weights = np.random.normal(180, 40, n)

# Treatment causes 5 lb loss (small effect)
treatment_effect = -5
noise = np.random.normal(0, 3, n)

before = baseline_weights + np.random.normal(0, 3, n)
after = baseline_weights + treatment_effect + noise

# Independent test (wrong!) - treats them as different people
_, p_independent = stats.ttest_ind(before, after)

# Paired test (correct)
_, p_paired = stats.ttest_rel(before, after)

print(f"Independent t-test p-value: {p_independent:.4f}")
print(f"Paired t-test p-value: {p_paired:.4f}")
# Paired test will be much more significant

For diagnosing whether your data is paired, see Paired vs. Independent Data: A Diagnostic Checklist.


Non-Parametric Alternatives

When normality assumptions are questionable, non-parametric tests don't assume any particular distribution.

Mann-Whitney U Test

For independent samples, Mann-Whitney tests whether one group tends to have larger values (stochastic dominance).

def mann_whitney_analysis(group1, group2):
    """
    Mann-Whitney U test with effect size.
    """
    stat, p_value = stats.mannwhitneyu(group1, group2, alternative='two-sided')

    # Common language effect size: P(a value from group2 exceeds one from group1),
    # computed directly so it doesn't depend on which U statistic scipy returns (ties ignored)
    g1, g2 = np.asarray(group1), np.asarray(group2)
    prob_superiority = np.mean(g2[:, None] > g1[None, :])

    return {
        'u_statistic': stat,
        'p_value': p_value,
        'prob_group2_larger': prob_superiority
    }

Important: Mann-Whitney does NOT test whether means differ. Two groups can have identical means but clearly different Mann-Whitney results (if their shapes differ), and two groups can have different means yet look similar to Mann-Whitney (for example, when a few extreme values shift the mean without changing the rank ordering much).
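
Here's a small sketch with simulated data (my example, not the article's) where both groups have mean 1 but different shapes, so a mean comparison finds little while Mann-Whitney finds a lot:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
skewed = rng.exponential(scale=1.0, size=2000)          # mean 1, heavily right-skewed
symmetric = rng.normal(loc=1.0, scale=0.1, size=2000)   # mean 1, tight and symmetric

print(np.mean(skewed), np.mean(symmetric))  # both close to 1

# Welch's t-test compares means; the population means are identical, so any gap is noise
print(stats.ttest_ind(skewed, symmetric, equal_var=False).pvalue)

# Mann-Whitney asks which group tends to be larger; most exponential draws
# fall below 1, so the symmetric group "wins" and the p-value is tiny
print(stats.mannwhitneyu(skewed, symmetric, alternative='two-sided').pvalue)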

For the nuances, see Mann-Whitney U Test: What It Actually Tests.

Wilcoxon Signed-Rank Test

The paired counterpart of Mann-Whitney. It tests whether the within-pair differences are distributed symmetrically around zero:

def wilcoxon_analysis(before, after):
    """
    Wilcoxon signed-rank test for paired data.
    """
    stat, p_value = stats.wilcoxon(before, after, alternative='two-sided')

    return {
        'statistic': stat,
        'p_value': p_value
    }

Binary Outcomes

For comparing proportions (conversion rates, click rates), use tests designed for binary data.

Two-Proportion Z-Test

from statsmodels.stats.proportion import proportions_ztest, confint_proportions_2indep

def compare_proportions(successes1, n1, successes2, n2):
    """
    Compare two proportions with confidence interval.
    """
    stat, p_value = proportions_ztest([successes1, successes2], [n1, n2])

    p1, p2 = successes1/n1, successes2/n2
    diff = p2 - p1

    # Confidence interval for p2 - p1 (arguments ordered so the CI matches diff)
    ci_low, ci_high = confint_proportions_2indep(successes2, n2, successes1, n1)

    return {
        'p1': p1,
        'p2': p2,
        'difference': diff,
        'ci_95': (ci_low, ci_high),
        'p_value': p_value
    }
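
A usage sketch with made-up counts (1,000 users per arm, hypothetical conversion numbers):

# Hypothetical counts: 120/1000 conversions in control, 150/1000 in treatment
result = compare_proportions(successes1=120, n1=1000, successes2=150, n2=1000)
print(result['difference'])   # 0.03
print(result['ci_95'])        # roughly (0.00, 0.06)
print(result['p_value'])      # roughly 0.05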

Chi-Square Test

Equivalent to z-test for two groups, but extends to multiple groups:

from scipy.stats import chi2_contingency

# Create contingency table
table = [[successes1, n1 - successes1],
         [successes2, n2 - successes2]]

chi2, p_value, dof, expected = chi2_contingency(table, correction=False)
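
To make the "equivalent" claim concrete, here's a small check with hypothetical counts: for a 2x2 table, the chi-square statistic (without continuity correction) should equal the squared z statistic from the two-proportion test.

import numpy as np
from scipy.stats import chi2_contingency
from statsmodels.stats.proportion import proportions_ztest

# Hypothetical counts
successes1, n1 = 120, 1000
successes2, n2 = 150, 1000

table = [[successes1, n1 - successes1],
         [successes2, n2 - successes2]]

chi2, p_chi2, dof, expected = chi2_contingency(table, correction=False)
z, p_z = proportions_ztest([successes1, successes2], [n1, n2])

print(np.isclose(chi2, z**2))   # True: chi-square equals z squared
print(np.isclose(p_chi2, p_z))  # True: same two-sided p-value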

Fisher's Exact Test

For small samples or extreme proportions:

from scipy.stats import fisher_exact

table = [[successes1, n1 - successes1],
         [successes2, n2 - successes2]]

odds_ratio, p_value = fisher_exact(table)

Bootstrap Methods

When standard tests don't fit or you want distribution-free inference, bootstrap handles anything.

Bootstrap for Mean Difference

def bootstrap_comparison(group1, group2, n_bootstrap=10000, alpha=0.05):
    """
    Bootstrap comparison of two groups.
    Works for any statistic, not just means.
    """
    observed_diff = np.mean(group2) - np.mean(group1)

    # Bootstrap distribution
    diffs = []
    for _ in range(n_bootstrap):
        boot1 = np.random.choice(group1, size=len(group1), replace=True)
        boot2 = np.random.choice(group2, size=len(group2), replace=True)
        diffs.append(np.mean(boot2) - np.mean(boot1))

    diffs = np.array(diffs)

    # Percentile CI
    ci_low = np.percentile(diffs, 100 * alpha / 2)
    ci_high = np.percentile(diffs, 100 * (1 - alpha / 2))

    # Bootstrap p-value (permutation approach)
    combined = np.concatenate([group1, group2])
    null_diffs = []
    for _ in range(n_bootstrap):
        np.random.shuffle(combined)
        perm1 = combined[:len(group1)]
        perm2 = combined[len(group1):]
        null_diffs.append(np.mean(perm2) - np.mean(perm1))

    null_diffs = np.array(null_diffs)
    p_value = np.mean(np.abs(null_diffs) >= np.abs(observed_diff))

    return {
        'difference': observed_diff,
        'ci_95': (ci_low, ci_high),
        'p_value': p_value
    }
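
A usage sketch on simulated heavy-tailed "revenue" data (the lognormal parameters are made up):

# Hypothetical heavy-tailed revenue data
rng = np.random.default_rng(7)
control_revenue = rng.lognormal(mean=3.0, sigma=1.0, size=500)
treatment_revenue = rng.lognormal(mean=3.1, sigma=1.0, size=500)

result = bootstrap_comparison(control_revenue, treatment_revenue)
print(result['difference'], result['ci_95'], result['p_value'])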

When Bootstrap Excels

  • Heavy-tailed data (revenue, latency)
  • Complex statistics (medians, percentiles, ratios)
  • Small samples where CLT may not apply
  • When you want honest confidence intervals

For more on bootstrap with difficult data, see Bootstrap Confidence Intervals for Difference in Means.


Comparing Medians

Sometimes the median is more relevant than the mean—when outliers dominate or when "typical" is the business question.

Mood's Median Test

Tests whether groups have the same median:

from scipy.stats import median_test

stat, p_value, med, table = median_test(group1, group2)

Quantile Regression

More flexible approach that estimates conditional quantiles:

import statsmodels.formula.api as smf
import pandas as pd

# Create dataframe
df = pd.DataFrame({
    'value': np.concatenate([group1, group2]),
    'treatment': [0]*len(group1) + [1]*len(group2)
})

# Median regression
model = smf.quantreg('value ~ treatment', df)
result = model.fit(q=0.5)
print(result.summary())
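
The quick-reference table below lists bootstrap as the primary option for medians. Here's a minimal sketch (my own, assuming a simple percentile bootstrap) for the difference in medians:

import numpy as np

def bootstrap_median_diff(group1, group2, n_bootstrap=10000, alpha=0.05, seed=0):
    """
    Percentile bootstrap CI for median(group2) - median(group1).
    """
    rng = np.random.default_rng(seed)
    group1, group2 = np.asarray(group1), np.asarray(group2)

    diffs = np.empty(n_bootstrap)
    for i in range(n_bootstrap):
        boot1 = rng.choice(group1, size=len(group1), replace=True)
        boot2 = rng.choice(group2, size=len(group2), replace=True)
        diffs[i] = np.median(boot2) - np.median(boot1)

    point_estimate = np.median(group2) - np.median(group1)
    ci = (np.percentile(diffs, 100 * alpha / 2),
          np.percentile(diffs, 100 * (1 - alpha / 2)))
    return point_estimate, ci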

For more options, see Comparing Medians: Tests and Better Options.


Comparing Variances

Sometimes you care about spread, not just center.

Levene's Test

Robust test for equality of variances:

from scipy.stats import levene

stat, p_value = levene(group1, group2)

Brown-Forsythe Test

Uses median instead of mean—even more robust:

stat, p_value = levene(group1, group2, center='median')

For details, see Comparing Variances: Levene, Bartlett, and F-Test.


Handling Outliers

Outliers can dominate mean comparisons. Options:

Trimmed Mean Test

Remove extreme values from both tails:

from scipy.stats import trim_mean

def trimmed_ttest(group1, group2, proportiontocut=0.1):
    """
    T-test on trimmed means (10% from each tail by default).
    """
    tm1 = trim_mean(group1, proportiontocut)
    tm2 = trim_mean(group2, proportiontocut)

    # Yuen's trimmed t-test (the trim argument requires scipy >= 1.7)
    stat, p_value = stats.ttest_ind(group1, group2, trim=proportiontocut, equal_var=False)

    return {
        'trimmed_mean1': tm1,
        'trimmed_mean2': tm2,
        'difference': tm2 - tm1,
        'p_value': p_value
    }

Winsorized Test

Cap extreme values at percentiles:

from scipy.stats.mstats import winsorize

group1_wins = winsorize(group1, limits=[0.05, 0.05])  # Cap at 5th and 95th percentile
group2_wins = winsorize(group2, limits=[0.05, 0.05])

stat, p_value = stats.ttest_ind(group1_wins, group2_wins, equal_var=False)

For guidance, see Handling Outliers: Trimmed Means, Winsorization, and Robust Methods.


Quick Reference Table

| Situation | Recommended Test | Alternative |
| --- | --- | --- |
| Two independent groups, continuous | Welch's t-test | Bootstrap, Mann-Whitney |
| Two paired groups, continuous | Paired t-test | Wilcoxon signed-rank |
| Two groups, binary outcome | Z-test / Chi-square | Fisher's exact (small n) |
| Two paired groups, binary | McNemar's test | |
| Heavy-tailed data | Bootstrap | Trimmed mean, Winsorized |
| Ordinal data | Mann-Whitney | Ordinal regression |
| Testing medians | Bootstrap | Quantile regression |
| Testing variances | Levene's test | Brown-Forsythe |



Frequently Asked Questions

Q: Should I test for normality before choosing a test? A: No. Formal normality tests are unreliable—they reject trivial deviations with large samples and miss serious violations with small samples. Look at histograms, check for outliers, and use domain knowledge.

Q: Is it wrong to use the t-test on non-normal data? A: Usually not. The t-test is robust to non-normality with reasonable sample sizes (n > 30). Extreme skew or outliers matter more than technical non-normality.

Q: What if my two groups have very different sizes? A: Welch's t-test handles unequal sizes naturally. Power will be limited by the smaller group. Consider whether the imbalance reflects reality or a sampling issue.

Q: Can I use multiple tests and pick the one that gives significance? A: No—that's p-hacking. Choose your test before seeing results based on your data type and question. If you run multiple tests as sensitivity analyses, report all of them.


Key Takeaway

Start with Welch's t-test for continuous data and two-proportion z-test for binary data. These handle most situations correctly. Reach for bootstrap when distributions are extreme, Mann-Whitney when testing stochastic dominance rather than means, and paired tests when data is naturally paired. The "perfect" test matters less than understanding what question you're actually answering and reporting effect sizes with confidence intervals.


