
Choosing the Right Test for Conversion Rates: Z-Test, Chi-Square, or Fisher's Exact

A practical decision framework for selecting between z-test, chi-square test, and Fisher's exact test when comparing conversion rates in A/B experiments.


Quick Hits

  • The two-proportion z-test handles 90% of conversion rate experiments—it's your default choice
  • Fisher's exact test is only necessary for small samples (n < 1,000 per group) or extreme rates (< 1%)
  • Chi-square and z-test give identical results for two groups—z² = χ²
  • All three tests answer the same question, just with different computational approaches

TL;DR

Three tests dominate conversion rate comparisons: the two-proportion z-test, chi-square test, and Fisher's exact test. For typical A/B tests with thousands of users, they give nearly identical results. The z-test is your default. Fisher's exact handles small samples. Chi-square scales to multiple variants. This guide explains when each applies and why it often doesn't matter which you choose.


The Core Question All Three Tests Answer

You ran an A/B test. Control had 500 conversions out of 10,000 visitors (5.0%). Treatment had 550 conversions out of 10,000 visitors (5.5%). Is that 0.5 percentage point difference real, or just random noise?

All three tests formalize this question as a hypothesis test:

  • H₀ (null): Both groups have the same true conversion rate
  • H₁ (alternative): The conversion rates differ

The p-value tells you: if there were no true difference, how often would you see a gap this large or larger by chance alone?
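This definition is easy to make concrete by simulation. Here is a minimal sketch (assuming the data are simple Bernoulli trials, as in the example above): pool both groups to estimate the shared rate under H₀, simulate many null experiments, and count how often the simulated gap is at least as large as the observed one.

import numpy as np

rng = np.random.default_rng(42)

# Observed data from the example above
x1, n1 = 500, 10_000   # control: conversions, visitors
x2, n2 = 550, 10_000   # treatment: conversions, visitors
observed_gap = abs(x2 / n2 - x1 / n1)

# Under H0, both groups share one true rate; estimate it from the pooled data
p_pool = (x1 + x2) / (n1 + n2)

# Simulate 100,000 experiments where H0 is true
sims = 100_000
gap = np.abs(
    rng.binomial(n2, p_pool, sims) / n2 - rng.binomial(n1, p_pool, sims) / n1
)

# Fraction of null experiments with a gap this large or larger
print(f"Simulated p-value: {np.mean(gap >= observed_gap):.4f}")
# Should land near the analytic result of ~0.113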


Two-Proportion Z-Test: The Workhorse

The z-test is the most commonly used method for conversion rate experiments. It relies on the fact that, with large enough samples, the sampling distribution of the difference in proportions is approximately normal.

The Math

The test statistic is:

$$z = \frac{(p_1 - p_2)}{\sqrt{p_{pool}(1-p_{pool})(\frac{1}{n_1} + \frac{1}{n_2})}}$$

Where:

  • $p_1, p_2$ are the observed conversion rates
  • $p_{pool} = \frac{x_1 + x_2}{n_1 + n_2}$ is the pooled proportion
  • $n_1, n_2$ are the sample sizes

Python Implementation

from statsmodels.stats.proportion import proportions_ztest
import numpy as np

# Example: 500/10000 control vs 550/10000 treatment
conversions = np.array([500, 550])
sample_sizes = np.array([10000, 10000])

z_stat, p_value = proportions_ztest(conversions, sample_sizes, alternative='two-sided')

print(f"Z-statistic: {z_stat:.4f}")
print(f"P-value: {p_value:.4f}")
# Output: Z-statistic: -1.5852, P-value: 0.1129

R Implementation

# Two-proportion z-test in R
prop.test(c(500, 550), c(10000, 10000), correct = FALSE)

# Or using the formula directly
z_test_proportions <- function(x1, n1, x2, n2) {
  p1 <- x1 / n1
  p2 <- x2 / n2
  p_pool <- (x1 + x2) / (n1 + n2)

  z <- (p1 - p2) / sqrt(p_pool * (1 - p_pool) * (1/n1 + 1/n2))
  p_value <- 2 * pnorm(-abs(z))

  list(z = z, p_value = p_value)
}

z_test_proportions(500, 10000, 550, 10000)

When It Works

The z-test relies on the normal approximation to the binomial distribution. This approximation is valid when:

  • Each group has at least 5 successes AND 5 failures
  • Sample sizes are reasonably large (typically n > 30 per group)

For most A/B tests with thousands of users, these conditions are easily met.
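As a quick sanity check before reaching for the z-test, here is a small helper (hypothetical, not from any library) that encodes these two rules of thumb:

def normal_approx_ok(conversions: int, n: int) -> bool:
    """Rule-of-thumb check for the normal approximation in one group:
    at least 5 successes, at least 5 failures, and n > 30."""
    return conversions >= 5 and (n - conversions) >= 5 and n > 30

# The running example passes easily; a tiny pilot sample does not
print(normal_approx_ok(500, 10_000))  # True
print(normal_approx_ok(3, 40))        # False: only 3 successes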


Chi-Square Test: The Generalizable Sibling

The chi-square test compares observed frequencies to expected frequencies under the null hypothesis. For two groups, it's mathematically equivalent to the z-test (χ² = z²).

The Math

$$\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i}$$

Where $O_i$ is observed count and $E_i$ is expected count under the null.

For a 2×2 table (two groups, binary outcome), there's a single degree of freedom, and the chi-square statistic equals the square of the z-statistic.

Python Implementation

from scipy.stats import chi2_contingency
import numpy as np

# Create contingency table
#           Converted  Not Converted
# Control     500        9500
# Treatment   550        9450
table = np.array([[500, 9500],
                  [550, 9450]])

chi2, p_value, dof, expected = chi2_contingency(table, correction=False)

print(f"Chi-square: {chi2:.4f}")
print(f"P-value: {p_value:.4f}")
print(f"Degrees of freedom: {dof}")
# Output: Chi-square: 2.5129, P-value: 0.1129 (same as z-test!)

R Implementation

# Chi-square test in R
table <- matrix(c(500, 550, 9500, 9450), nrow = 2)
chisq.test(table, correct = FALSE)

When to Use Chi-Square Over Z-Test

Use chi-square when:

  • You're comparing 3 or more variants simultaneously
  • You want a single test for whether any group differs from the others
  • Your organization standardizes on chi-square for historical reasons

For two groups, z-test and chi-square are interchangeable.
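The equivalence is easy to verify numerically; a quick sketch using the same 2×2 table as above:

import numpy as np
from scipy.stats import chi2_contingency
from statsmodels.stats.proportion import proportions_ztest

table = np.array([[500, 9500], [550, 9450]])

z_stat, p_z = proportions_ztest([500, 550], [10000, 10000])
chi2, p_chi, _, _ = chi2_contingency(table, correction=False)

print(np.isclose(z_stat**2, chi2))  # True: chi-square is the z-statistic squared
print(np.isclose(p_z, p_chi))       # True: identical p-values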


Fisher's Exact Test: For Small Samples

When sample sizes are small, the normal approximation breaks down. Fisher's exact test calculates the exact probability of observing your data (or more extreme) under the null hypothesis.

Why "Exact"?

The z-test and chi-square use approximations. Fisher's test enumerates all possible tables with the same row and column totals and calculates exact probabilities using the hypergeometric distribution. No approximation needed.
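To make "exact" concrete, here is a minimal sketch that rebuilds the two-sided p-value by hand from the hypergeometric distribution. It uses the common two-sided definition (sum the probabilities of every table at most as likely as the observed one); treat it as an illustration of the mechanics rather than a replacement for scipy's implementation, which handles numerical tolerance more carefully.

from scipy.stats import hypergeom

def fisher_two_sided(x1, n1, x2, n2):
    """Two-sided Fisher p-value by direct hypergeometric enumeration.

    With all margins fixed, the top-left cell count follows a
    hypergeometric distribution; the p-value sums the probabilities
    of every table at most as likely as the observed one."""
    total, successes = n1 + n2, x1 + x2
    dist = hypergeom(total, successes, n1)
    p_obs = dist.pmf(x1)
    ks = range(max(0, successes - n2), min(successes, n1) + 1)
    # Small tolerance guards against floating-point ties
    return sum(dist.pmf(k) for k in ks if dist.pmf(k) <= p_obs * (1 + 1e-9))

# Matches scipy.stats.fisher_exact on the small-sample table below
print(f"{fisher_two_sided(15, 200, 25, 200):.4f}")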

Python Implementation

from scipy.stats import fisher_exact
import numpy as np

# Small sample example: 15/200 control vs 25/200 treatment
table = np.array([[15, 185],
                  [25, 175]])

odds_ratio, p_value = fisher_exact(table, alternative='two-sided')

print(f"Odds ratio: {odds_ratio:.4f}")
print(f"P-value: {p_value:.4f}")

R Implementation

# Fisher's exact test in R
table <- matrix(c(15, 25, 185, 175), nrow = 2)
fisher.test(table)

When to Use Fisher's Exact

Fisher's exact test is appropriate when:

  • Small samples: Fewer than ~1,000 users per group
  • Sparse cells: Any cell in the 2×2 table has fewer than 5 expected observations
  • Extreme rates: Conversion rate is very low (< 1%) or very high (> 99%)
  • You want certainty: In regulatory contexts where approximations are questioned

The Computational Tradeoff

Fisher's exact test computes exact probabilities, which is computationally expensive for large samples. With 10,000 users per group, calculation can be slow. The z-test gives the same answer instantly.
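If you want to see the difference on your own data, here is a quick timing sketch (results depend heavily on your machine and scipy version):

import time
import numpy as np
from scipy.stats import fisher_exact
from statsmodels.stats.proportion import proportions_ztest

table = np.array([[500, 9500], [550, 9450]])

start = time.perf_counter()
fisher_exact(table)
print(f"Fisher's exact: {time.perf_counter() - start:.6f}s")

start = time.perf_counter()
proportions_ztest([500, 550], [10000, 10000])
print(f"Z-test:         {time.perf_counter() - start:.6f}s")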


Decision Framework

Here's when to use each test:

  • Large sample (n > 1,000 per group), typical rates (1-50%): Z-test
  • Two groups, any large sample: Z-test or chi-square (identical results)
  • Three or more variants: Chi-square
  • Small sample (n < 1,000 per group): Fisher's exact
  • Very low rate (< 1%) or sparse cells: Fisher's exact
  • Regulatory/high-stakes context: Fisher's exact

A Practical Algorithm

The same decision logic as a small, runnable Python helper:

def recommend_test(min_expected_cell: float, n_per_group: int, n_groups: int) -> str:
    """Recommend a test per the framework above."""
    if min_expected_cell < 5 or n_per_group < 1000:
        return "Fisher's exact test"
    if n_groups > 2:
        return "Chi-square test"
    return "Z-test (or chi-square; they're equivalent for two groups)"

print(recommend_test(min_expected_cell=500, n_per_group=10_000, n_groups=2))
# Z-test (or chi-square; they're equivalent for two groups)

Why It Often Doesn't Matter

For typical A/B tests—thousands of users, conversion rates between 1% and 50%—all three tests give nearly identical p-values:

from scipy.stats import chi2_contingency, fisher_exact
from statsmodels.stats.proportion import proportions_ztest
import numpy as np

# Large sample example
table = np.array([[500, 9500], [550, 9450]])

# Z-test
z_stat, p_z = proportions_ztest([500, 550], [10000, 10000])

# Chi-square
chi2, p_chi, _, _ = chi2_contingency(table, correction=False)

# Fisher's exact
_, p_fisher = fisher_exact(table)

print(f"Z-test p-value: {p_z:.6f}")
print(f"Chi-square p-value: {p_chi:.6f}")
print(f"Fisher's exact p-value: {p_fisher:.6f}")

# Z-test and chi-square agree exactly (~0.1129); Fisher's exact is very close

The differences only emerge at the edges: tiny samples, extreme rates, or sparse cells.


Common Mistakes

Using Fisher's Exact for Large Samples

Fisher's exact test is conservative (slightly higher p-values) compared to z-test for large samples. More importantly, it's computationally wasteful. If you have 50,000 users per group, just use the z-test.

Applying Continuity Correction by Default

Some implementations (like prop.test in R) apply Yates' continuity correction by default. This makes the test more conservative, which may reduce power unnecessarily for large samples. Be explicit about whether you want the correction.
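The same default exists in Python: scipy's chi2_contingency applies Yates' correction to 2×2 tables unless you turn it off. A quick demonstration of making the choice explicit:

from scipy.stats import chi2_contingency
import numpy as np

table = np.array([[500, 9500], [550, 9450]])

# correction=True is scipy's default for 2x2 tables (Yates)
_, p_corrected, _, _ = chi2_contingency(table, correction=True)
_, p_uncorrected, _, _ = chi2_contingency(table, correction=False)

print(f"With Yates' correction:   {p_corrected:.4f}")
print(f"Without (matches z-test): {p_uncorrected:.4f}")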

Confusing Chi-Square for Multiple Comparisons

Chi-square tells you whether any group differs from the others—it doesn't tell you which groups differ. If you reject the null with 3+ groups, you need post-hoc tests (like pairwise comparisons with correction) to identify which specific groups differ.
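Here is a minimal sketch of that post-hoc step, assuming three variants and simple Bonferroni-adjusted pairwise z-tests (the counts are illustrative):

from itertools import combinations
from statsmodels.stats.proportion import proportions_ztest

# Illustrative counts for a three-variant experiment
conversions = {"control": 500, "variant_a": 550, "variant_b": 610}
visitors = {"control": 10_000, "variant_a": 10_000, "variant_b": 10_000}

pairs = list(combinations(conversions, 2))
alpha = 0.05 / len(pairs)  # Bonferroni: split alpha across the 3 comparisons

for g1, g2 in pairs:
    _, p = proportions_ztest(
        [conversions[g1], conversions[g2]],
        [visitors[g1], visitors[g2]],
    )
    verdict = "significant" if p < alpha else "not significant"
    print(f"{g1} vs {g2}: p = {p:.4f} -> {verdict} at alpha = {alpha:.4f}")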

Ignoring Effect Size

All three tests give p-values, but p-values don't measure effect size. Always report the actual conversion rates and confidence intervals alongside your p-value.


Confidence Intervals: The Better Output

Rather than obsessing over which test to use, focus on confidence intervals for the difference in proportions. A 95% CI gives more information than a binary significant/not significant decision.

from statsmodels.stats.proportion import confint_proportions_2indep

# 95% CI for the difference in proportions
ci_low, ci_upp = confint_proportions_2indep(
    count1=500, nobs1=10000,
    count2=550, nobs2=10000,
    method='wald'
)

print(f"95% CI for difference: [{ci_low:.4f}, {ci_upp:.4f}]")
# Output: approximately [-0.0112, 0.0012]
# CI includes 0, consistent with p > 0.05

A confidence interval of [-1.1%, +0.1%] tells you far more than "p = 0.11."



Frequently Asked Questions

Q: I've seen prop.test in R give different results than statsmodels in Python. Why? A: R's prop.test applies Yates' continuity correction by default. Use correct = FALSE to match Python's default behavior.

Q: Should I use one-tailed or two-tailed tests? A: Use two-tailed. In product experimentation, you should care if a change makes things worse, not just whether it makes things better.

Q: My sample sizes are unequal. Does that matter? A: No. All three tests handle unequal sample sizes correctly in their variance calculations.

Q: What about Barnard's exact test? A: Barnard's test is an exact test that's often more powerful than Fisher's. It's less commonly used because it's computationally intensive and results are very similar to Fisher's for most practical cases.


Key Takeaway

For most conversion rate experiments with reasonable sample sizes, use the two-proportion z-test. Reserve Fisher's exact test for small samples or extreme rates. Use chi-square when comparing more than two variants simultaneously. In practice, the choice rarely affects conclusions—what matters more is proper experiment design and honest interpretation.



