When Confidence Intervals and P-Values Seem to Disagree
Understand why CIs and p-values sometimes appear to conflict and how to resolve these apparent contradictions. Learn common scenarios and the correct interpretation.
Quick Hits
- 95% CI excluding 0 ⟺ p < 0.05 for two-sided tests of difference = 0
- Apparent conflicts usually involve comparing the wrong things
- Overlapping group CIs ≠ non-significant difference
- One-sided p-values and two-sided CIs don't match directly
TL;DR
When p-values and confidence intervals seem to contradict each other, there's almost always a mismatch in what's being compared. The core rule: for two-sided tests with α = 0.05, a 95% CI that excludes 0 always means p < 0.05, and vice versa. Common "conflicts" involve overlapping group CIs (which don't imply non-significance), one-sided vs. two-sided comparisons, or different alpha levels.
The Fundamental Relationship
They're Mathematically Linked
import numpy as np
from scipy import stats

def demonstrate_linkage():
    """
    Show the mathematical link between p-values and CIs.
    """
    np.random.seed(42)
    # Generate data
    control = np.random.normal(100, 15, 50)
    treatment = np.random.normal(108, 15, 50)
    diff = np.mean(treatment) - np.mean(control)
    se = np.sqrt(np.var(control, ddof=1)/50 + np.var(treatment, ddof=1)/50)
    # Two-sided t-test (treatment first, so the sign matches diff)
    t_stat, p_value = stats.ttest_ind(treatment, control)
    # 95% CI for the difference (df = 50 + 50 - 2 = 98)
    t_crit = stats.t.ppf(0.975, 98)
    ci = (diff - t_crit * se, diff + t_crit * se)
    print("THE FUNDAMENTAL RELATIONSHIP")
    print("=" * 50)
    print()
    print(f"Difference: {diff:.2f}")
    print(f"95% CI: [{ci[0]:.2f}, {ci[1]:.2f}]")
    print(f"P-value (two-sided): {p_value:.4f}")
    print()
    # Check relationship
    ci_excludes_zero = ci[0] > 0 or ci[1] < 0
    p_significant = p_value < 0.05
    print("Checking relationship:")
    print(f"  95% CI excludes 0: {ci_excludes_zero}")
    print(f"  p < 0.05: {p_significant}")
    print()
    if ci_excludes_zero == p_significant:
        print("✓ They AGREE (as they always must)")
    else:
        print("⚠ Something is wrong (this should never happen)")

demonstrate_linkage()
Why They Must Agree
def explain_equivalence():
    """
    Explain why 95% CI and p = 0.05 are equivalent.
    """
    print("WHY 95% CI AND p = 0.05 ARE EQUIVALENT")
    print("=" * 50)
    print()
    print("The 95% CI contains all values θ where:")
    print("  p-value for testing H₀: μ = θ would be ≥ 0.05")
    print()
    print("Therefore:")
    print("  • If 0 is OUTSIDE the CI → p < 0.05 for H₀: μ = 0")
    print("  • If 0 is INSIDE the CI  → p ≥ 0.05 for H₀: μ = 0")
    print()
    print("The CI is an INVERTED hypothesis test:")
    print("  It's the set of null values you would NOT reject")
    print()
    print("Similarly:")
    print("  99% CI ↔ p = 0.01")
    print("  90% CI ↔ p = 0.10")

explain_equivalence()
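You can verify the inversion numerically: test H₀: μ = θ over a grid of candidate values and keep every θ that is not rejected at α = 0.05. The retained set should reproduce the 95% CI up to grid resolution. A minimal sketch (the data values here are illustrative):

import numpy as np
from scipy import stats

def invert_test_demo():
    np.random.seed(42)
    data = np.random.normal(103, 10, 40)
    mean, n = np.mean(data), len(data)
    se = np.std(data, ddof=1) / np.sqrt(n)
    t_crit = stats.t.ppf(0.975, n - 1)
    ci = (mean - t_crit * se, mean + t_crit * se)
    # Test H0: mu = theta over a grid and collect the non-rejected values
    thetas = np.linspace(mean - 5 * se, mean + 5 * se, 2001)
    p_vals = 2 * (1 - stats.t.cdf(np.abs((mean - thetas) / se), n - 1))
    retained = thetas[p_vals >= 0.05]
    print(f"95% CI:                 [{ci[0]:.3f}, {ci[1]:.3f}]")
    print(f"Non-rejected theta set: [{retained.min():.3f}, {retained.max():.3f}]")

invert_test_demo()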
Common "Conflicts" (That Aren't Really Conflicts)
Scenario 1: Overlapping Group CIs
The most common source of confusion.
def overlapping_ci_scenario():
    """
    Show why overlapping group CIs can still show a significant difference.
    """
    np.random.seed(42)
    n = 50
    control = np.random.normal(100, 15, n)
    treatment = np.random.normal(107, 15, n)
    # Individual group CIs (t critical value, df = n - 1)
    def group_ci(data):
        m = np.mean(data)
        se = np.std(data, ddof=1) / np.sqrt(len(data))
        t_crit = stats.t.ppf(0.975, len(data) - 1)
        return (m - t_crit*se, m + t_crit*se)
    control_ci = group_ci(control)
    treatment_ci = group_ci(treatment)
    # CI for the difference (df = 2n - 2, so it matches the t-test exactly)
    diff = np.mean(treatment) - np.mean(control)
    se_diff = np.sqrt(np.var(control, ddof=1)/n + np.var(treatment, ddof=1)/n)
    t_crit = stats.t.ppf(0.975, 2*n - 2)
    diff_ci = (diff - t_crit*se_diff, diff + t_crit*se_diff)
    # P-value
    _, p = stats.ttest_ind(treatment, control)
    print("OVERLAPPING GROUP CIs ≠ NON-SIGNIFICANT DIFFERENCE")
    print("=" * 60)
    print()
    print("Individual Group CIs:")
    print(f"  Control:   [{control_ci[0]:.1f}, {control_ci[1]:.1f}]")
    print(f"  Treatment: [{treatment_ci[0]:.1f}, {treatment_ci[1]:.1f}]")
    print()
    # Check overlap
    overlap = control_ci[1] > treatment_ci[0] and treatment_ci[1] > control_ci[0]
    print(f"Do the group CIs overlap? {overlap}")
    print()
    print("CI for the DIFFERENCE:")
    print(f"  Difference: {diff:.1f}")
    print(f"  95% CI: [{diff_ci[0]:.1f}, {diff_ci[1]:.1f}]")
    print(f"  Excludes 0: {diff_ci[0] > 0 or diff_ci[1] < 0}")
    print()
    print(f"P-value: {p:.4f}")
    print(f"Significant: {p < 0.05}")
    print()
    print("KEY INSIGHT:")
    print("  The SE of a difference is SMALLER than the sum of the")
    print("  individual SEs, which is what eyeballing the group CIs suggests.")
    print()
    print("  SE_diff = √(SE₁² + SE₂²)")
    print("  NOT       SE₁ + SE₂")
    print()
    print("  So the difference can be significant even with overlapping group CIs!")

overlapping_ci_scenario()
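The algebra behind this insight is worth spelling out. Assuming both groups have the same standard error s and using the normal approximation, the group CIs overlap whenever |diff| < 2 × 1.96s ≈ 3.92s, while the difference is significant once |diff| > 1.96 × √2 × s ≈ 2.77s. So any difference between roughly 2.77s and 3.92s is significant despite overlapping group CIs:

# A small sketch of the algebra, assuming equal group SEs and the
# normal approximation (1.96 in place of the exact t critical value).
overlap_limit = 2 * 1.96        # group CIs overlap while |diff| < 3.92 * SE
signif_limit = 1.96 * 2 ** 0.5  # difference significant once |diff| > 2.77 * SE
print(f"Group CIs overlap while |diff| < {overlap_limit:.2f} x group SE")
print(f"Difference significant once |diff| > {signif_limit:.2f} x group SE")
print(f"Both happen for |diff| between {signif_limit:.2f} and {overlap_limit:.2f} x group SE")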
Scenario 2: One-Sided vs. Two-Sided
def one_vs_two_sided():
    """
    Show the mismatch between a one-sided p-value and a two-sided CI.
    """
    np.random.seed(42)
    n = 30
    control = np.random.normal(100, 15, n)
    treatment = np.random.normal(106, 15, n)
    diff = np.mean(treatment) - np.mean(control)
    se = np.sqrt(np.var(control, ddof=1)/n + np.var(treatment, ddof=1)/n)
    # Two-sided test and CI (df = 2n - 2)
    t_stat = diff / se
    p_two = 2 * (1 - stats.t.cdf(abs(t_stat), 2*n - 2))
    t_crit = stats.t.ppf(0.975, 2*n - 2)
    ci_95 = (diff - t_crit*se, diff + t_crit*se)
    # One-sided test (H₁: treatment > control)
    p_one = 1 - stats.t.cdf(t_stat, 2*n - 2)
    # One-sided 95% lower bound (= lower limit of the two-sided 90% CI)
    ci_90_lower = diff - stats.t.ppf(0.95, 2*n - 2) * se
    print("ONE-SIDED vs. TWO-SIDED")
    print("=" * 50)
    print()
    print(f"Difference: {diff:.2f}")
    print()
    print("TWO-SIDED:")
    print(f"  95% CI: [{ci_95[0]:.2f}, {ci_95[1]:.2f}]")
    print(f"  P-value: {p_two:.4f}")
    print(f"  Significant at α = 0.05: {p_two < 0.05}")
    print()
    print("ONE-SIDED (H₁: treatment > control):")
    print(f"  P-value: {p_one:.4f}")
    print(f"  Significant at α = 0.05: {p_one < 0.05}")
    print(f"  95% lower bound: {ci_90_lower:.2f}")
    print()
    print("APPARENT CONFLICT?")
    print("  The 95% two-sided CI might include 0")
    print("  But the one-sided p-value is < 0.05")
    print()
    print("RESOLUTION:")
    print("  Two-sided 95% CI ↔ Two-sided p at α = 0.05")
    print("  One-sided p at α = 0.05 ↔ One-sided 95% bound")
    print("  (= lower/upper limit of the 90% two-sided CI)")

one_vs_two_sided()
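A quick way to convince yourself of the resolution: the one-sided 95% lower bound and the lower limit of the two-sided 90% CI use the same critical value, so they coincide exactly. A small check with illustrative diff/se/df values:

from scipy import stats

diff, se, df = 6.0, 3.2, 58
t_one = stats.t.ppf(0.95, df)             # one-sided 95% critical value
t_two_90 = stats.t.ppf(1 - 0.10 / 2, df)  # two-sided 90% critical value
print(f"One-sided 95% lower bound: {diff - t_one * se:.3f}")
print(f"Two-sided 90% CI lower:    {diff - t_two_90 * se:.3f}")  # identical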
Scenario 3: Different Alpha Levels
def different_alpha_levels():
    """
    Show a mismatch caused by different alpha levels.
    """
    np.random.seed(42)
    n = 50
    data = np.random.normal(105, 30, n)
    mean = np.mean(data)
    se = np.std(data, ddof=1) / np.sqrt(n)
    # Test against H₀: μ = 100
    t_stat = (mean - 100) / se
    p = 2 * (1 - stats.t.cdf(abs(t_stat), n - 1))
    # CIs at different levels (t critical values, so they match the t-based p)
    t90 = stats.t.ppf(0.95, n - 1)
    t95 = stats.t.ppf(0.975, n - 1)
    t99 = stats.t.ppf(0.995, n - 1)
    ci_90 = (mean - t90*se, mean + t90*se)
    ci_95 = (mean - t95*se, mean + t95*se)
    ci_99 = (mean - t99*se, mean + t99*se)
    print("DIFFERENT ALPHA LEVELS")
    print("=" * 50)
    print()
    print(f"Sample mean: {mean:.2f}")
    print(f"P-value (vs. H₀: μ = 100): {p:.4f}")
    print()
    print("CIs at different levels:")
    print(f"  90% CI: [{ci_90[0]:.2f}, {ci_90[1]:.2f}] (excludes 100: {ci_90[0] > 100 or ci_90[1] < 100})")
    print(f"  95% CI: [{ci_95[0]:.2f}, {ci_95[1]:.2f}] (excludes 100: {ci_95[0] > 100 or ci_95[1] < 100})")
    print(f"  99% CI: [{ci_99[0]:.2f}, {ci_99[1]:.2f}] (excludes 100: {ci_99[0] > 100 or ci_99[1] < 100})")
    print()
    print("MATCHING:")
    print(f"  p < 0.10: {p < 0.10} ↔ 90% CI excludes 100: {ci_90[0] > 100 or ci_90[1] < 100}")
    print(f"  p < 0.05: {p < 0.05} ↔ 95% CI excludes 100: {ci_95[0] > 100 or ci_95[1] < 100}")
    print(f"  p < 0.01: {p < 0.01} ↔ 99% CI excludes 100: {ci_99[0] > 100 or ci_99[1] < 100}")

different_alpha_levels()
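The matching rule can also be run in reverse: the confidence level at which the CI endpoint lands exactly on the null is 100 × (1 − p) percent. A sketch of this flip side (the mean, SE, and df here are illustrative):

from scipy import stats

def touching_level(mean, se, null, df):
    t_stat = (mean - null) / se
    p = 2 * (1 - stats.t.cdf(abs(t_stat), df))
    level = 1 - p  # the CI at this level has an endpoint exactly at the null
    t_crit = stats.t.ppf(1 - p / 2, df)
    ci = (mean - t_crit * se, mean + t_crit * se)
    print(f"p = {p:.4f}, so the {100 * level:.1f}% CI just touches {null}:")
    print(f"  [{ci[0]:.3f}, {ci[1]:.3f}]")

touching_level(mean=106.0, se=3.0, null=100, df=49)  # illustrative values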
Resolution Guide
def resolution_guide():
    """
    Guide for resolving apparent CI/p-value conflicts.
    """
    print("RESOLUTION GUIDE")
    print("=" * 60)
    scenarios = {
        "Group CIs overlap but difference is significant": {
            "why": "Comparing apples to oranges",
            "fix": "Look at the CI for the DIFFERENCE, not the individual group CIs"
        },
        "One-sided p < 0.05 but 95% CI includes null": {
            "why": "Mixing one-sided and two-sided",
            "fix": "Use a one-sided CI or a 90% two-sided CI for a one-sided test"
        },
        "P-value says significant but CI seems to include null": {
            "why": "Possibly different alpha levels",
            "fix": "Match CI confidence level to alpha (95% CI ↔ α = 0.05)"
        },
        "Different software gives different results": {
            "why": "Different approximations or methods",
            "fix": "Check what exact method each uses; prefer the same method for both"
        },
        "CI excludes null but effect seems trivial": {
            "why": "This is not a conflict - it's the significance/importance gap",
            "fix": "CI and p agree on significance; you're questioning practical significance"
        }
    }
    for scenario, resolution in scenarios.items():
        print(f"\nScenario: {scenario}")
        print(f"  Why: {resolution['why']}")
        print(f"  Fix: {resolution['fix']}")

resolution_guide()
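To make the "different software" row concrete: the pooled t-test and Welch's t-test are both legitimate but make different variance assumptions, so they can return slightly different p-values on the same data. A small comparison using SciPy's equal_var flag (the data here are illustrative):

import numpy as np
from scipy import stats

np.random.seed(1)
a = np.random.normal(100, 10, 30)
b = np.random.normal(106, 20, 45)
_, p_pooled = stats.ttest_ind(a, b)                  # assumes equal variances
_, p_welch = stats.ttest_ind(a, b, equal_var=False)  # Welch's correction
print(f"Pooled t-test p: {p_pooled:.4f}")
print(f"Welch t-test p:  {p_welch:.4f}")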
Checking for True Agreement
def check_agreement(diff, se, null=0, alpha=0.05, sided='two'):
    """
    Check that CI and p-value agree (normal approximation).
    For sided='one', the alternative is H₁: effect > null.
    """
    # P-value
    z = (diff - null) / se
    if sided == 'two':
        p = 2 * (1 - stats.norm.cdf(abs(z)))
        z_crit = stats.norm.ppf(1 - alpha/2)
    else:
        p = 1 - stats.norm.cdf(z)
        z_crit = stats.norm.ppf(1 - alpha)
    # CI
    if sided == 'two':
        ci = (diff - z_crit * se, diff + z_crit * se)
        ci_excludes_null = ci[0] > null or ci[1] < null
    else:
        ci_lower = diff - z_crit * se
        ci = (ci_lower, float('inf'))
        ci_excludes_null = ci_lower > null
    # Agreement check
    p_significant = p < alpha
    upper = f"{ci[1]:.3f}" if ci[1] != float('inf') else '∞'
    print("AGREEMENT CHECK")
    print("-" * 40)
    print(f"Null hypothesis value: {null}")
    print(f"Alpha level: {alpha}")
    print(f"Test type: {sided}-sided")
    print()
    print(f"Effect: {diff:.3f}")
    print(f"Standard Error: {se:.3f}")
    print()
    print(f"P-value: {p:.4f}")
    print(f"  Significant (p < {alpha}): {p_significant}")
    print()
    print(f"CI: [{ci[0]:.3f}, {upper}]")
    print(f"  Excludes {null}: {ci_excludes_null}")
    print()
    if p_significant == ci_excludes_null:
        print("✓ AGREEMENT: CI and p-value tell the same story")
    else:
        print("⚠ DISAGREEMENT: Check your calculations!")
    return {'p': p, 'ci': ci, 'agree': p_significant == ci_excludes_null}

# Example
check_agreement(diff=5.2, se=2.1, null=0, alpha=0.05, sided='two')
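For a one-sided check, pair the one-sided p-value with a one-sided bound rather than a two-sided 95% CI (the effect and SE below are illustrative):

# One-sided usage (H₁: effect > null)
check_agreement(diff=3.0, se=1.7, null=0, alpha=0.05, sided='one')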
Common Pitfalls
def common_pitfalls():
    """
    Common mistakes that create apparent conflicts.
    """
    print("COMMON PITFALLS")
    print("=" * 50)
    pitfalls = {
        "Eyeballing overlapping CIs": {
            "mistake": "Assuming overlap means not significant",
            "reality": "CIs can overlap and the difference still be significant",
            "prevention": "Always calculate the CI for the difference directly"
        },
        "Using different software": {
            "mistake": "Getting the p-value from one tool, the CI from another",
            "reality": "Methods may differ (exact vs. approximate, etc.)",
            "prevention": "Use the same tool/method for both"
        },
        "Rounding errors": {
            "mistake": "CI just touching 0, p-value just under 0.05",
            "reality": "Rounding can make them seem to disagree",
            "prevention": "Use more decimal places for comparison"
        },
        "Wrong CI for the question": {
            "mistake": "Testing a difference but looking at individual CIs",
            "reality": "You need the CI for what you're testing",
            "prevention": "Match the CI to the parameter in your hypothesis"
        }
    }
    for name, details in pitfalls.items():
        print(f"\n{name}:")
        print(f"  Mistake: {details['mistake']}")
        print(f"  Reality: {details['reality']}")
        print(f"  Prevention: {details['prevention']}")

common_pitfalls()
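The rounding pitfall is easy to reproduce. With made-up numbers chosen so the lower CI limit is 0.004, a two-decimal report shows a CI touching 0 even though the unrounded p-value is below 0.05:

from scipy import stats

diff, se = 1.964, 1.0
z = diff / se
p = 2 * (1 - stats.norm.cdf(abs(z)))
ci = (diff - 1.96 * se, diff + 1.96 * se)
print(f"Rounded CI:   [{ci[0]:.2f}, {ci[1]:.2f}]")  # looks like [0.00, 3.92]
print(f"Unrounded CI: [{ci[0]:.4f}, {ci[1]:.4f}]")  # actually excludes 0
print(f"P-value: {p:.4f}")                          # just under 0.05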
R Implementation
# Checking CI/p-value agreement in R
check_agreement <- function(diff, se, null = 0, alpha = 0.05) {
  cat("AGREEMENT CHECK\n")
  cat(strrep("=", 40), "\n\n")
  # P-value (two-sided)
  z <- (diff - null) / se
  p <- 2 * (1 - pnorm(abs(z)))
  # CI at the matching confidence level
  z_crit <- qnorm(1 - alpha / 2)
  ci_low <- diff - z_crit * se
  ci_high <- diff + z_crit * se
  # Check
  p_sig <- p < alpha
  ci_excludes <- (ci_low > null) | (ci_high < null)
  cat(sprintf("Effect: %.3f (SE: %.3f)\n", diff, se))
  cat(sprintf("P-value: %.4f (sig at alpha = %.2f: %s)\n", p, alpha, p_sig))
  cat(sprintf("%.0f%% CI: [%.3f, %.3f] (excludes %.1f: %s)\n",
              100 * (1 - alpha), ci_low, ci_high, null, ci_excludes))
  cat("\n")
  if (p_sig == ci_excludes) {
    cat("✓ AGREEMENT\n")
  } else {
    cat("⚠ DISAGREEMENT - check calculations!\n")
  }
}

# Usage:
# check_agreement(diff = 5.2, se = 2.1)
Related Methods
- Effect Sizes Master Guide — The pillar article
- P-Values vs CIs — Understanding both
- Practical Significance — Beyond significance
Key Takeaway
When CIs and p-values seem to disagree, it's almost always because you're comparing mismatched quantities: overlapping group CIs (when you need the CI for the difference), one-sided tests with two-sided CIs, or different alpha levels. For two-sided tests at α = 0.05, a 95% CI that excludes the null ALWAYS corresponds to p < 0.05. When in doubt, compute both using the same method and verify they match—if they don't, something's wrong with your calculation.
References
- https://doi.org/10.1136/bmj.d2304
- https://www.jstor.org/stable/2683359
- Schenker, N., & Gentleman, J. F. (2001). On judging the significance of differences by examining the overlap between confidence intervals. *The American Statistician*, 55(3), 182-186.
- Cumming, G., & Finch, S. (2005). Inference by eye: Confidence intervals and how to read pictures of data. *American Psychologist*, 60(2), 170-180.
- Greenland, S., et al. (2016). Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations. *European Journal of Epidemiology*, 31(4), 337-350.
Frequently Asked Questions
Can a 95% CI exclude 0 but p > 0.05?
Not for a two-sided test of difference = 0 computed by the same method at α = 0.05; the two are mathematically linked. If they appear to disagree, check for mismatched methods, alpha levels, or one-sided vs. two-sided comparisons.

Why do overlapping confidence intervals sometimes show significant difference?
Because the SE of a difference is √(SE₁² + SE₂²), not SE₁ + SE₂. Group CIs can overlap substantially while the CI for the difference still excludes 0.

What if I have a one-sided test but a two-sided CI?
Match them: a one-sided test at α = 0.05 corresponds to a one-sided 95% bound, which equals the relevant limit of a two-sided 90% CI.