P-Values vs. Confidence Intervals: How to Interpret Both for Decisions
Understand the relationship between p-values and confidence intervals, when they agree, when they seem to disagree, and how to use them together for better decisions.
Quick Hits
- P-values tell you probability of data under null; CIs tell you plausible parameter values
- 95% CI excludes 0 ⟺ p < 0.05 (for testing against null of 0)
- CIs give more information: direction, magnitude, and precision
- For decisions, CI bounds matter more than p-values
TL;DR
P-values tell you the probability of seeing your data (or more extreme) if the null hypothesis is true. Confidence intervals give you a range of plausible values for the true parameter. They're mathematically linked: a 95% CI that excludes zero corresponds to p < 0.05. But CIs are more informative for decisions because they show effect magnitude and precision, not just whether to reject the null.
What Each Tells You
P-Values
import numpy as np
from scipy import stats
def explain_p_value():
"""
Clarify what a p-value actually means.
"""
print("WHAT A P-VALUE TELLS YOU")
print("=" * 50)
print()
print("P-value = P(data this extreme or more | H₀ is true)")
print()
print("In plain English:")
print(" 'If there were truly no effect, what's the probability")
print(" of seeing results as extreme as what we observed?'")
print()
print("P < 0.05 means:")
print(" 'This would be surprising if H₀ were true'")
print(" 'We reject H₀ at the 0.05 level'")
print()
print("P-VALUE DOES NOT MEAN:")
print(" ✗ P(H₀ is true)")
print(" ✗ P(H₁ is true)")
print(" ✗ Probability the effect is real")
print(" ✗ Size of the effect")
print()
print("WHAT P-VALUES DON'T TELL YOU:")
print(" • How big the effect is")
print(" • Whether the effect matters practically")
print(" • The probability of replication")
explain_p_value()
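A quick way to internalize the definition: when H₀ is actually true, p-values are uniformly distributed, so about 5% of null experiments produce p < 0.05 by chance alone. A minimal simulation sketch (the sample sizes and the shared mean of 100 are illustrative assumptions):
def simulate_null_p_values(n_sims=10_000, n=50, seed=0):
    """Simulate two-sample t-tests when H0 (no difference) is true."""
    rng = np.random.default_rng(seed)
    p_values = np.empty(n_sims)
    for i in range(n_sims):
        a = rng.normal(100, 15, n)  # both groups share the same mean,
        b = rng.normal(100, 15, n)  # so H0 is true by construction
        p_values[i] = stats.ttest_ind(a, b).pvalue
    # Under H0 the p-value is ~Uniform(0, 1): about 5% fall below 0.05
    print(f"Fraction of null experiments with p < 0.05: {np.mean(p_values < 0.05):.3f}")
simulate_null_p_values()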
Confidence Intervals
def explain_confidence_interval():
"""
Clarify what a confidence interval means.
"""
print("WHAT A CONFIDENCE INTERVAL TELLS YOU")
print("=" * 50)
print()
print("95% CI: A range constructed such that if we repeated")
print("the study many times, 95% of such intervals would")
print("contain the true parameter value.")
print()
print("In practice:")
print(" 'We're 95% confident the true effect is in this range'")
print()
print("WHAT A CI TELLS YOU:")
print(" ✓ Plausible values for the true effect")
print(" ✓ Precision of the estimate (narrow = precise)")
print(" ✓ Whether the effect is significant (if 0 excluded)")
print(" ✓ Whether the effect might be practically important")
print()
print("CI DOES NOT MEAN:")
print(" ✗ 95% of the data falls in this range")
print(" ✗ 95% probability the true value is in THIS interval")
print(" (The true value is fixed; it's either in or out)")
explain_confidence_interval()
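The repeated-sampling definition above can be checked directly: draw many samples, build a 95% t-interval from each, and count how often the interval covers the truth. A minimal sketch (the true mean of 100, sd of 15, and simulation count are illustrative assumptions):
def simulate_ci_coverage(true_mean=100, sd=15, n=50, n_sims=10_000, seed=0):
    """Estimate how often a nominal 95% t-interval covers the true mean."""
    rng = np.random.default_rng(seed)
    t_crit = stats.t.ppf(0.975, df=n - 1)
    covered = 0
    for _ in range(n_sims):
        x = rng.normal(true_mean, sd, n)
        half_width = t_crit * x.std(ddof=1) / np.sqrt(n)
        covered += (x.mean() - half_width) <= true_mean <= (x.mean() + half_width)
    print(f"Observed coverage of nominal 95% CIs: {covered / n_sims:.3f}")
simulate_ci_coverage()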
The Mathematical Relationship
They're Two Sides of the Same Coin
def demonstrate_relationship():
"""
Show the mathematical link between p-values and CIs.
"""
np.random.seed(42)
# Generate data
control = np.random.normal(100, 15, 50)
treatment = np.random.normal(108, 15, 50)
diff = np.mean(treatment) - np.mean(control)
se = np.sqrt(np.var(control, ddof=1)/50 + np.var(treatment, ddof=1)/50)
# P-value (two-sided test against H₀: diff = 0)
t_stat = diff / se
df = 98  # n1 + n2 - 2; the Welch correction would give a similar value here
p_value = 2 * (1 - stats.t.cdf(abs(t_stat), df))
# 95% CI
t_crit = stats.t.ppf(0.975, df)
ci_low = diff - t_crit * se
ci_high = diff + t_crit * se
print("P-VALUE AND CI RELATIONSHIP")
print("=" * 50)
print()
print(f"Observed difference: {diff:.2f}")
print(f"Standard error: {se:.2f}")
print()
print(f"P-value: {p_value:.4f}")
print(f" → p {'<' if p_value < 0.05 else '≥'} 0.05")
print()
print(f"95% CI: [{ci_low:.2f}, {ci_high:.2f}]")
print(f" → 0 is {'NOT ' if ci_low > 0 or ci_high < 0 else ''}in CI")
print()
print("THE KEY RELATIONSHIP:")
print(" 95% CI excludes 0 ⟺ p < 0.05")
print(" 99% CI excludes 0 ⟺ p < 0.01")
print(" (1-α)% CI excludes H₀ ⟺ p < α")
demonstrate_relationship()
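Another way to see the equivalence: the two-sided p-value is exactly the α at which the confidence interval's boundary touches the null value. A sketch under the same normal approximation used later in this article (the diff and se values are illustrative):
def p_value_from_ci_inversion(diff, se):
    """Recover the p-value as the confidence level where the CI hits 0."""
    # A (1 - alpha) CI is diff ± z_{1-alpha/2} * se; its boundary sits
    # exactly at 0 when z_{1-alpha/2} = |diff| / se, i.e. when alpha = p.
    z = abs(diff) / se
    p_value = 2 * (1 - stats.norm.cdf(z))
    print(f"CI first touches 0 at the {100 * (1 - p_value):.1f}% confidence level")
    print(f"Two-sided p-value: {p_value:.4f}")
p_value_from_ci_inversion(diff=5.0, se=2.5)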
Visual Demonstration
def visualize_ci_pvalue_link():
"""
Show how CI relates to p-value visually.
"""
import matplotlib.pyplot as plt
np.random.seed(42)
# Three scenarios
scenarios = [
{'diff': 10, 'se': 3, 'label': 'Significant'},
{'diff': 6, 'se': 3, 'label': 'Borderline'},  # z = 2, p ≈ 0.046
{'diff': 1, 'se': 3, 'label': 'Not significant'}
]
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for ax, s in zip(axes, scenarios):
ci_low = s['diff'] - 1.96 * s['se']
ci_high = s['diff'] + 1.96 * s['se']
p_val = 2 * (1 - stats.norm.cdf(abs(s['diff'] / s['se'])))
# Plot CI
ax.errorbar(s['diff'], 0, xerr=[[s['diff'] - ci_low], [ci_high - s['diff']]],
fmt='o', capsize=10, markersize=10, capthick=2)
ax.axvline(0, color='red', linestyle='--', label='Null (H₀: diff=0)')
ax.set_xlim(-10, 20)
ax.set_ylim(-0.5, 0.5)
ax.set_title(f"{s['label']}\nDiff={s['diff']}, p={p_val:.3f}")
ax.set_xlabel('Effect Size')
ax.legend()
plt.tight_layout()
return fig
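Unlike the earlier functions, visualize_ci_pvalue_link only builds the figure, so a call and a save (or show) step is needed to see it (the filename is illustrative):
fig = visualize_ci_pvalue_link()
fig.savefig("ci_pvalue_link.png", dpi=150)  # or plt.show() in an interactive session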
Why CIs Are Often More Useful
CI Tells You What P-Value Can't
def ci_advantages():
"""
Demonstrate advantages of CIs over p-values.
"""
np.random.seed(42)
# Two scenarios with same p-value, very different implications
scenarios = [
{
'name': 'Precise estimate',
'diff': 5,
'se': 2.5,
'n': 500
},
{
'name': 'Imprecise estimate',
'diff': 16,
'se': 8,  # same z = diff/se = 2, hence the same p-value, but a much wider CI
'n': 50   # smaller sample: only a larger effect reaches the same p
}
]
print("TWO SCENARIOS WITH SAME P-VALUE")
print("=" * 60)
for s in scenarios:
ci_low = s['diff'] - 1.96 * s['se']
ci_high = s['diff'] + 1.96 * s['se']
p_val = 2 * (1 - stats.norm.cdf(abs(s['diff'] / s['se'])))
print(f"\n{s['name']} (n = {s['n']}):")
print(f" Observed difference: {s['diff']}")
print(f" P-value: {p_val:.4f}")
print(f" 95% CI: [{ci_low:.2f}, {ci_high:.2f}]")
print()
print("SAME P-VALUE, BUT CI TELLS YOU:")
print(" • CI width shows precision")
print(" • CI bounds show plausible effect range")
print(" • You can assess practical significance from CI bounds")
ci_advantages()
def ci_for_decision_making():
"""
How to use CIs for decisions.
"""
print("\nUSING CIs FOR DECISIONS")
print("=" * 60)
print()
scenarios = {
'CI fully above threshold': {
'ci': (3, 7),
'threshold': 2,
'decision': 'Implement - effect definitely exceeds threshold'
},
'CI overlaps threshold': {
'ci': (1, 5),
'threshold': 2,
'decision': 'Uncertain - need more data or consider risk tolerance'
},
'CI fully below threshold': {
'ci': (0.5, 2.5),
'threshold': 3,
'decision': 'Don\'t implement - effect likely below threshold'
},
'CI contains zero but above threshold possible': {
'ci': (-1, 4),
'threshold': 2,
'decision': 'Not significant, but practical effect possible - more data needed'
}
}
for name, info in scenarios.items():
print(f"\n{name}:")
print(f" CI: [{info['ci'][0]}, {info['ci'][1]}]")
print(f" Practical threshold: {info['threshold']}")
print(f" Decision: {info['decision']}")
ci_for_decision_making()
Interpreting Together
The Complete Picture
def interpret_together(diff, se, threshold=None, alpha=0.05):
"""
Interpret p-value and CI together for decisions.
"""
# Calculate statistics
z = diff / se
p_value = 2 * (1 - stats.norm.cdf(abs(z)))
z_crit = stats.norm.ppf(1 - alpha/2)
ci_low = diff - z_crit * se
ci_high = diff + z_crit * se
print("INTEGRATED INTERPRETATION")
print("=" * 60)
print()
print(f"Point estimate: {diff:.3f}")
print(f"95% CI: [{ci_low:.3f}, {ci_high:.3f}]")
print(f"P-value: {p_value:.4f}")
print()
print("STATISTICAL SIGNIFICANCE:")
if p_value < alpha:
print(f" ✓ Significant at α = {alpha}")
print(f" (CI does not include 0)")
else:
print(f" ✗ Not significant at α = {alpha}")
print(f" (CI includes 0)")
if threshold:
print()
print("PRACTICAL SIGNIFICANCE:")
if ci_low > threshold:
print(f" ✓ Definitely exceeds threshold ({threshold})")
print(f" (Entire CI above threshold)")
elif ci_high < threshold:
print(f" ✗ Definitely below threshold ({threshold})")
print(f" (Entire CI below threshold)")
else:
print(f" ? Uncertain relative to threshold ({threshold})")
print(f" (CI overlaps threshold)")
print()
print("RECOMMENDATION:")
if p_value < alpha and threshold and ci_low > threshold:
print(" Strong evidence for meaningful effect - implement")
elif p_value < alpha and threshold and ci_high < threshold:
print(" Significant but below threshold - may not be worth implementing")
elif p_value < alpha:
print(" Significant - examine CI to assess practical importance")
elif threshold and ci_high > threshold:
print(" Not significant, but practical effect still possible - gather more data")
else:
print(" No significant effect and unlikely to be practically important")
# Examples
print("\n" + "="*70)
print("SCENARIO 1: Clearly beneficial")
print("="*70)
interpret_together(diff=10, se=3, threshold=5)
print("\n" + "="*70)
print("SCENARIO 2: Significant but possibly trivial")
print("="*70)
interpret_together(diff=2, se=0.5, threshold=5)
print("\n" + "="*70)
print("SCENARIO 3: Not significant but potentially meaningful")
print("="*70)
interpret_together(diff=5, se=4, threshold=5)
Common Misinterpretations
def common_misinterpretations():
"""
Address common misunderstandings.
"""
print("COMMON MISINTERPRETATIONS TO AVOID")
print("=" * 60)
misinterpretations = {
'P-value myths': [
('p = 0.03 means 3% chance null is true',
'P-value is P(data|H₀), not P(H₀|data)'),
('p = 0.03 is "more significant" than p = 0.04',
'Both provide similar evidence against H₀; don\'t over-interpret small differences'),
('p > 0.05 means no effect exists',
'It means we can\'t rule out chance, not that effect is zero'),
],
'CI myths': [
('95% probability true value is in this interval',
'True value is fixed; either it\'s in there or not'),
('95% of data falls in this interval',
'CI is about parameter estimate, not data spread'),
('Overlapping CIs mean no significant difference',
'CIs can overlap but groups still differ significantly'),
]
}
for category, myths in misinterpretations.items():
print(f"\n{category}:")
print("-" * 50)
for myth, reality in myths:
print(f"\n ✗ WRONG: {myth}")
print(f" ✓ RIGHT: {reality}")
common_misinterpretations()
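The last CI myth deserves numbers. A minimal sketch with made-up group means and standard errors: each group's 95% CI overlaps the other's, yet the CI for their difference excludes 0 and the difference is significant.
def overlapping_cis_can_differ(mean1=0.0, mean2=3.0, se1=1.0, se2=1.0):
    """Two group CIs can overlap while their difference is significant."""
    z = stats.norm.ppf(0.975)
    ci1 = (mean1 - z * se1, mean1 + z * se1)
    ci2 = (mean2 - z * se2, mean2 + z * se2)
    # Standard error of the difference of two independent estimates
    se_diff = np.sqrt(se1**2 + se2**2)
    diff = mean2 - mean1
    ci_diff = (diff - z * se_diff, diff + z * se_diff)
    p = 2 * (1 - stats.norm.cdf(abs(diff) / se_diff))
    print(f"Group 1 CI: [{ci1[0]:.2f}, {ci1[1]:.2f}]")
    print(f"Group 2 CI: [{ci2[0]:.2f}, {ci2[1]:.2f}]  <- overlaps group 1")
    print(f"Diff CI:    [{ci_diff[0]:.2f}, {ci_diff[1]:.2f}]  <- excludes 0")
    print(f"P-value for the difference: {p:.3f}")
overlapping_cis_can_differ()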
Practical Decision Framework
def decision_framework():
"""
Framework for using p-values and CIs in decisions.
"""
print("""
DECISION FRAMEWORK
==================
STEP 1: Define what matters BEFORE analysis
• What's the minimum effect that would change your decision?
• What's the acceptable risk of false positive/negative?
STEP 2: Look at CI first
• What range of effects is plausible?
• Does the CI include practically important effects?
• Does the CI include trivial effects?
STEP 3: Consider p-value
• Is the result statistically significant?
• If significant but CI includes trivial effects: beware over-interpretation
• If not significant but CI includes important effects: consider getting more data
STEP 4: Make decision
Scenario A: CI entirely in "actionable" range, p < α
→ Strong evidence to act
Scenario B: CI entirely in "trivial" range, p < α
→ Significant but not worth acting on
Scenario C: CI in "actionable" range, p > α
→ Promising but uncertain; consider more data
Scenario D: CI entirely in "trivial" range, p > α
→ No evidence of meaningful effect
Scenario E: CI spans trivial and actionable
→ Inconclusive; more data needed for confident decision
""")
decision_framework()
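The five scenarios can be encoded directly. A sketch assuming a normal approximation and a positive practical threshold; scenario C is read here as a point estimate in the actionable range whose CI still spills below the threshold (the function name and example values are illustrative):
def classify_decision(diff, se, threshold, alpha=0.05):
    """Map a result onto scenarios A-E from the framework above."""
    z_crit = stats.norm.ppf(1 - alpha / 2)
    ci_low, ci_high = diff - z_crit * se, diff + z_crit * se
    p = 2 * (1 - stats.norm.cdf(abs(diff) / se))
    if ci_low > threshold:
        # Entire CI is actionable; with a positive threshold this also
        # guarantees the CI excludes 0, so the result is significant.
        verdict = "A: strong evidence to act"
    elif ci_high < threshold:
        verdict = ("B: significant but not worth acting on" if p < alpha
                   else "D: no evidence of meaningful effect")
    elif diff > threshold:
        verdict = "C: promising but uncertain; consider more data"
    else:
        verdict = "E: inconclusive; more data needed"
    return f"{verdict} (p = {p:.3f}, CI = [{ci_low:.2f}, {ci_high:.2f}])"
for d, s in [(10, 2), (2, 0.5), (7, 4), (0.5, 1), (4, 4)]:
    print(classify_decision(diff=d, se=s, threshold=5))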
R Implementation
# P-value and CI interpretation in R
interpret_result <- function(diff, se, threshold = NULL, alpha = 0.05) {
z <- diff / se
p_value <- 2 * (1 - pnorm(abs(z)))
z_crit <- qnorm(1 - alpha/2)
ci_low <- diff - z_crit * se
ci_high <- diff + z_crit * se
cat("INTEGRATED INTERPRETATION\n")
cat(rep("=", 50), "\n\n")
cat(sprintf("Point estimate: %.3f\n", diff))
cat(sprintf("95%% CI: [%.3f, %.3f]\n", ci_low, ci_high))
cat(sprintf("P-value: %.4f\n", p_value))
cat("\nStatistical significance:\n")
if (p_value < alpha) {
cat(sprintf(" Significant at alpha = %.2f\n", alpha))
} else {
cat(sprintf(" Not significant at alpha = %.2f\n", alpha))
}
if (!is.null(threshold)) {
cat("\nPractical significance:\n")
if (ci_low > threshold) {
cat(sprintf(" Definitely exceeds threshold (%.1f)\n", threshold))
} else if (ci_high < threshold) {
cat(sprintf(" Definitely below threshold (%.1f)\n", threshold))
} else {
cat(sprintf(" Uncertain relative to threshold (%.1f)\n", threshold))
}
}
invisible(list(p_value = p_value, ci = c(ci_low, ci_high)))
}
# Usage:
# interpret_result(diff = 5, se = 2, threshold = 3)
Related Methods
- Effect Sizes Master Guide — The pillar article
- When CIs and P-Values Disagree — Resolving apparent conflicts
- Practical Significance Thresholds — Setting meaningful thresholds
Key Takeaway
P-values and confidence intervals are mathematically linked but serve different purposes. P-values address whether an effect exists (statistical significance). CIs show how big it might be and with what precision. For decisions, focus on the CI: Does it contain only trivial effects? Only meaningful effects? Both? This determines your action more reliably than whether p crosses 0.05. Report both, but let the CI guide your practical interpretation.
References
- Amrhein, V., Greenland, S., & McShane, B. (2019). Scientists rise up against statistical significance. *Nature*, 567(7748), 305-307. https://doi.org/10.1038/d41586-019-00857-9
- https://www.jstor.org/stable/2684655
- Cumming, G., & Finch, S. (2005). Inference by eye: confidence intervals and how to read pictures of data. *American Psychologist*, 60(2), 170-180.
- Greenland, S., Senn, S. J., Rothman, K. J., Carlin, J. B., Poole, C., Goodman, S. N., & Altman, D. G. (2016). Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations. *European Journal of Epidemiology*, 31(4), 337-350.
- Wasserstein, R. L., & Lazar, N. A. (2016). The ASA statement on p-values: context, process, and purpose. *The American Statistician*, 70(2), 129-133.
Frequently Asked Questions
Which is better, p-values or confidence intervals?
Neither replaces the other. For a given test they carry the same significance information, but CIs also show magnitude and precision, which is what most decisions hinge on. Report both, and lean on the CI for interpretation.
Why do some statisticians want to abandon p-values?
Mostly because of how often they are misread: as the probability that H₀ is true, as a measure of effect size, or as a hard yes/no verdict at 0.05. The ASA statement and the Nature comment in the references lay out these concerns.
Can a 95% CI contain 0 but p < 0.05?
Only when the interval and the test come from different procedures (for example, a Wald interval paired with an exact or likelihood-ratio test). When both are derived from the same procedure at matching levels, they agree exactly.