P-Values vs. Confidence Intervals: How to Interpret Both for Decisions
Understand the relationship between p-values and confidence intervals, when they agree, when they seem to disagree, and how to use them together for better decisions.
Quick Hits
- P-values tell you probability of data under null; CIs tell you plausible parameter values
- 95% CI excludes 0 ⟺ p < 0.05 (for testing against null of 0)
- CIs give more information: direction, magnitude, and precision
- For decisions, CI bounds matter more than p-values
TL;DR
P-values tell you the probability of seeing your data (or more extreme) if the null hypothesis is true. Confidence intervals give you a range of plausible values for the true parameter. They're mathematically linked: a 95% CI that excludes zero corresponds to p < 0.05. But CIs are more informative for decisions because they show effect magnitude and precision, not just whether to reject the null.
What Each Tells You
P-Values
import numpy as np
from scipy import stats
def explain_p_value():
"""
Clarify what a p-value actually means.
"""
print("WHAT A P-VALUE TELLS YOU")
print("=" * 50)
print()
print("P-value = P(data this extreme or more | H₀ is true)")
print()
print("In plain English:")
print(" 'If there were truly no effect, what's the probability")
print(" of seeing results as extreme as what we observed?'")
print()
print("P < 0.05 means:")
print(" 'This would be surprising if H₀ were true'")
print(" 'We reject H₀ at the 0.05 level'")
print()
print("P-VALUE DOES NOT MEAN:")
print(" ✗ P(H₀ is true)")
print(" ✗ P(H₁ is true)")
print(" ✗ Probability the effect is real")
print(" ✗ Size of the effect")
print()
print("WHAT P-VALUES DON'T TELL YOU:")
print(" • How big the effect is")
print(" • Whether the effect matters practically")
print(" • The probability of replication")
explain_p_value()
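A quick way to internalize the definition: when H₀ is actually true, p-values are uniformly distributed, so about 5% of null experiments produce p < 0.05 by chance alone. A minimal simulation sketch (the sample sizes and the shared mean of 100 are illustrative assumptions):
def simulate_null_p_values(n_sims=10_000, n=50, seed=0):
    """Simulate two-sample t-tests when H0 (no difference) is true."""
    rng = np.random.default_rng(seed)
    p_values = np.empty(n_sims)
    for i in range(n_sims):
        a = rng.normal(100, 15, n)  # both groups share the same mean,
        b = rng.normal(100, 15, n)  # so H0 is true by construction
        p_values[i] = stats.ttest_ind(a, b).pvalue
    # Under H0 the p-value is ~Uniform(0, 1): about 5% fall below 0.05
    print(f"Fraction of null experiments with p < 0.05: {np.mean(p_values < 0.05):.3f}")
simulate_null_p_values()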
Confidence Intervals
def explain_confidence_interval():
"""
Clarify what a confidence interval means.
"""
print("WHAT A CONFIDENCE INTERVAL TELLS YOU")
print("=" * 50)
print()
print("95% CI: A range constructed such that if we repeated")
print("the study many times, 95% of such intervals would")
print("contain the true parameter value.")
print()
print("In practice:")
print(" 'We're 95% confident the true effect is in this range'")
print()
print("WHAT A CI TELLS YOU:")
print(" ✓ Plausible values for the true effect")
print(" ✓ Precision of the estimate (narrow = precise)")
print(" ✓ Whether the effect is significant (if 0 excluded)")
print(" ✓ Whether the effect might be practically important")
print()
print("CI DOES NOT MEAN:")
print(" ✗ 95% of the data falls in this range")
print(" ✗ 95% probability the true value is in THIS interval")
print(" (The true value is fixed; it's either in or out)")
explain_confidence_interval()
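The repeated-sampling definition above can be checked directly: draw many samples, build a 95% t-interval from each, and count how often the interval covers the truth. A minimal sketch (the true mean of 100, sd of 15, and simulation count are illustrative assumptions):
def simulate_ci_coverage(true_mean=100, sd=15, n=50, n_sims=10_000, seed=0):
    """Estimate how often a nominal 95% t-interval covers the true mean."""
    rng = np.random.default_rng(seed)
    t_crit = stats.t.ppf(0.975, df=n - 1)
    covered = 0
    for _ in range(n_sims):
        x = rng.normal(true_mean, sd, n)
        half_width = t_crit * x.std(ddof=1) / np.sqrt(n)
        covered += (x.mean() - half_width) <= true_mean <= (x.mean() + half_width)
    print(f"Observed coverage of nominal 95% CIs: {covered / n_sims:.3f}")
simulate_ci_coverage()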
The Mathematical Relationship
They're Two Sides of the Same Coin
def demonstrate_relationship():
"""
Show the mathematical link between p-values and CIs.
"""
np.random.seed(42)
# Generate data
control = np.random.normal(100, 15, 50)
treatment = np.random.normal(108, 15, 50)
diff = np.mean(treatment) - np.mean(control)
se = np.sqrt(np.var(control, ddof=1)/50 + np.var(treatment, ddof=1)/50)
# P-value (two-sided test against H₀: diff = 0)
t_stat = diff / se
df = 98  # n1 + n2 - 2; the Welch correction would give a similar value here
p_value = 2 * (1 - stats.t.cdf(abs(t_stat), df))
# 95% CI
t_crit = stats.t.ppf(0.975, df)
ci_low = diff - t_crit * se
ci_high = diff + t_crit * se
print("P-VALUE AND CI RELATIONSHIP")
print("=" * 50)
print()
print(f"Observed difference: {diff:.2f}")
print(f"Standard error: {se:.2f}")
print()
print(f"P-value: {p_value:.4f}")
print(f" → p {'<' if p_value < 0.05 else '≥'} 0.05")
print()
print(f"95% CI: [{ci_low:.2f}, {ci_high:.2f}]")
print(f" → 0 is {'NOT ' if ci_low > 0 or ci_high < 0 else ''}in CI")
print()
print("THE KEY RELATIONSHIP:")
print(" 95% CI excludes 0 ⟺ p < 0.05")
print(" 99% CI excludes 0 ⟺ p < 0.01")
print(" (1-α)% CI excludes H₀ ⟺ p < α")
demonstrate_relationship()
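Another way to see the equivalence: the two-sided p-value is exactly the α at which the confidence interval's boundary touches the null value. A sketch under the same normal approximation used later in this article (the diff and se values are illustrative):
def p_value_from_ci_inversion(diff, se):
    """Recover the p-value as the confidence level where the CI hits 0."""
    # A (1 - alpha) CI is diff ± z_{1-alpha/2} * se; its boundary sits
    # exactly at 0 when z_{1-alpha/2} = |diff| / se, i.e. when alpha = p.
    z = abs(diff) / se
    p_value = 2 * (1 - stats.norm.cdf(z))
    print(f"CI first touches 0 at the {100 * (1 - p_value):.1f}% confidence level")
    print(f"Two-sided p-value: {p_value:.4f}")
p_value_from_ci_inversion(diff=5.0, se=2.5)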
Visual Demonstration
def visualize_ci_pvalue_link():
"""
Show how CI relates to p-value visually.
"""
import matplotlib.pyplot as plt
np.random.seed(42)
# Three scenarios
scenarios = [
{'diff': 10, 'se': 3, 'label': 'Significant'},
{'diff': 6, 'se': 3, 'label': 'Borderline'},  # z = 2, p ≈ 0.046
{'diff': 1, 'se': 3, 'label': 'Not significant'}
]
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for ax, s in zip(axes, scenarios):
ci_low = s['diff'] - 1.96 * s['se']
ci_high = s['diff'] + 1.96 * s['se']
p_val = 2 * (1 - stats.norm.cdf(abs(s['diff'] / s['se'])))
# Plot CI
ax.errorbar(s['diff'], 0, xerr=[[s['diff'] - ci_low], [ci_high - s['diff']]],
fmt='o', capsize=10, markersize=10, capthick=2)
ax.axvline(0, color='red', linestyle='--', label='Null (H₀: diff=0)')
ax.set_xlim(-10, 20)
ax.set_ylim(-0.5, 0.5)
ax.set_title(f"{s['label']}\nDiff={s['diff']}, p={p_val:.3f}")
ax.set_xlabel('Effect Size')
ax.legend()
plt.tight_layout()
return fig
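Unlike the earlier functions, visualize_ci_pvalue_link only builds the figure, so a call and a save (or show) step is needed to see it (the filename is illustrative):
fig = visualize_ci_pvalue_link()
fig.savefig("ci_pvalue_link.png", dpi=150)  # or plt.show() in an interactive session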
Why CIs Are Often More Useful
CI Tells You What P-Value Can't
def ci_advantages():
"""
Demonstrate advantages of CIs over p-values.
"""
np.random.seed(42)
# Two scenarios with same p-value, very different implications
scenarios = [
{
'name': 'Precise estimate',
'diff': 5,
'se': 2.5,
'n': 500
},
{
'name': 'Imprecise estimate',
'diff': 16,
'se': 8,  # same z = diff/se = 2, hence the same p-value, but a much wider CI
'n': 50   # smaller sample: only a larger effect reaches the same p
}
]
print("TWO SCENARIOS WITH SAME P-VALUE")
print("=" * 60)
for s in scenarios:
ci_low = s['diff'] - 1.96 * s['se']
ci_high = s['diff'] + 1.96 * s['se']
p_val = 2 * (1 - stats.norm.cdf(abs(s['diff'] / s['se'])))
print(f"\n{s['name']} (n = {s['n']}):")
print(f" Observed difference: {s['diff']}")
print(f" P-value: {p_val:.4f}")
print(f" 95% CI: [{ci_low:.2f}, {ci_high:.2f}]")
print()
print("SAME P-VALUE, BUT CI TELLS YOU:")
print(" • CI width shows precision")
print(" • CI bounds show plausible effect range")
print(" • You can assess practical significance from CI bounds")
ci_advantages()
def ci_for_decision_making():
"""
How to use CIs for decisions.
"""
print("\nUSING CIs FOR DECISIONS")
print("=" * 60)
print()
scenarios = {
'CI fully above threshold': {
'ci': (3, 7),
'threshold': 2,
'decision': 'Implement - effect definitely exceeds threshold'
},
'CI overlaps threshold': {
'ci': (1, 5),
'threshold': 2,
'decision': 'Uncertain - need more data or consider risk tolerance'
},
'CI fully below threshold': {
'ci': (0.5, 2.5),
'threshold': 3,
'decision': 'Don\'t implement - effect likely below threshold'
},
'CI contains zero but above threshold possible': {
'ci': (-1, 4),
'threshold': 2,
'decision': 'Not significant, but practical effect possible - more data needed'
}
}
for name, info in scenarios.items():
print(f"\n{name}:")
print(f" CI: [{info['ci'][0]}, {info['ci'][1]}]")
print(f" Practical threshold: {info['threshold']}")
print(f" Decision: {info['decision']}")
ci_for_decision_making()
Interpreting Together
The Complete Picture
def interpret_together(diff, se, threshold=None, alpha=0.05):
"""
Interpret p-value and CI together for decisions.
"""
# Calculate statistics
z = diff / se
p_value = 2 * (1 - stats.norm.cdf(abs(z)))
z_crit = stats.norm.ppf(1 - alpha/2)
ci_low = diff - z_crit * se
ci_high = diff + z_crit * se
print("INTEGRATED INTERPRETATION")
print("=" * 60)
print()
print(f"Point estimate: {diff:.3f}")
print(f"95% CI: [{ci_low:.3f}, {ci_high:.3f}]")
print(f"P-value: {p_value:.4f}")
print()
print("STATISTICAL SIGNIFICANCE:")
if p_value < alpha:
print(f" ✓ Significant at α = {alpha}")
print(f" (CI does not include 0)")
else:
print(f" ✗ Not significant at α = {alpha}")
print(f" (CI includes 0)")
if threshold:
print()
print("PRACTICAL SIGNIFICANCE:")
if ci_low > threshold:
print(f" ✓ Definitely exceeds threshold ({threshold})")
print(f" (Entire CI above threshold)")
elif ci_high < threshold:
print(f" ✗ Definitely below threshold ({threshold})")
print(f" (Entire CI below threshold)")
else:
print(f" ? Uncertain relative to threshold ({threshold})")
print(f" (CI overlaps threshold)")
print()
print("RECOMMENDATION:")
if p_value < alpha and threshold and ci_low > threshold:
print(" Strong evidence for meaningful effect - implement")
elif p_value < alpha and threshold and ci_high < threshold:
print(" Significant but below threshold - may not be worth implementing")
elif p_value < alpha:
print(" Significant - examine CI to assess practical importance")
elif threshold and ci_high > threshold:
print(" Not significant, but practical effect still possible - gather more data")
else:
print(" No significant effect and unlikely to be practically important")
# Examples
print("\n" + "="*70)
print("SCENARIO 1: Clearly beneficial")
print("="*70)
interpret_together(diff=10, se=3, threshold=5)
print("\n" + "="*70)
print("SCENARIO 2: Significant but possibly trivial")
print("="*70)
interpret_together(diff=2, se=0.5, threshold=5)
print("\n" + "="*70)
print("SCENARIO 3: Not significant but potentially meaningful")
print("="*70)
interpret_together(diff=5, se=4, threshold=5)
Common Misinterpretations
def common_misinterpretations():
"""
Address common misunderstandings.
"""
print("COMMON MISINTERPRETATIONS TO AVOID")
print("=" * 60)
misinterpretations = {
'P-value myths': [
('p = 0.03 means 3% chance null is true',
'P-value is P(data|H₀), not P(H₀|data)'),
('p = 0.03 is "more significant" than p = 0.04',
'Both provide similar evidence against H₀; don\'t over-interpret small differences'),
('p > 0.05 means no effect exists',
'It means we can\'t rule out chance, not that effect is zero'),
],
'CI myths': [
('95% probability true value is in this interval',
'True value is fixed; either it\'s in there or not'),
('95% of data falls in this interval',
'CI is about parameter estimate, not data spread'),
('Overlapping CIs mean no significant difference',
'CIs can overlap but groups still differ significantly'),
]
}
for category, myths in misinterpretations.items():
print(f"\n{category}:")
print("-" * 50)
for myth, reality in myths:
print(f"\n ✗ WRONG: {myth}")
print(f" ✓ RIGHT: {reality}")
common_misinterpretations()
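The last CI myth deserves numbers. A minimal sketch with made-up group means and standard errors: each group's 95% CI overlaps the other's, yet the CI for their difference excludes 0 and the difference is significant.
def overlapping_cis_can_differ(mean1=0.0, mean2=3.0, se1=1.0, se2=1.0):
    """Two group CIs can overlap while their difference is significant."""
    z = stats.norm.ppf(0.975)
    ci1 = (mean1 - z * se1, mean1 + z * se1)
    ci2 = (mean2 - z * se2, mean2 + z * se2)
    # Standard error of the difference of two independent estimates
    se_diff = np.sqrt(se1**2 + se2**2)
    diff = mean2 - mean1
    ci_diff = (diff - z * se_diff, diff + z * se_diff)
    p = 2 * (1 - stats.norm.cdf(abs(diff) / se_diff))
    print(f"Group 1 CI: [{ci1[0]:.2f}, {ci1[1]:.2f}]")
    print(f"Group 2 CI: [{ci2[0]:.2f}, {ci2[1]:.2f}]  <- overlaps group 1")
    print(f"Diff CI:    [{ci_diff[0]:.2f}, {ci_diff[1]:.2f}]  <- excludes 0")
    print(f"P-value for the difference: {p:.3f}")
overlapping_cis_can_differ()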
Practical Decision Framework
def decision_framework():
"""
Framework for using p-values and CIs in decisions.
"""
print("""
DECISION FRAMEWORK
==================
STEP 1: Define what matters BEFORE analysis
• What's the minimum effect that would change your decision?
• What's the acceptable risk of false positive/negative?
STEP 2: Look at CI first
• What range of effects is plausible?
• Does the CI include practically important effects?
• Does the CI include trivial effects?
STEP 3: Consider p-value
• Is the result statistically significant?
• If significant but CI includes trivial effects: beware over-interpretation
• If not significant but CI includes important effects: consider getting more data
STEP 4: Make decision
Scenario A: CI entirely in "actionable" range, p < α
→ Strong evidence to act
Scenario B: CI entirely in "trivial" range, p < α
→ Significant but not worth acting on
Scenario C: CI in "actionable" range, p > α
→ Promising but uncertain; consider more data
Scenario D: CI entirely in "trivial" range, p > α
→ No evidence of meaningful effect
Scenario E: CI spans trivial and actionable
→ Inconclusive; more data needed for confident decision
""")
decision_framework()
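The five scenarios can be encoded directly. A sketch assuming a normal approximation and a positive practical threshold; scenario C is read here as a point estimate in the actionable range whose CI still spills below the threshold (the function name and example values are illustrative):
def classify_decision(diff, se, threshold, alpha=0.05):
    """Map a result onto scenarios A-E from the framework above."""
    z_crit = stats.norm.ppf(1 - alpha / 2)
    ci_low, ci_high = diff - z_crit * se, diff + z_crit * se
    p = 2 * (1 - stats.norm.cdf(abs(diff) / se))
    if ci_low > threshold:
        # Entire CI is actionable; with a positive threshold this also
        # guarantees the CI excludes 0, so the result is significant.
        verdict = "A: strong evidence to act"
    elif ci_high < threshold:
        verdict = ("B: significant but not worth acting on" if p < alpha
                   else "D: no evidence of meaningful effect")
    elif diff > threshold:
        verdict = "C: promising but uncertain; consider more data"
    else:
        verdict = "E: inconclusive; more data needed"
    return f"{verdict} (p = {p:.3f}, CI = [{ci_low:.2f}, {ci_high:.2f}])"
for d, s in [(10, 2), (2, 0.5), (7, 4), (0.5, 1), (4, 4)]:
    print(classify_decision(diff=d, se=s, threshold=5))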
R Implementation
# P-value and CI interpretation in R
interpret_result <- function(diff, se, threshold = NULL, alpha = 0.05) {
z <- diff / se
p_value <- 2 * (1 - pnorm(abs(z)))
z_crit <- qnorm(1 - alpha/2)
ci_low <- diff - z_crit * se
ci_high <- diff + z_crit * se
cat("INTEGRATED INTERPRETATION\n")
cat(rep("=", 50), "\n\n")
cat(sprintf("Point estimate: %.3f\n", diff))
cat(sprintf("95%% CI: [%.3f, %.3f]\n", ci_low, ci_high))
cat(sprintf("P-value: %.4f\n", p_value))
cat("\nStatistical significance:\n")
if (p_value < alpha) {
cat(sprintf(" Significant at alpha = %.2f\n", alpha))
} else {
cat(sprintf(" Not significant at alpha = %.2f\n", alpha))
}
if (!is.null(threshold)) {
cat("\nPractical significance:\n")
if (ci_low > threshold) {
cat(sprintf(" Definitely exceeds threshold (%.1f)\n", threshold))
} else if (ci_high < threshold) {
cat(sprintf(" Definitely below threshold (%.1f)\n", threshold))
} else {
cat(sprintf(" Uncertain relative to threshold (%.1f)\n", threshold))
}
}
invisible(list(p_value = p_value, ci = c(ci_low, ci_high)))
}
# Usage:
# interpret_result(diff = 5, se = 2, threshold = 3)
Related Methods
- Effect Sizes Master Guide — The pillar article
- When CIs and P-Values Disagree — Resolving apparent conflicts
- Practical Significance Thresholds — Setting meaningful thresholds
Key Takeaway
P-values and confidence intervals are mathematically linked but serve different purposes. P-values address whether an effect exists (statistical significance). CIs show how big it might be and with what precision. For decisions, focus on the CI: Does it contain only trivial effects? Only meaningful effects? Both? This determines your action more reliably than whether p crosses 0.05. Report both, but let the CI guide your practical interpretation.
References
- Amrhein, V., Greenland, S., & McShane, B. (2019). Scientists rise up against statistical significance. *Nature*, 567(7748), 305-307. https://doi.org/10.1038/d41586-019-00857-9
- https://www.jstor.org/stable/2684655
- Cumming, G., & Finch, S. (2005). Inference by eye: confidence intervals and how to read pictures of data. *American Psychologist*, 60(2), 170-180.
- Greenland, S., Senn, S. J., Rothman, K. J., Carlin, J. B., Poole, C., Goodman, S. N., & Altman, D. G. (2016). Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations. *European Journal of Epidemiology*, 31(4), 337-350.
- Wasserstein, R. L., & Lazar, N. A. (2016). The ASA statement on p-values: context, process, and purpose. *The American Statistician*, 70(2), 129-133.
Frequently Asked Questions
Which is better, p-values or confidence intervals?
Neither replaces the other. For a given test they carry the same significance information, but CIs also show magnitude and precision, which is what most decisions hinge on. Report both, and lean on the CI for interpretation.
Why do some statisticians want to abandon p-values?
Mostly because of how often they are misread: as the probability that H₀ is true, as a measure of effect size, or as a hard yes/no verdict at 0.05. The ASA statement and the Nature comment in the references lay out these concerns.
Can a 95% CI contain 0 but p < 0.05?
Only when the interval and the test come from different procedures (for example, a Wald interval paired with an exact or likelihood-ratio test). When both are derived from the same procedure at matching levels, they agree exactly.