
When Confidence Intervals and P-Values Seem to Disagree

Understand why CIs and p-values sometimes appear to conflict and how to resolve these apparent contradictions. Learn common scenarios and the correct interpretation.


Quick Hits

  • 95% CI excluding 0 ⟺ p < 0.05 for a two-sided test of H₀: difference = 0
  • Apparent conflicts usually involve comparing the wrong things
  • Overlapping group CIs ≠ non-significant difference
  • One-sided p-values and two-sided CIs don't match directly

TL;DR

When p-values and confidence intervals seem to contradict each other, there's almost always a mismatch in what's being compared. The core rule: for two-sided tests with α = 0.05, a 95% CI that excludes 0 always means p < 0.05, and vice versa. Common "conflicts" involve overlapping group CIs (which don't imply non-significance), one-sided vs. two-sided comparisons, or different alpha levels.


The Fundamental Relationship

They're Mathematically Linked

import numpy as np
from scipy import stats

def demonstrate_linkage():
    """
    Show the mathematical link between p-values and CIs.
    """
    np.random.seed(42)

    # Generate data
    control = np.random.normal(100, 15, 50)
    treatment = np.random.normal(108, 15, 50)

    diff = np.mean(treatment) - np.mean(control)
    se = np.sqrt(np.var(control, ddof=1)/50 + np.var(treatment, ddof=1)/50)

    # Two-sided t-test (treatment first so the t-statistic's sign matches diff)
    t_stat, p_value = stats.ttest_ind(treatment, control)

    # 95% CI for difference
    t_crit = stats.t.ppf(0.975, 98)
    ci = (diff - t_crit * se, diff + t_crit * se)

    print("THE FUNDAMENTAL RELATIONSHIP")
    print("=" * 50)
    print()
    print(f"Difference: {diff:.2f}")
    print(f"95% CI: [{ci[0]:.2f}, {ci[1]:.2f}]")
    print(f"P-value (two-sided): {p_value:.4f}")
    print()

    # Check relationship
    ci_excludes_zero = ci[0] > 0 or ci[1] < 0
    p_significant = p_value < 0.05

    print("Checking relationship:")
    print(f"  95% CI excludes 0: {ci_excludes_zero}")
    print(f"  p < 0.05: {p_significant}")
    print()

    if ci_excludes_zero == p_significant:
        print("✓ They AGREE (as they always must)")
    else:
        print("⚠ Something is wrong (this should never happen)")


demonstrate_linkage()

Why They Must Agree

def explain_equivalence():
    """
    Explain why 95% CI and p = 0.05 are equivalent.
    """
    print("WHY 95% CI AND p = 0.05 ARE EQUIVALENT")
    print("=" * 50)
    print()
    print("The 95% CI contains all values θ where:")
    print("  p-value for testing H₀: μ = θ would be ≥ 0.05")
    print()
    print("Therefore:")
    print("  • If 0 is OUTSIDE the CI → p < 0.05 for H₀: μ = 0")
    print("  • If 0 is INSIDE the CI → p ≥ 0.05 for H₀: μ = 0")
    print()
    print("The CI is an INVERTED hypothesis test:")
    print("  It's the set of null values you would NOT reject")
    print()
    print("Similarly:")
    print("  99% CI ↔ p = 0.01")
    print("  90% CI ↔ p = 0.10")


explain_equivalence()
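You can verify the inversion numerically: scan candidate null values θ, keep the ones with p ≥ 0.05, and the surviving set is the 95% CI. A minimal sketch (the effect and SE below are illustrative, not from real data):

def ci_by_test_inversion():
    """
    Recover the 95% CI by inverting a z-test: the CI is exactly
    the set of null values theta with p >= 0.05.
    """
    diff, se = 5.2, 2.1   # illustrative effect and SE

    # Scan candidate nulls; keep the ones we would NOT reject
    thetas = np.linspace(diff - 5*se, diff + 5*se, 100001)
    p_vals = 2 * (1 - stats.norm.cdf(np.abs((diff - thetas) / se)))
    kept = thetas[p_vals >= 0.05]

    print(f"Inverted-test CI: [{kept.min():.3f}, {kept.max():.3f}]")
    print(f"Closed-form CI:   [{diff - 1.96*se:.3f}, {diff + 1.96*se:.3f}]")


ci_by_test_inversion()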

Common "Conflicts" (That Aren't Really Conflicts)

Scenario 1: Overlapping Group CIs

The most common source of confusion.

def overlapping_ci_scenario():
    """
    Show why overlapping group CIs can still show significant difference.
    """
    np.random.seed(42)

    n = 50
    control = np.random.normal(100, 15, n)
    treatment = np.random.normal(107, 15, n)

    # Individual group CIs (normal approximation is fine for display)
    def group_ci(data):
        m = np.mean(data)
        se = np.std(data, ddof=1) / np.sqrt(len(data))
        return (m - 1.96*se, m + 1.96*se)

    control_ci = group_ci(control)
    treatment_ci = group_ci(treatment)

    # CI for the difference (t critical value, matching the t-test below)
    diff = np.mean(treatment) - np.mean(control)
    se_diff = np.sqrt(np.var(control, ddof=1)/n + np.var(treatment, ddof=1)/n)
    t_crit = stats.t.ppf(0.975, 2*n - 2)
    diff_ci = (diff - t_crit*se_diff, diff + t_crit*se_diff)

    # P-value
    _, p = stats.ttest_ind(control, treatment)

    print("OVERLAPPING GROUP CIs ≠ NON-SIGNIFICANT DIFFERENCE")
    print("=" * 60)
    print()
    print("Individual Group CIs:")
    print(f"  Control: [{control_ci[0]:.1f}, {control_ci[1]:.1f}]")
    print(f"  Treatment: [{treatment_ci[0]:.1f}, {treatment_ci[1]:.1f}]")
    print()

    # Check overlap
    overlap = control_ci[1] > treatment_ci[0] and treatment_ci[1] > control_ci[0]
    print(f"Do the group CIs overlap? {overlap}")

    print()
    print("CI for the DIFFERENCE:")
    print(f"  Difference: {diff:.1f}")
    print(f"  95% CI: [{diff_ci[0]:.1f}, {diff_ci[1]:.1f}]")
    print(f"  Excludes 0: {diff_ci[0] > 0 or diff_ci[1] < 0}")

    print()
    print(f"P-value: {p:.4f}")
    print(f"Significant: {p < 0.05}")

    print()
    print("KEY INSIGHT:")
    print("  The SE of a difference is SMALLER than you'd guess")
    print("  from looking at individual CIs.")
    print()
    print("  SE_diff = √(SE₁² + SE₂²)")
    print("  NOT SE₁ + SE₂")
    print()
    print("  So difference can be significant even with overlapping group CIs!")


overlapping_ci_scenario()
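A handy rule of thumb follows from that insight (this is the "inference by eye" result discussed in Cumming & Finch, 2005, listed in the References): assuming equal SEs in both groups and a normal approximation, two 95% group CIs that just touch already imply p ≈ 0.006 for the difference. A quick sketch:

def just_touching_cis():
    """
    Assuming equal SEs and a normal approximation: if two 95%
    group CIs just touch, how significant is the difference?
    """
    se = 1.0                    # same SE in both groups (assumption)
    gap = 2 * 1.96 * se         # just-touching: means differ by two half-widths
    se_diff = np.sqrt(se**2 + se**2)

    z = gap / se_diff           # = 2 * 1.96 / sqrt(2) ≈ 2.77
    p = 2 * (1 - stats.norm.cdf(z))

    print(f"z for the difference: {z:.2f}")
    print(f"Two-sided p-value: {p:.4f}")   # ≈ 0.0056, well under 0.05


just_touching_cis()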

Scenario 2: One-Sided vs. Two-Sided

def one_vs_two_sided():
    """
    Show mismatch between one-sided p-value and two-sided CI.
    """
    np.random.seed(42)

    n = 30
    control = np.random.normal(100, 15, n)
    treatment = np.random.normal(106, 15, n)

    diff = np.mean(treatment) - np.mean(control)
    se = np.sqrt(np.var(control, ddof=1)/n + np.var(treatment, ddof=1)/n)

    # Two-sided test and CI (t critical value, matching the t-based p-value)
    t_stat = diff / se
    t_crit = stats.t.ppf(0.975, 2*n-2)
    p_two = 2 * (1 - stats.t.cdf(abs(t_stat), 2*n-2))
    ci_95 = (diff - t_crit*se, diff + t_crit*se)

    # One-sided test (treatment > control)
    p_one = 1 - stats.t.cdf(t_stat, 2*n-2)

    # One-sided 95% CI: a lower bound only
    # (numerically the lower bound of a two-sided 90% CI)
    lower_95 = diff - stats.t.ppf(0.95, 2*n-2) * se

    print("ONE-SIDED vs. TWO-SIDED")
    print("=" * 50)
    print()
    print(f"Difference: {diff:.2f}")
    print()
    print("TWO-SIDED:")
    print(f"  95% CI: [{ci_95[0]:.2f}, {ci_95[1]:.2f}]")
    print(f"  P-value: {p_two:.4f}")
    print(f"  Significant at α = 0.05: {p_two < 0.05}")
    print()
    print("ONE-SIDED (H₁: treatment > control):")
    print(f"  P-value: {p_one:.4f}")
    print(f"  Significant at α = 0.05: {p_one < 0.05}")
    print(f"  95% lower bound: {ci_90_lower:.2f}")
    print()
    print("APPARENT CONFLICT?")
    print("  The 95% two-sided CI might include 0")
    print("  But one-sided p-value < 0.05")
    print()
    print("RESOLUTION:")
    print("  Two-sided 95% CI ↔ Two-sided p at α = 0.05")
    print("  One-sided p at α = 0.05 ↔ One-sided 90% CI")
    print("  (or lower/upper bound of 90% two-sided CI)")


one_vs_two_sided()
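To line the two up, match the one-sided α to the two-sided confidence level: a one-sided test at level α shares its critical value with a two-sided (1 − 2α) CI. A small illustrative helper:

def matching_ci_level(alpha_one_sided):
    """
    Illustrative helper: a one-sided test at level alpha shares its
    critical value with a two-sided (1 - 2*alpha) CI.
    """
    conf_two_sided = 1 - 2 * alpha_one_sided
    z_crit = stats.norm.ppf(1 - alpha_one_sided)
    print(f"One-sided alpha = {alpha_one_sided}: "
          f"matches a {conf_two_sided:.0%} two-sided CI "
          f"(critical value {z_crit:.3f} in both cases)")


matching_ci_level(0.05)    # -> 90% two-sided CI
matching_ci_level(0.025)   # -> 95% two-sided CI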

Scenario 3: Different Alpha Levels

def different_alpha_levels():
    """
    Show mismatch from different alpha levels.
    """
    np.random.seed(42)

    n = 50
    data = np.random.normal(105, 30, n)

    mean = np.mean(data)
    se = np.std(data, ddof=1) / np.sqrt(n)

    # Test against H₀: μ = 100
    t_stat = (mean - 100) / se
    p = 2 * (1 - stats.t.cdf(abs(t_stat), n-1))

    # CIs at different levels (t critical values, to match the t-based p)
    t90, t95, t99 = (stats.t.ppf(q, n-1) for q in (0.95, 0.975, 0.995))
    ci_90 = (mean - t90*se, mean + t90*se)
    ci_95 = (mean - t95*se, mean + t95*se)
    ci_99 = (mean - t99*se, mean + t99*se)

    print("DIFFERENT ALPHA LEVELS")
    print("=" * 50)
    print()
    print(f"Sample mean: {mean:.2f}")
    print(f"P-value (vs. H₀: μ = 100): {p:.4f}")
    print()
    print("CIs at different levels:")
    print(f"  90% CI: [{ci_90[0]:.2f}, {ci_90[1]:.2f}]  (excludes 100: {ci_90[0] > 100 or ci_90[1] < 100})")
    print(f"  95% CI: [{ci_95[0]:.2f}, {ci_95[1]:.2f}]  (excludes 100: {ci_95[0] > 100 or ci_95[1] < 100})")
    print(f"  99% CI: [{ci_99[0]:.2f}, {ci_99[1]:.2f}]  (excludes 100: {ci_99[0] > 100 or ci_99[1] < 100})")
    print()
    print("MATCHING:")
    print(f"  p < 0.10: {p < 0.10} ↔ 90% CI excludes 100: {ci_90[0] > 100 or ci_90[1] < 100}")
    print(f"  p < 0.05: {p < 0.05} ↔ 95% CI excludes 100: {ci_95[0] > 100 or ci_95[1] < 100}")
    print(f"  p < 0.01: {p < 0.01} ↔ 99% CI excludes 100: {ci_99[0] > 100 or ci_99[1] < 100}")


different_alpha_levels()
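A related way to see the correspondence: under a normal approximation, the confidence level at which the interval endpoint lands exactly on the null is 1 − p. A minimal sketch (the numbers are illustrative, not from the simulation above):

def ci_level_touching_null(mean, se, null):
    """
    Under a normal approximation, the CI whose endpoint lands
    exactly on the null has confidence level 1 - p.
    """
    z = abs(mean - null) / se
    p = 2 * (1 - stats.norm.cdf(z))

    print(f"P-value: {p:.4f}")
    print(f"The {1 - p:.1%} CI just touches {null}:")
    print("  narrower CIs exclude it, wider CIs include it")


# Illustrative numbers, not from the simulation above
ci_level_touching_null(mean=106.5, se=3.0, null=100)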

Resolution Guide

def resolution_guide():
    """
    Guide for resolving apparent CI/p-value conflicts.
    """
    print("RESOLUTION GUIDE")
    print("=" * 60)

    scenarios = {
        "Group CIs overlap but difference is significant": {
            "why": "Comparing apples to oranges",
            "fix": "Look at CI for the DIFFERENCE, not individual group CIs"
        },
        "One-sided p < 0.05 but 95% CI includes null": {
            "why": "Mixing one-sided and two-sided",
            "fix": "Use one-sided CI or 90% two-sided CI for one-sided test"
        },
        "P-value says significant but CI seems to include null": {
            "why": "Possibly different alpha levels",
            "fix": "Match CI confidence level to alpha (95% CI ↔ α = 0.05)"
        },
        "Different software gives different results": {
            "why": "Different approximations or methods",
            "fix": "Check what exact method each uses; prefer same method for both"
        },
        "CI excludes null but effect seems trivial": {
            "why": "This is not a conflict - it's the significance/importance gap",
            "fix": "CI and p agree on significance; you're questioning practical significance"
        }
    }

    for scenario, resolution in scenarios.items():
        print(f"\nScenario: {scenario}")
        print(f"  Why: {resolution['why']}")
        print(f"  Fix: {resolution['fix']}")


resolution_guide()

Checking for True Agreement

def check_agreement(diff, se, null=0, alpha=0.05, sided='two'):
    """
    Check that CI and p-value agree.
    """
    # P-value
    z = (diff - null) / se
    if sided == 'two':
        p = 2 * (1 - stats.norm.cdf(abs(z)))
        z_crit = stats.norm.ppf(1 - alpha/2)
    else:
        # One-sided with H₁: effect > null
        p = 1 - stats.norm.cdf(z)
        z_crit = stats.norm.ppf(1 - alpha)

    # CI
    if sided == 'two':
        ci = (diff - z_crit * se, diff + z_crit * se)
        ci_excludes_null = ci[0] > null or ci[1] < null
    else:
        ci_lower = diff - z_crit * se
        ci = (ci_lower, float('inf'))
        ci_excludes_null = ci_lower > null

    # Agreement check
    p_significant = p < alpha

    print("AGREEMENT CHECK")
    print("-" * 40)
    print(f"Null hypothesis value: {null}")
    print(f"Alpha level: {alpha}")
    print(f"Test type: {sided}-sided")
    print()
    print(f"Effect: {diff:.3f}")
    print(f"Standard Error: {se:.3f}")
    print()
    print(f"P-value: {p:.4f}")
    print(f"  Significant (p < {alpha}): {p_significant}")
    print()
    print(f"CI: [{ci[0]:.3f}, {ci[1] if ci[1] != float('inf') else '∞'}]")
    print(f"  Excludes {null}: {ci_excludes_null}")
    print()

    if p_significant == ci_excludes_null:
        print("✓ AGREEMENT: CI and p-value tell the same story")
    else:
        print("⚠ DISAGREEMENT: Check your calculations!")

    return {'p': p, 'ci': ci, 'agree': p_significant == ci_excludes_null}


# Example
check_agreement(diff=5.2, se=2.1, null=0, alpha=0.05, sided='two')

Common Pitfalls

def common_pitfalls():
    """
    Common mistakes that create apparent conflicts.
    """
    print("COMMON PITFALLS")
    print("=" * 50)

    pitfalls = {
        "Eyeballing overlapping CIs": {
            "mistake": "Assuming overlap means not significant",
            "reality": "CIs can overlap and difference still be significant",
            "prevention": "Always calculate CI for the difference directly"
        },
        "Using different software": {
            "mistake": "Getting p-value from one tool, CI from another",
            "reality": "Methods may differ (exact vs. approximate, etc.)",
            "prevention": "Use same tool/method for both"
        },
        "Rounding errors": {
            "mistake": "CI just touching 0, p-value just under 0.05",
            "reality": "Rounding can make them seem to disagree",
            "prevention": "Use more decimal places for comparison"
        },
        "Wrong CI for the question": {
            "mistake": "Testing difference but looking at individual CIs",
            "reality": "Need CI for what you're testing",
            "prevention": "Match CI to the parameter in your hypothesis"
        }
    }

    for name, details in pitfalls.items():
        print(f"\n{name}:")
        print(f"  Mistake: {details['mistake']}")
        print(f"  Reality: {details['reality']}")
        print(f"  Prevention: {details['prevention']}")


common_pitfalls()
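The rounding pitfall is worth seeing concretely. The values below are contrived to sit right at the boundary (an illustrative sketch, not real data):

def rounding_pitfall():
    """
    Contrived borderline case: after rounding, the CI appears to
    touch 0 and p appears to equal 0.05, yet they agree exactly.
    """
    se = 1.0
    diff = 1.96 * se + 0.001    # just barely past the critical value

    z = diff / se
    p = 2 * (1 - stats.norm.cdf(z))
    ci = (diff - 1.96*se, diff + 1.96*se)

    print(f"Rounded:   p = {p:.2f}, CI = [{ci[0]:.1f}, {ci[1]:.1f}]")    # looks ambiguous
    print(f"Unrounded: p = {p:.5f}, CI = [{ci[0]:.4f}, {ci[1]:.4f}]")    # clearly agree


rounding_pitfall()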

R Implementation

# Checking CI/p-value agreement in R

check_agreement <- function(diff, se, null = 0, alpha = 0.05) {
  cat("AGREEMENT CHECK\n")
  cat(rep("=", 40), "\n\n")

  # P-value (two-sided)
  z <- (diff - null) / se
  p <- 2 * (1 - pnorm(abs(z)))

  # 95% CI
  z_crit <- qnorm(1 - alpha/2)
  ci_low <- diff - z_crit * se
  ci_high <- diff + z_crit * se

  # Check
  p_sig <- p < alpha
  ci_excludes <- (ci_low > null) | (ci_high < null)

  cat(sprintf("Effect: %.3f (SE: %.3f)\n", diff, se))
  cat(sprintf("P-value: %.4f (sig at alpha = %.2f: %s)\n", p, alpha, p_sig))
  cat(sprintf("95%% CI: [%.3f, %.3f] (excludes %.1f: %s)\n",
              ci_low, ci_high, null, ci_excludes))
  cat("\n")

  if (p_sig == ci_excludes) {
    cat("✓ AGREEMENT\n")
  } else {
    cat("⚠ DISAGREEMENT - check calculations!\n")
  }
}

# Usage:
# check_agreement(diff = 5.2, se = 2.1)


Key Takeaway

When CIs and p-values seem to disagree, it's almost always because you're comparing mismatched quantities: overlapping group CIs (when you need the CI for the difference), one-sided tests with two-sided CIs, or different alpha levels. For two-sided tests at α = 0.05, a 95% CI that excludes the null ALWAYS corresponds to p < 0.05. When in doubt, compute both using the same method and verify they match—if they don't, something's wrong with your calculation.


References

  1. https://doi.org/10.1136/bmj.d2304
  2. https://www.jstor.org/stable/2683359
  3. Schenker, N., & Gentleman, J. F. (2001). On judging the significance of differences by examining the overlap between confidence intervals. *The American Statistician*, 55(3), 182-186.
  4. Cumming, G., & Finch, S. (2005). Inference by eye: Confidence intervals and how to read pictures of data. *American Psychologist*, 60(2), 170-180.
  5. Greenland, S., et al. (2016). Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations. *European Journal of Epidemiology*, 31(4), 337-350.

Frequently Asked Questions

Can a 95% CI exclude 0 but p > 0.05?
No, for a two-sided test of H₀: μ = 0. They're mathematically linked. If you see this apparent conflict, check if you're comparing the right CI to the right test.
Why do overlapping confidence intervals sometimes show significant difference?
Individual group CIs are wider than the CI for the difference. Two groups can have overlapping CIs for their means but a significant difference. Don't use overlapping CIs as a test.
What if I have a one-sided test but a two-sided CI?
They won't match at the same alpha level. A 95% two-sided CI corresponds to two one-sided tests at α = 0.025. To match a one-sided test at α = 0.05, use a one-sided 95% CI; its bound equals the corresponding bound of a two-sided 90% CI.

