
When to Say 'Inconclusive': Decision Rules That Build Trust

Knowing when to call an experiment inconclusive is a skill. Learn decision frameworks for ambiguous results that maintain credibility and enable good business decisions.


Quick Hits

  • Inconclusive isn't failure—it's an honest answer when the data doesn't speak clearly
  • The CI includes both meaningful positive and meaningful negative? That's inconclusive.
  • Don't spin inconclusive as positive; don't force a narrative the data doesn't support
  • Options for inconclusive: extend, ship with risk acknowledgment, or abandon
  • Saying 'I don't know' when appropriate builds more trust than false certainty

TL;DR

Inconclusive is a legitimate conclusion, not a failure. Call results inconclusive when the confidence interval includes both meaningful positive and negative effects—the data genuinely doesn't tell you which way it goes. Don't spin inconclusive as positive ("directionally promising!") or dismiss it as negative ("probably doesn't work"). Present options: extend the experiment, ship accepting uncertainty, or abandon. Honest inconclusive calls build trust; forced narratives destroy it.


What "Inconclusive" Actually Means

The Definition

Inconclusive = The data is consistent with multiple importantly different outcomes.

         ← Negative →   ← No Effect →   ← Positive →

Conclusive (Negative):   |---CI---|
                                   ↑ Zero

Conclusive (Positive):                          |---CI---|
                                   ↑ Zero

Inconclusive:                |-------CI-------|
                                   ↑ Zero

The Key Distinction

| Result Type | Confidence Interval | What You Know |
|---|---|---|
| Conclusive positive | Entirely above zero | Effect is almost certainly positive |
| Conclusive negative | Entirely below zero | Effect is almost certainly negative |
| Conclusive null | Narrow, centered on zero | Any effect is too small to matter |
| Inconclusive | Wide, includes zero | Could be positive, negative, or zero |

When to Call It Inconclusive

Decision Framework

def classify_result(ci_lower, ci_upper, mde, zero=0):
    """
    Classify experiment result as conclusive or inconclusive.

    Parameters:
    -----------
    ci_lower : float
        Lower bound of confidence interval
    ci_upper : float
        Upper bound of confidence interval
    mde : float
        Minimum detectable/important effect
    zero : float
        The null hypothesis value (usually 0)
    """
    # Conclusive positive: CI entirely above zero
    if ci_lower > zero:
        if ci_lower > mde:
            return "CONCLUSIVE: Strong positive (CI above MDE)"
        else:
            return "CONCLUSIVE: Positive (CI above zero but includes below MDE)"

    # Conclusive negative: CI entirely below zero
    if ci_upper < zero:
        return "CONCLUSIVE: Negative (CI entirely below zero)"

    # Now CI includes zero - is it inconclusive or conclusive null?
    ci_width = ci_upper - ci_lower

    # If CI is narrow and centered near zero, it's a conclusive null
    if ci_width < mde and abs(ci_lower + ci_upper) / 2 < mde / 2:
        return "CONCLUSIVE: No meaningful effect (narrow CI around zero)"

    # Otherwise, it's inconclusive
    if ci_lower < -mde / 2 and ci_upper > mde / 2:
        return "INCONCLUSIVE: CI includes both meaningful positive and negative"
    elif ci_lower < zero < ci_upper:
        return "INCONCLUSIVE: CI includes zero; direction uncertain"
    else:
        return "BORDERLINE: Close to conclusive but not quite"


# Examples
print("Result Classification Examples")
print("=" * 60)

examples = [
    ("Clear positive", 0.02, 0.06, 0.02),
    ("Clear negative", -0.05, -0.01, 0.02),
    ("Inconclusive (wide)", -0.02, 0.04, 0.02),
    ("Inconclusive (includes zero)", -0.01, 0.03, 0.02),
    ("Conclusive null (narrow)", -0.005, 0.005, 0.02),
    ("Positive but below MDE", 0.005, 0.015, 0.02),
]

for name, ci_l, ci_u, mde in examples:
    result = classify_result(ci_l, ci_u, mde)
    print(f"\n{name}:")
    print(f"  CI: [{ci_l:+.1%}, {ci_u:+.1%}], MDE: {mde:.1%}")
    print(f"  → {result}")

The Critical Questions

Ask yourself:

  1. Does the CI include both positive and negative effects I'd care about?

    • If yes → likely inconclusive
  2. Is the CI narrow enough to rule out meaningful effects?

    • If narrow and near zero → conclusive null
    • If wide → inconclusive
  3. Could I confidently recommend action based on this result?

    • If no → probably inconclusive

What NOT to Do with Inconclusive Results

Don't Spin Positive

The Temptation: "While not statistically significant, the results are directionally positive, suggesting..."

Why It's Wrong: If the CI includes zero and negative values, the data is also consistent with a negative effect. "Directionally positive" implies more certainty than exists.

Better: "The point estimate is positive, but the confidence interval includes both positive and negative effects. We cannot conclude the treatment helped."

Don't Spin Negative

The Temptation: "The experiment showed no significant effect, indicating the treatment doesn't work."

Why It's Wrong: "Not significant" with a wide CI doesn't mean "no effect." It means "we don't know."

Better: "We did not detect a significant effect. However, the confidence interval is wide enough that we also cannot rule out a meaningful positive effect."

Don't Cherry-Pick

The Temptation: "Overall results were inconclusive, but mobile users showed a significant improvement!"

Why It's Wrong: Post-hoc segment mining after inconclusive overall results is classic p-hacking.

Better: Report the overall inconclusive result as primary. Mention the segment finding as exploratory, requiring replication.
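
A quick way to show stakeholders why the segment finding needs replication: the more segments you test, the more likely at least one clears p < 0.05 by chance alone. A minimal sketch of that arithmetic (assuming independent tests and no true effects anywhere):

# Chance of at least one spurious "significant" segment at alpha = 0.05
alpha = 0.05
for k in [1, 3, 5, 10, 20]:
    print(f"{k:2d} segments: {1 - (1 - alpha) ** k:.0%} chance of a false positive")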


How to Present Inconclusive Results

The Honest Framework

## Result: Inconclusive

### What We Observed
- Point estimate: +2.3% lift
- 95% CI: -1.5% to +6.1%
- p-value: 0.23

### What This Means
The confidence interval includes:
- Zero (no effect)
- Values up to our MDE of 2% (small positive effect)
- Values above our MDE (meaningful positive effect)
- Negative values (potential harm)

**In plain language**: The data is consistent with the treatment helping, hurting, or doing nothing. We cannot tell which.

### Why This Happened
We achieved 62% of planned sample size due to the feature freeze.
At current sample size, we have ~45% power to detect our MDE.

### Options

**Option A: Extend experiment 3 weeks**
- Pro: Likely conclusive result
- Con: 3-week delay
- Probability of each outcome (estimated):
  - Conclusive positive: ~35%
  - Conclusive negative: ~15%
  - Still inconclusive: ~50%

**Option B: Ship now**
- Pro: No delay, capture possible upside
- Con: ~25% probability effect is actually negative
- Risk: Limited; worst case appears to be small harm

**Option C: Abandon and reallocate**
- Pro: Free up resources immediately
- Con: Miss potential positive effect (35% chance it's real)

### My Recommendation
[Option A/B/C] because [reasoning based on business context]
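
Where might the outcome probabilities in Option A come from? One approach is a quick simulation under a normal approximation: back out the standard error from the current CI, shrink it for the added sample, and simulate final estimates. The sketch below is deliberately simplified (it treats the current estimate's uncertainty as the prior over true effects and ignores that the final analysis reuses the interim data), so the percentages in the template above are illustrative placeholders, not outputs of this exact code.

import numpy as np

rng = np.random.default_rng(0)

# Interim result from the write-up above
point_est, ci_lower, ci_upper = 0.023, -0.015, 0.061
se_now = (ci_upper - ci_lower) / (2 * 1.96)  # back out the standard error

# Going from 62% to 100% of the planned sample shrinks the SE by sqrt(0.62)
se_final = se_now * np.sqrt(0.62)

# Draw plausible true effects from the current uncertainty, then simulate
# the final estimate (and its CI) for each
true_effect = rng.normal(point_est, se_now, 200_000)
final_est = rng.normal(true_effect, se_final)
half_width = 1.96 * se_final

print(f"P(conclusive positive): {(final_est - half_width > 0).mean():.0%}")
print(f"P(conclusive negative): {(final_est + half_width < 0).mean():.0%}")
print(f"P(still inconclusive):  {(abs(final_est) <= half_width).mean():.0%}")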

Visualizing Inconclusive Results

import matplotlib.pyplot as plt
import numpy as np


def visualize_inconclusive(point_est, ci_lower, ci_upper, mde):
    """
    Create a clear visualization of an inconclusive result.
    """
    fig, ax = plt.subplots(figsize=(10, 4))

    # Reference lines
    ax.axvline(x=0, color='gray', linestyle='-', linewidth=2, label='No effect')
    ax.axvline(x=mde, color='green', linestyle='--', linewidth=2, label=f'MDE (+{mde:.1%})')
    ax.axvline(x=-mde, color='red', linestyle='--', linewidth=2, label=f'MDE (-{mde:.1%})')

    # Confidence interval
    ax.barh(0, ci_upper - ci_lower, left=ci_lower, height=0.3,
            color='steelblue', alpha=0.4, label='95% CI')

    # Point estimate
    ax.scatter(point_est, 0, color='steelblue', s=150, zorder=5,
               label=f'Point estimate ({point_est:+.1%})')

    # Annotations
    ax.annotate('Harm zone', xy=(-mde/2, 0.25), ha='center', fontsize=10, color='red')
    ax.annotate('Help zone', xy=(mde*1.5, 0.25), ha='center', fontsize=10, color='green')
    ax.annotate('Uncertain zone', xy=(0, 0.25), ha='center', fontsize=10, color='gray')

    # Formatting
    ax.set_xlim(-0.08, 0.10)
    ax.set_ylim(-0.5, 0.5)
    ax.set_xlabel('Effect Size')
    ax.set_yticks([])
    ax.legend(loc='upper right', fontsize=9)
    ax.set_title('Inconclusive Result: CI Spans Multiple Outcome Zones', fontsize=12)

    plt.tight_layout()
    return fig


# Example
# visualize_inconclusive(0.023, -0.015, 0.061, 0.02)
# plt.show()

Decision Rules for Stakeholders

Pre-Specified Decision Framework

Define this before the experiment runs:

## Pre-Specified Decision Rules

### Conclusive Positive
**Condition**: CI lower bound > 0
**Action**: Ship to 100%

### Strong Positive
**Condition**: CI lower bound > MDE
**Action**: Ship with high confidence

### Conclusive Negative
**Condition**: CI upper bound < 0
**Action**: Roll back immediately

### Conclusive Null
**Condition**: CI falls within [-1%, +1%] (narrow, centered on zero)
**Action**: No meaningful effect; decide based on other factors

### Inconclusive
**Condition**: CI includes zero AND meaningful positive/negative
**Action**: Choose from options:
  - Extend if: High stakes decision, time available
  - Ship if: Low risk, directionally positive
  - Abandon if: Opportunity cost high, signal weak
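
Pre-specified rules work best when they are encoded somewhere that can't quietly change after the data arrives. A minimal sketch of the rules above as a function (the action strings and the 1% null band are illustrative choices):

def decide(ci_lower, ci_upper, mde, null_band=0.01):
    """Map a 95% CI to the pre-specified action, checked in priority order."""
    if ci_lower > mde:
        return "Ship with high confidence"
    if ci_lower > 0:
        return "Ship to 100%"
    if ci_upper < 0:
        return "Roll back immediately"
    if -null_band < ci_lower and ci_upper < null_band:
        return "No meaningful effect; decide based on other factors"
    return "Inconclusive: extend, ship, or abandon based on context"

print(decide(-0.015, 0.061, mde=0.02))  # Inconclusive: extend, ship, or abandon...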

The Risk-Based Decision Matrix

def inconclusive_decision_matrix(ci_lower, ci_upper, mde,
                                   business_stakes, time_pressure):
    """
    Recommend action for inconclusive result based on context.
    """
    # Calculate probability estimates (rough)
    ci_width = ci_upper - ci_lower
    point_est = (ci_lower + ci_upper) / 2

    # Rough probability the effect is positive: the fraction of the CI
    # above zero, treating effect values as uniform across the interval
    prob_positive = max(0, min(1, ci_upper / ci_width))

    # Rough probability the effect exceeds the MDE, under the same
    # uniform-over-CI treatment
    prob_meaningful = max(0, min(1, (ci_upper - mde) / ci_width)) if ci_upper > mde else 0

    # Decision logic
    if business_stakes == "high" and time_pressure == "low":
        recommendation = "EXTEND"
        rationale = "High stakes justify waiting for clarity"
    elif business_stakes == "low" and prob_positive > 0.6:
        recommendation = "SHIP"
        rationale = f"Low stakes + {prob_positive:.0%} chance of positive effect"
    elif prob_meaningful < 0.2 and point_est < mde / 2:
        recommendation = "ABANDON"
        rationale = f"Only {prob_meaningful:.0%} chance of meaningful effect"
    elif time_pressure == "high" and point_est > 0:
        recommendation = "SHIP (with monitoring)"
        rationale = "Time pressure + directionally positive"
    else:
        recommendation = "EXTEND or ABANDON"
        rationale = "Judgment call based on opportunity cost"

    return {
        'recommendation': recommendation,
        'rationale': rationale,
        'prob_positive': prob_positive,
        'prob_meaningful': prob_meaningful
    }


# Example
result = inconclusive_decision_matrix(
    ci_lower=-0.015,
    ci_upper=0.061,
    mde=0.02,
    business_stakes="medium",
    time_pressure="low"
)

print("Inconclusive Result Decision")
print("=" * 40)
print(f"Recommendation: {result['recommendation']}")
print(f"Rationale: {result['rationale']}")
print(f"P(positive): {result['prob_positive']:.0%}")
print(f"P(meaningful): {result['prob_meaningful']:.0%}")

Common Scenarios

Scenario 1: Wide CI, Directionally Positive

Result: +3.1% lift, CI: -2.0% to +8.2%, p = 0.23

Wrong approach: "While not significant, results are promising..."

Right approach:

## Inconclusive: Wide CI Prevents Conclusion

The point estimate (+3.1%) is positive, but the confidence interval
(-2.0% to +8.2%) is too wide to draw conclusions.

**The data is consistent with any of**:
- A meaningful positive effect (+8%)
- No effect (0%)
- A small negative effect (-2%)

**Options**:
1. Extend 2 weeks to narrow the CI
2. Ship accepting the ~20% chance of small negative effect
3. Abandon if opportunity cost is high
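
The ~20% figure in option 2 follows from the uniform-over-CI heuristic used earlier: the negative slice of the interval is 2.0 points of its 10.2-point width. A normal approximation puts less mass in the tail (roughly 12%); whichever heuristic you use, say which one.

ci_lower, ci_upper = -0.020, 0.082
print(f"P(negative), uniform heuristic: {abs(ci_lower) / (ci_upper - ci_lower):.0%}")  # ~20%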

Scenario 2: Near-Significant, Below MDE

Result: +1.8% lift, CI: +0.1% to +3.5%, p = 0.04

Wrong approach: "Significant positive effect!"

Right approach:

## Conclusive Positive, But Below MDE

The effect is statistically significant (CI excludes zero), but small.
Our pre-specified MDE was 2%; the point estimate (1.8%) is below this.

**What we know**:
- Effect is positive (CI entirely above zero)
- Effect is likely 0.1% to 3.5%
- Point estimate (1.8%) is below our target

**Decision**: Do we ship a feature with ~1.8% lift?
This is a business question, not a statistics question.
The data says the effect is real but probably small.

Scenario 3: Underpowered, High Variance Metric

Result: +$0.45/user, CI: -$1.20 to +$2.10, p = 0.58

Wrong approach: "Revenue was not significantly affected."

Right approach:

## Inconclusive: Insufficient Precision for Revenue

We could not detect a significant revenue effect. However, the wide
CI (-$1.20 to +$2.10) means we also cannot rule out meaningful
positive or negative effects.

**Why so uncertain?**
Revenue is high-variance. Our sample size provided adequate power
for conversion (detected +2.1% lift) but not for revenue.

**Options**:
1. Rely on conversion result (significant positive) for decision
2. Run longer specifically to measure revenue impact
3. Accept revenue uncertainty; monitor post-launch
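
To quantify "revenue is high-variance," a standard two-sample size calculation shows how much more traffic a noisy metric needs. The inputs below are hypothetical (a $30 per-user revenue SD against a $0.50 target effect, versus a binary conversion metric):

from statistics import NormalDist

def n_per_arm(sigma, delta, alpha=0.05, power=0.80):
    """Two-sample size: n = 2 * (z_{alpha/2} + z_beta)^2 * sigma^2 / delta^2."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return 2 * (z_alpha + z_beta) ** 2 * sigma ** 2 / delta ** 2

print(f"Revenue:    {n_per_arm(sigma=30.0, delta=0.50):,.0f} users per arm")
print(f"Conversion: {n_per_arm(sigma=0.45, delta=0.02):,.0f} users per arm")

With these made-up inputs, revenue needs roughly seven times the sample that conversion does, which is exactly the pattern in this scenario.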

Building Trust Through Honest Inconclusive Calls

Why Honesty Pays Off

| Short-Term | Long-Term |
|---|---|
| "Results inconclusive" feels disappointing | "This analyst tells the truth" |
| Stakeholders wanted a clear answer | Stakeholders trust your analysis |
| Pressure to spin positive | No cleanup after over-promising |
| One ambiguous experiment | Reputation for integrity |

The Credibility Flywheel

Tell the truth about inconclusive results
              ↓
Stakeholders learn you're honest
              ↓
Future positive results are believed
              ↓
Your recommendations carry weight
              ↓
You're asked for input on important decisions
              ↓
You tell the truth about inconclusive results
              ↓
(credibility compounds)

R Implementation

# Function to classify and present inconclusive results
present_result <- function(point_est, ci_lower, ci_upper, mde) {
  # Classify
  if (ci_lower > 0) {
    classification <- "CONCLUSIVE POSITIVE"
  } else if (ci_upper < 0) {
    classification <- "CONCLUSIVE NEGATIVE"
  } else if (ci_upper - ci_lower < mde && abs(point_est) < mde/2) {
    classification <- "CONCLUSIVE NULL"
  } else {
    classification <- "INCONCLUSIVE"
  }

  # Present
  cat("Result Classification:", classification, "\n")
  cat(paste(rep("=", 50), collapse = ""), "\n\n")

  cat("Point estimate:", sprintf("%.1f%%\n", point_est * 100))
  cat("95% CI: [", sprintf("%.1f%%", ci_lower * 100), ", ",
      sprintf("%.1f%%", ci_upper * 100), "]\n\n", sep = "")

  if (classification == "INCONCLUSIVE") {
    cat("The confidence interval includes:\n")
    if (ci_lower < 0) cat("  - Negative effects (potential harm)\n")
    cat("  - Zero (no effect)\n")
    if (ci_upper > mde) cat("  - Effects above MDE (meaningful help)\n")

    cat("\nThis result cannot distinguish between help, harm, or no effect.\n")
    cat("\nOptions:\n")
    cat("  1. Extend experiment for clarity\n")
    cat("  2. Ship accepting uncertainty\n")
    cat("  3. Abandon and reallocate\n")
  }
}

# Example
present_result(
  point_est = 0.023,
  ci_lower = -0.015,
  ci_upper = 0.061,
  mde = 0.02
)


Key Takeaway

Inconclusive is a valid, valuable conclusion—not a failure. Call results inconclusive when the confidence interval includes both meaningful positive and meaningful negative effects. Don't spin it as positive ("directionally good!") or negative ("probably doesn't work"). Present it honestly: "We can't rule out no effect, but we also can't rule out a positive effect. Here are our options: extend, ship with risk acknowledgment, or abandon." This builds trust because stakeholders learn you'll tell them the truth rather than what they want to hear. Over time, that credibility is worth more than any single inflated finding.



Frequently Asked Questions

Won't calling results 'inconclusive' make me look bad?
The opposite. Forcing a narrative from ambiguous data and being wrong makes you look bad. Honestly saying 'the data doesn't give us a clear answer' and presenting options shows integrity and expertise. Stakeholders respect honesty.
How do I present inconclusive results without losing momentum?
Present it as useful information: 'We learned the effect isn't large enough to detect easily. Options: extend to get clarity, ship accepting the uncertainty, or reallocate resources.' Inconclusive still informs decisions—it tells you the effect isn't obviously big or obviously bad.
At what point is 'not significant' actually 'no effect'?
When the CI is narrow and centered on zero, you can say 'we're confident there's no meaningful effect.' When the CI is wide and includes both positive and negative meaningful values, say 'inconclusive.' The width matters as much as where zero falls.
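
The "narrow CI" rule in that last answer is an informal equivalence test. The formal version is TOST (two one-sided tests): declare "no meaningful effect" only if the 90% CI sits entirely inside your margin of practical equivalence. A minimal sketch under a normal approximation, with an illustrative 2% margin:

from statistics import NormalDist

def tost_equivalent(point_est, se, margin, alpha=0.05):
    """Equivalence holds if the (1 - 2*alpha) CI lies within +/- margin."""
    z = NormalDist().inv_cdf(1 - alpha)  # one-sided critical value
    return (point_est - z * se > -margin) and (point_est + z * se < margin)

print(tost_equivalent(point_est=0.000, se=0.004, margin=0.02))  # True: conclusive null
print(tost_equivalent(point_est=0.023, se=0.019, margin=0.02))  # False: cannot claim null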

