DAGs for Analysts: Drawing Your Assumptions Before You Analyze
How directed acyclic graphs help analysts identify confounders, avoid collider bias, and choose the right variables to control for in causal analysis.
Quick Hits
- A DAG is a diagram of your causal assumptions: nodes are variables, directed edges represent causal relationships, and no cycles are allowed
- DAGs make implicit assumptions explicit, forcing you to commit to a causal story before looking at results
- The backdoor criterion tells you exactly which variables to adjust for -- and which to leave alone -- to get an unbiased causal estimate
- Conditioning on a collider (a variable caused by two others) opens a spurious path and creates bias -- one of the most counterintuitive results in statistics
- You do not need to believe your DAG is perfectly correct; even an approximate DAG is better than no structural thinking at all
TL;DR
Before you run any causal analysis, you should draw a picture. A directed acyclic graph (DAG) maps out your assumptions about which variables cause which. This is not a formality -- it is the step that determines which variables you should control for, which you must leave alone, and whether your causal estimate is identified at all. This post explains how to build a DAG, how to read it, how to apply the backdoor criterion, and how to avoid the collider trap that catches even experienced analysts.
Why You Need a DAG
Every causal analysis makes assumptions about the data-generating process. When you "control for" a variable in a regression, you are making a causal claim: this variable is a confounder that needs to be adjusted for. When you leave a variable out, you are claiming it is either irrelevant or already accounted for.
The problem is that these assumptions are usually implicit. An analyst includes covariates based on habit, availability, or a vague sense that "more controls are better." But controlling for the wrong variable can introduce bias rather than remove it. A DAG makes all of these decisions explicit and auditable.
The payoff:
- A DAG tells you the minimal sufficient set of variables to adjust for.
- It warns you when adjusting for a variable would create bias (collider bias).
- It reveals when causal identification is impossible regardless of your statistical method.
- It facilitates communication: colleagues can critique your assumptions by looking at the graph, not your code.
Building a DAG
Nodes and Edges
- Each node is a variable: treatment, outcome, covariates, unobserved variables.
- Each directed edge (arrow) represents a direct causal effect: X → Y means X causes Y.
- No cycles: If X causes Y, Y cannot cause X (at least not in the same time period). This is the "acyclic" constraint.
Steps to Construct a DAG
- Define treatment and outcome. These are the two nodes you care about most.
- List plausible causes of the outcome. What variables affect the outcome Y apart from the treatment T?
- List plausible causes of the treatment. What determines who receives T?
- Identify common causes. Variables that cause both T and Y are confounders.
- Add edges based on domain knowledge. Does variable A plausibly cause variable B? If so, draw the arrow.
- Include unobserved variables. If you believe an unmeasured variable exists, include it as a node (often drawn with a dashed circle). This makes the threat explicit.
A Product Analytics Example
You want to estimate the effect of a "power user" badge on purchase frequency.
Nodes: Badge (T), Purchase Frequency (Y), Prior Engagement (C1), Account Age (C2), User Motivation (U, unobserved).
User Motivation (U) --> Prior Engagement (C1)
User Motivation (U) --> Purchase Frequency (Y)
Prior Engagement (C1) --> Badge (T)
Account Age (C2) --> Prior Engagement (C1)
Account Age (C2) --> Purchase Frequency (Y)
Badge (T) --> Purchase Frequency (Y)
This DAG tells you:
- Prior Engagement and Account Age are confounders. They create backdoor paths from Badge to Purchase Frequency.
- User Motivation is an unmeasured confounder. If it exists and is strong, no amount of adjusting for observed variables will fully remove bias.
- You need to adjust for Prior Engagement and Account Age to block the backdoor paths through them.
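To make the example concrete, here is a minimal sketch, assuming Python with the networkx library, that encodes this DAG and enumerates its backdoor paths. The node names are illustrative shorthand for the variables above.

```python
import networkx as nx

# Each edge is one explicit causal claim from the example DAG.
dag = nx.DiGraph([
    ("U_motivation", "C1_prior_engagement"),
    ("U_motivation", "Y_purchases"),
    ("C1_prior_engagement", "T_badge"),
    ("C2_account_age", "C1_prior_engagement"),
    ("C2_account_age", "Y_purchases"),
    ("T_badge", "Y_purchases"),
])

treatment, outcome = "T_badge", "Y_purchases"

# A backdoor path is any treatment-outcome path whose first edge points
# INTO the treatment. Walk the undirected skeleton and check orientation.
skeleton = dag.to_undirected()
for path in nx.all_simple_paths(skeleton, treatment, outcome):
    if dag.has_edge(path[1], path[0]):  # first edge points into the treatment
        print("Backdoor path:", " -- ".join(path))
```

Running this prints the two backdoor paths that run through Prior Engagement (one via User Motivation, one via Account Age), matching the reading above.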
Reading a DAG: Paths, Blocking, and d-Separation
Paths
A path is any sequence of edges connecting two nodes, regardless of direction. Between treatment T and outcome Y, there may be:
- Causal paths: All edges point from T toward Y (the front-door path). These are the effect you want to measure.
- Backdoor paths: The path starts with an arrow pointing into T. These create non-causal associations (confounding).
Blocking a Path
A path is blocked if:
- It contains a non-collider that you condition on (adjust for).
- It contains a collider that you do not condition on (and no descendant of the collider is conditioned on).
d-Separation
Two variables are d-separated (conditionally independent) given a set Z if every path between them is blocked by Z. If T and Y are d-separated by Z on all non-causal paths, then conditioning on Z removes confounding.
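The blocking rule translates almost line for line into code. The sketch below is our own helper, building on the networkx graph from the earlier example; it checks whether a single path is blocked by a conditioning set, using exactly the two rules above.

```python
import networkx as nx

def path_is_blocked(dag, path, conditioned):
    """True if `path` (a list of nodes) is blocked given the set `conditioned`."""
    conditioned = set(conditioned)
    for i in range(1, len(path) - 1):
        prev_node, node, next_node = path[i - 1], path[i], path[i + 1]
        # A collider has both neighbouring path edges pointing into it.
        is_collider = dag.has_edge(prev_node, node) and dag.has_edge(next_node, node)
        if is_collider:
            # Rule 2: a collider blocks unless it, or a descendant, is conditioned on.
            if not ({node} | nx.descendants(dag, node)) & conditioned:
                return True
        else:
            # Rule 1: a non-collider blocks when it is conditioned on.
            if node in conditioned:
                return True
    return False
```

Two variables are then d-separated by Z when every path between them is blocked in this sense. For real work, prefer a tested implementation (networkx ships d-separation utilities, and DAGitty does this for you), but writing the rule out once is a good way to internalize it.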
The Backdoor Criterion
Pearl's backdoor criterion provides a precise graphical rule for choosing adjustment sets.
A set Z satisfies the backdoor criterion relative to treatment T and outcome Y if:
- No variable in Z is a descendant of T.
- Z blocks every path between T and Y that contains an arrow into T.
If you find such a set Z and can measure all variables in it, you can estimate the causal effect by adjusting for Z.
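Continuing the sketch from the earlier sections (reusing the illustrative `dag` and the `path_is_blocked` helper defined above), both conditions can be checked mechanically for any candidate set:

```python
import networkx as nx

def satisfies_backdoor(dag, treatment, outcome, candidate):
    candidate = set(candidate)
    # Condition 1: nothing in the set may be a descendant of the treatment.
    if candidate & nx.descendants(dag, treatment):
        return False
    # Condition 2: every path entering the treatment through a backdoor must be blocked.
    skeleton = dag.to_undirected()
    for path in nx.all_simple_paths(skeleton, treatment, outcome):
        backdoor = dag.has_edge(path[1], path[0])
        if backdoor and not path_is_blocked(dag, path, candidate):
            return False
    return True

print(satisfies_backdoor(dag, "T_badge", "Y_purchases",
                         {"C1_prior_engagement", "C2_account_age"}))  # True
print(satisfies_backdoor(dag, "T_badge", "Y_purchases", set()))       # False
```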
Practical tip: Use DAGitty (dagitty.net) to draw your DAG and let the software compute the minimal sufficient adjustment set. This avoids manual errors and handles complex graphs.
The Collider Trap
A collider on a path is a variable where two arrows converge: A → C ← B. On this path, C is a collider.
Key rule: Conditioning on a collider opens a path that was previously blocked. This is the opposite of what happens with confounders, and it is deeply counterintuitive.
Example: The Talent-Looks Collider
Suppose being hired at a company requires either strong talent or good looks (Talent → Hired ← Looks). Among all people, talent and looks are unrelated. But among employees (conditioning on being hired), a negative association emerges: if someone was hired despite average looks, they must be talented. This is collider bias.
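A quick simulation makes the effect visible. In this sketch (plain numpy; the hiring threshold is an arbitrary illustrative choice), talent and looks are generated independently, yet they become negatively correlated once you look only at the hired group:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

talent = rng.normal(size=n)
looks = rng.normal(size=n)        # independent of talent by construction

# Hiring requires being strong on at least one dimension: Talent -> Hired <- Looks.
hired = (talent > 1.0) | (looks > 1.0)

print("corr(talent, looks), everyone:   ", round(np.corrcoef(talent, looks)[0, 1], 3))
print("corr(talent, looks), hired only: ", round(np.corrcoef(talent[hired], looks[hired])[0, 1], 3))
```

The first number is roughly zero; the second is clearly negative, even though nothing in the data-generating process links talent to looks. Conditioning on the collider created the association.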
In Product Analytics
Suppose you are studying whether a new recommendation algorithm (T) improves purchases (Y). Both the algorithm and inherent user interest (U) drive clicks (C). If you control for clicks (a collider), you create a spurious association between the algorithm and user interest, biasing your estimate.
Rule of thumb: Never control for a variable that is a descendant (consequence) of both the treatment and the outcome, unless you have a specific structural reason to do so.
Unobserved Variables and the Limits of DAGs
DAGs include unobserved variables as nodes. An unobserved confounder creates a backdoor path that cannot be blocked by conditioning. When this happens, the DAG tells you that standard adjustment methods are insufficient and you need a different strategy:
- Instrumental variables to bypass unmeasured confounders.
- Regression discontinuity if a threshold exists.
- Sensitivity analysis to bound the bias (a small example is sketched at the end of this section).
A DAG that shows an unblockable backdoor path is not a failure -- it is an honest assessment that prevents you from producing a biased estimate and calling it causal.
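One common way to make the sensitivity-analysis step concrete is the E-value of VanderWeele and Ding: the minimum strength of association, on the risk-ratio scale, that an unmeasured confounder would need with both treatment and outcome to fully explain away an observed effect. A small sketch of the calculation (the observed risk ratio of 1.8 is purely illustrative):

```python
import math

def e_value(risk_ratio):
    """E-value (VanderWeele & Ding): minimum risk ratio an unmeasured confounder
    would need with both treatment and outcome to explain away the observed effect."""
    rr = risk_ratio if risk_ratio >= 1 else 1 / risk_ratio
    return rr + math.sqrt(rr * (rr - 1))

# Illustrative observed effect: badge holders purchase 1.8x as often.
print(round(e_value(1.8), 2))  # 3.0
```

An E-value of 3.0 means a hidden confounder would need to be associated with both the badge and purchasing by a risk ratio of at least 3 to fully account for the observed effect; whether a confounder that strong is plausible is a domain judgment, not a statistical one.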
DAGs in Practice: A Workflow
- Before touching data: Convene your team and draw the DAG on a whiteboard. Include treatment, outcome, all plausible confounders, and any unobserved variables.
- Identify adjustment sets. Use the backdoor criterion (manually or with DAGitty) to determine what to control for.
- Check for colliders. Verify that no variable in your adjustment set is a collider or descendant of a collider.
- Assess identification. If an unblockable backdoor path exists, standard adjustment won't work. Consider alternative methods.
- Document the DAG. Include it in your analysis report. Colleagues should be able to see your causal assumptions and critique them.
- Sensitivity analysis. If unobserved confounders are plausible, quantify how much bias they would need to create to change your conclusion.
Common Mistakes
"Kitchen sink" regression. Including every available variable as a control without a DAG. You will almost certainly condition on colliders or mediators, introducing bias.
Confusing prediction with causal inference. A model that predicts well is not the same as a model that estimates the causal effect of T on Y. Good predictors can be colliders or mediators.
Ignoring unobserved variables. A DAG with only observed variables gives false confidence. Include unobserved nodes to represent threats.
Static DAGs for dynamic processes. If variables change over time and cause each other at different time points (time-varying confounding), a standard DAG may not capture the full structure. Consider time-indexed DAGs or causal models for longitudinal data.
Treating the DAG as ground truth. A DAG is a model of your assumptions, not a fact about the world. It can be wrong. But a wrong DAG that you can criticize and improve is better than no DAG and unchecked implicit assumptions.
Summary
DAGs are the single most useful pre-analysis tool for causal inference. They force you to make assumptions explicit, they provide a formal criterion for choosing control variables, and they protect against the counterintuitive trap of collider bias. Drawing a DAG takes 15 minutes. Not drawing one can cost you months of biased analysis.
For the full causal inference toolkit and how each method relates to the DAG framework, see our causal inference overview.
References
- https://ftp.cs.ucla.edu/pub/stat_ser/r350.pdf
- https://www.dagitty.net/
- https://academic.oup.com/ije/article/31/1/163/655748
Frequently Asked Questions
Do I need to know the exact causal structure to draw a DAG?
No. A DAG encodes your best current assumptions, not proven facts. Even an approximate DAG is better than leaving assumptions implicit, and a DAG that colleagues can see and critique is far easier to improve than unstated assumptions buried in your model code.
What software can I use to draw and analyze DAGs?
DAGitty (dagitty.net) is a free browser-based tool that lets you draw a DAG and automatically computes minimal sufficient adjustment sets using the backdoor criterion.
How do I know if I should include a variable as a node in my DAG?
Include it if domain knowledge suggests it plausibly causes the treatment, the outcome, or both. Include important unmeasured variables as explicit unobserved nodes so the threat they pose is visible.
Key Takeaway
DAGs are the essential pre-analysis step for any causal inference study. They force you to state your causal assumptions explicitly, reveal which variables to adjust for, and protect against the counterintuitive pitfall of collider bias.