Necessary to decide which variables to include in the model
“d” stands for “directional”
Usually we are dealing with more than two variables
Complication: causation flows only in the direction of the arrows, while association can also flow against them
Code
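# Packages used by the plotting code in this section
library(ggdag)
library(tidyverse)

# Pitchfork: pipe (x -> z -> y2), collider (x -> a <- y3), fork (x <- d -> y1)
dagify(
  z ~ x, y2 ~ z,
  a ~ x, a ~ y3,
  x ~ d, y1 ~ d,
  coords = list(
    x = c(x = 1, z = 1.5, y2 = 2, a = 1.5, y3 = 2, d = 1.5, y1 = 2),
    y = c(x = 1, y2 = 1, z = 1, a = 0, y3 = 0, d = 2, y1 = 2)
  )
) %>%
  tidy_dagitty() %>%
  ggdag(text_size = 3, node_size = 5) +
  geom_dag_edges() +
  theme_dag() +
  labs(
    title = "Causal Pitchfork",
    # Without conditioning, only the collider path (x -> a <- y3) is closed
    subtitle = "x is d-connected to y1 and y2 but not to y3"
  ) +
  theme(title = element_text(size = 8))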
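The d-connection statements for the pitchfork can be checked mechanically. The following is a minimal sketch, assuming the `dagitty` package is installed; the object names `g` and `checks` are illustrative, and the edge list simply restates the DAG plotted above.

```r
library(dagitty)

# Pitchfork DAG in dagitty syntax, one edge per statement
g <- dagitty("dag {
  x -> z
  z -> y2
  x -> a
  y3 -> a
  d -> x
  d -> y1
}")

checks <- c(
  pipe_open       = dconnected(g, "x", "y2"),       # x -> z -> y2
  pipe_blocked    = dseparated(g, "x", "y2", "z"),  # conditioning on z
  fork_open       = dconnected(g, "x", "y1"),       # x <- d -> y1
  fork_blocked    = dseparated(g, "x", "y1", "d"),  # conditioning on d
  collider_closed = dseparated(g, "x", "y3"),       # x -> a <- y3
  collider_opened = dconnected(g, "x", "y3", "a")   # conditioning on a
)
print(checks)
```

By the rules of d-separation, all six entries should be `TRUE`: pipes and forks are open until the middle node is conditioned on, while a collider is closed until it is conditioned on.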
Analyzing DAGs: Fork
Code
# Fork: d is a common cause (confounder) of x and y1
med <- dagify(
  x ~ d, y1 ~ d,
  coords = list(
    x = c(x = 1, d = 1.5, y1 = 2),
    y = c(x = 1, d = 2, y1 = 2)
  )
) %>%
  tidy_dagitty() %>%
  mutate(fill = ifelse(name == "d", "Confounder", "variables of interest")) %>%
  ggplot(aes(x = x, y = y, xend = xend, yend = yend)) +
  geom_dag_point(size = 12, aes(color = fill)) +
  geom_dag_edges(show.legend = FALSE) +
  geom_dag_text() +
  theme_dag() +
  theme(legend.title = element_blank(), legend.position = "top")
med
d causes both x and y1
Paths that begin with an arrow pointing into x are called “back-door” paths
Eliminated by randomized experiment! Why?
Controlling for d “blocks” the non-causal association between x and y1 (path x \(\leftarrow\) d \(\rightarrow\) y1)
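A small simulation can make the fork concrete. This is a sketch in base R under assumed linear relationships with unit coefficients and standard normal noise; none of these numbers come from the slides.

```r
set.seed(1)
n <- 1e5

# Fork: d causes both x and y1; x has no effect on y1 (assumed setup)
d  <- rnorm(n)
x  <- d + rnorm(n)
y1 <- d + rnorm(n)

# Naive regression picks up the back-door association through d
naive <- coef(lm(y1 ~ x))[["x"]]

# Controlling for d blocks the back-door path
adjusted <- coef(lm(y1 ~ x + d))[["x"]]

print(round(c(naive = naive, adjusted = adjusted), 2))
```

With these assumptions the naive slope should be close to 0.5 even though x has no effect on y1, and the slope should be close to 0 once d is included.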
Analyzing DAGs: Pipe
Code
# Pipe: x affects y2 only through the mediator z
med <- dagify(
  z ~ x, y2 ~ z,
  coords = list(
    x = c(x = 1, z = 1.5, y2 = 2),
    y = c(x = 1, y2 = 1, z = 1)
  )
) %>%
  tidy_dagitty() %>%
  mutate(fill = ifelse(name == "z", "Mediator", "variables of interest")) %>%
  ggplot(aes(x = x, y = y, xend = xend, yend = yend)) +
  geom_dag_point(size = 7, aes(color = fill)) +
  geom_dag_edges(show.legend = FALSE) +
  geom_dag_text() +
  theme_dag() +
  theme(legend.title = element_blank(), legend.position = "top")
med
x causes y2 through z
Controlling for z blocks the causal association x \(\rightarrow\) z \(\rightarrow\) y2
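The effect of controlling for a mediator can be illustrated with a short simulation. This is a sketch in base R, assuming linear effects of size 1 along the pipe and standard normal noise (illustrative values only).

```r
set.seed(1)
n <- 1e5

# Pipe: x -> z -> y2, no direct arrow from x to y2 (assumed setup)
x  <- rnorm(n)
z  <- x + rnorm(n)
y2 <- z + rnorm(n)

# Total effect of x on y2, which runs entirely through z
total <- coef(lm(y2 ~ x))[["x"]]

# Controlling for the mediator z blocks the causal path
controlled <- coef(lm(y2 ~ x + z))[["x"]]

print(round(c(total = total, controlled = controlled), 2))
```

Under these assumptions the total effect should be close to 1 and the coefficient on x should drop to about 0 once z is in the model, although x does cause y2.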
Analyzing DAGs: Collider
Code
# Collider: x and y are independent causes of a
dagify(
  a ~ x, a ~ y,
  coords = list(
    x = c(x = 1, y = 2, a = 1.5),
    y = c(x = 1, y = 0, a = 0)
  )
) |>
  tidy_dagitty() |>
  mutate(fill = ifelse(name == "a", "Collider", "variables of interest")) |>
  ggplot(aes(x = x, y = y, xend = xend, yend = yend)) +
  geom_dag_point(size = 7, aes(color = fill)) +
  geom_dag_edges(show.legend = FALSE) +
  geom_dag_text() +
  theme_dag() +
  theme(legend.title = element_blank(), legend.position = "top")
x and y cause a
There is no causal relationship between x and y
x and y are uncorrelated unless we control for a: conditioning on the collider opens the path x \(\rightarrow\) a \(\leftarrow\) y
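The collider case can also be simulated. This is a sketch in base R, assuming x and y are independent standard normals and a is their sum plus noise (illustrative values only).

```r
set.seed(1)
n <- 1e5

# Collider: x and y are independent, both cause a (assumed setup)
x <- rnorm(n)
y <- rnorm(n)
a <- x + y + rnorm(n)

# Without a, x and y are unrelated
without_a <- coef(lm(y ~ x))[["x"]]

# Including the collider a induces a spurious association
with_a <- coef(lm(y ~ x + a))[["x"]]

print(round(c(without_a = without_a, with_a = with_a), 2))
```

Under these assumptions the coefficient on x should be close to 0 without a and close to -0.5 with a, even though x has no causal effect on y.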
Exercise
Which variables should be included?
Effect of x on y
Effect of z on y
Code
library(ggdag)
library(dagitty)
library(tidyverse)

# Exercise DAG with exposure x and outcome y
dagify(
  y ~ n + z + b + c,
  x ~ z + a + c,
  n ~ x,
  z ~ a + b,
  exposure = "x",
  outcome = "y",
  coords = list(
    x = c(n = 2, x = 1, y = 3, a = 1, z = 2, c = 2, b = 3),
    y = c(x = 2, y = 2, a = 3, z = 3, c = 1, b = 3, n = 2)
  )
) %>%
  tidy_dagitty() %>%
  ggdag(text_size = 8, node_size = 12) +
  geom_dag_edges() +
  theme_dag()
library(ggpubr)

# M-bias: a is a collider between the unobserved U1 and U2
p1 <- dagify(
  y ~ x + U2, a ~ U1 + U2, x ~ U1,
  coords = list(
    x = c(x = 1, y = 2, a = 1.5, U1 = 1, U2 = 2),
    y = c(x = 1, y = 1, a = 1.5, U1 = 2, U2 = 2)
  )
) %>%
  tidy_dagitty() %>%
  mutate(fill = ifelse(name %in% c("U1", "U2"), "Unobserved", "Observed")) %>%
  ggplot(aes(x = x, y = y, xend = xend, yend = yend)) +
  geom_dag_point(size = 12, aes(color = fill)) +
  geom_dag_edges(show.legend = FALSE) +
  geom_dag_text() +
  theme_dag() +
  theme(legend.title = element_blank(), legend.position = "bottom") +
  labs(title = "M-Bias")

# Post-treatment bias: a lies on the path x -> a -> y and shares the unobserved cause U with y
p2 <- dagify(
  y ~ a + U, a ~ x + U,
  coords = list(
    x = c(x = 1, y = 2, a = 1.5, U = 1.7),
    y = c(x = 1, y = 1, a = 1, U = 2)
  )
) %>%
  tidy_dagitty() %>%
  mutate(fill = ifelse(name %in% c("U"), "Unobserved", "Observed")) %>%
  ggplot(aes(x = x, y = y, xend = xend, yend = yend)) +
  geom_dag_point(size = 12, aes(color = fill)) +
  geom_dag_edges(show.legend = FALSE) +
  geom_dag_text() +
  theme_dag() +
  theme(legend.title = element_blank(), legend.position = "bottom") +
  labs(title = "Post-treatment Bias")

# Show both DAGs side by side
ggarrange(p1, p2)
Common bad controls
Code
# Selection bias: a is caused by both the treatment x and the outcome y
p1 <- dagify(
  y ~ x, a ~ x + y,
  coords = list(
    x = c(x = 1, y = 2, a = 1.5),
    y = c(x = 1, y = 1, a = 1.5)
  )
) %>%
  tidy_dagitty() %>%
  ggplot(aes(x = x, y = y, xend = xend, yend = yend)) +
  geom_dag_point(size = 12) +
  geom_dag_edges(show.legend = FALSE) +
  geom_dag_text() +
  theme_dag() +
  theme(legend.title = element_blank(), legend.position = "bottom") +
  labs(title = "Selection Bias")

# Case-control bias: a is a descendant of the outcome y
p2 <- dagify(
  y ~ x, a ~ y,
  coords = list(
    x = c(x = 1, y = 2, a = 1.5),
    y = c(x = 1, y = 1, a = 1.5)
  )
) %>%
  tidy_dagitty() %>%
  ggplot(aes(x = x, y = y, xend = xend, yend = yend)) +
  geom_dag_point(size = 12) +
  geom_dag_edges(show.legend = FALSE) +
  geom_dag_text() +
  theme_dag() +
  theme(legend.title = element_blank(), legend.position = "bottom") +
  labs(title = "Case-control Bias")

# Show both DAGs side by side
ggarrange(p1, p2)
Intelligence, education, income
Case-control study: units are sampled ex post, based on the observed outcome. Ex.: Smoking \(\rightarrow\) lung cancer
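Two of the bad controls shown in the DAGs (M-bias and case-control bias) can be simulated to see how much damage they do. This is a sketch in base R; the linear structure, the unit coefficients, the true effect of 1, and all variable names are assumptions made for illustration.

```r
set.seed(1)
n <- 1e5

# M-bias: a is a collider between the unobserved u1 and u2
u1 <- rnorm(n)
u2 <- rnorm(n)
x  <- u1 + rnorm(n)
a  <- u1 + u2 + rnorm(n)
y  <- x + u2 + rnorm(n)          # assumed true effect of x on y: 1
m_without <- coef(lm(y ~ x))[["x"]]
m_with    <- coef(lm(y ~ x + a))[["x"]]

# Case-control bias: a_cc is a descendant of the outcome
x_cc <- rnorm(n)
y_cc <- x_cc + rnorm(n)          # assumed true effect of x_cc on y_cc: 1
a_cc <- y_cc + rnorm(n)
cc_without <- coef(lm(y_cc ~ x_cc))[["x_cc"]]
cc_with    <- coef(lm(y_cc ~ x_cc + a_cc))[["x_cc"]]

print(round(c(m_without = m_without, m_with = m_with,
              cc_without = cc_without, cc_with = cc_with), 2))
```

In both setups the regression without a should recover an effect close to 1. Under these assumptions, adding a should pull the estimate to roughly 0.8 in the M-bias case and roughly 0.5 in the case-control case.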
Exercise
Prepare a short presentation of a (potential) DAG for your thesis
References
Cinelli, Carlos, Andrew Forney, and Judea Pearl. 2020. “A Crash Course in Good and Bad Controls.” SSRN 3689437.
Imbens, Guido W. 2020. “Potential Outcome and Directed Acyclic Graph Approaches to Causality: Relevance for Empirical Practice in Economics.” Journal of Economic Literature 58 (4): 1129–79. https://doi.org/10.1257/jel.20191597.