Causal Pitchfork - The Data

Data-based Storytelling

Authors

Daniel Winkler

Nils Wlömert

Published

January 4, 2024

Introduction

This document deals with a fundamental question of causal inference: which variables should be included in a causal model? (see Cinelli, Forney, and Pearl 2020). To answer this question, two points need to be clear:

  1. In general, each causal model investigates the causal effect of only a single independent variable, \(x_k\), on the dependent variable \(y\). The coefficients associated with all other variables, \(x_{j\neq k}\), cannot (automatically) be interpreted as causal effects. Because regression coefficients are commonly presented in a single table, it is often unclear to the reader which of them can be interpreted causally (see Westreich and Greenland 2013).
  2. Statistical significance (or any other statistical test) tells us nothing about the underlying causal model. To illustrate this, the following figure shows three statistically significant relationships between the variables \(x\) and \(y\) (all t-stats \(> 9\)). However, by construction there is no causal relationship between them in two of these examples. Even more concerning: in one case the exclusion of a control variable leads to a spurious correlation (leftmost plot), while in another the inclusion of a control variable does the same (rightmost plot).
Code
library(tidyverse)
library(patchwork)
set.seed(11)
## Fork
# n ... number of observations
n <- 500
# d ... binary confounder
d <- rbinom(n, 1, 0.5)
x <- 1.5 * d + rnorm(n)
y <- 0.4 + 2 * d + rnorm(n)
data_fork <- data.frame(x, y, d = factor(d, levels = c(0, 1), labels = c("Yes", "No")))
plt_fork <- ggplot(data_fork, aes(x, y)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  theme_minimal() +
  ggtitle("Relation due to omitted confounder")
## Pipe
set.seed(11)
x <- 1 * rnorm(n)
z <- rbinom(n, 1, boot::inv.logit(2 * x + rnorm(n)))
y <- 2 * z + rnorm(n)
data_pipe <- data.frame(x, z = factor(z, levels = c(0, 1), labels = c("Yes", "No")), y)
plt_pipe <- ggplot(data_pipe, aes(x, y)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  theme_minimal() +
  ggtitle("Relation through mediator")
## Collider
set.seed(11)
x <- rnorm(n)
y <- rnorm(n)
a <- rbinom(n, 1, boot::inv.logit(9 * x - 9 * y + rnorm(n)))
data_collider <- data.frame(x, y, a = factor(a, levels = c(0, 1), labels = c("No", "Yes")))
data_collider$x_a <- resid(lm(x ~ 0 + a))
data_collider$y_a <- resid(lm(y ~ 0 + a))
plt_collider <- ggplot(data_collider, aes(x_a, y_a)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  theme_minimal() +
  labs(x = "x", y = "y") +
  theme(legend.position = "top") +
  ggtitle("Relation due to included collider")
plt_fork + plt_pipe + plt_collider

The Fork (Good control)

Code
set.seed(42)
library(ggdag)
library(gt)
library(dagitty)
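## DAG for the fork: the confounder d causes both x and y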
confounder <- dagify(x ~ d, y ~ d,
  coords = list(
    x = c(x = 1, y = 2, d = 1.5),
    y = c(x = 1, y = 2, d = 2)
  )
) |>
  tidy_dagitty() |>
  mutate(fill = ifelse(name == "d", "Confounder", "variables of interest")) |>
  ggplot(aes(x = x, y = y, xend = xend, yend = yend)) +
  geom_dag_point(size = 7, aes(color = fill)) +
  geom_dag_edges(show.legend = FALSE) +
  geom_dag_text() +
  theme_dag() +
  theme(
    legend.title = element_blank(),
    legend.position = "top"
  )
confounder

A typical dataset with a confounder will exhibit a correlation between the treatment \(X\) and the outcome \(y\). This relationship is not causal! In the example below we have a binary confounder \(d\) (Yes/No) that is d-connected with both \(X\) and \(y\); \(X\) and \(y\) themselves are d-separated once we condition on \(d\).
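
As a quick check of this reading of the graph, dagitty can test d-separation directly. A minimal sketch (fork_dag is just an illustrative helper built with the same dagify() call as above):

Code
fork_dag <- dagify(x ~ d, y ~ d)
dseparated(fork_dag, "x", "y")       # FALSE: the path x <- d -> y is open
dseparated(fork_dag, "x", "y", "d")  # TRUE: conditioning on d blocks the path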

Code
set.seed(11)
# n ... number of observations
n <- 500
# d ... binary confounder
d <- rbinom(n, 1, 0.5)
x <- 1.5 * d + rnorm(n)
y <- 0.4 + 2 * d + rnorm(n)
data_fork <- data.frame(x, y, d = factor(d, levels = c(0, 1), labels = c("Yes", "No")))
ggplot(data_fork, aes(x, y)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  theme_minimal()

Code
lm(y ~ x) |>
  broom::tidy() |>
  gt() |>
  fmt_number(
    columns = estimate:p.value,
    decimals = 4
  )
term estimate std.error statistic p.value
(Intercept) 1.0086 0.0671 15.0351 0.0000
x 0.4672 0.0464 10.0576 0.0000

However, once we take the confounder into account, the association vanishes, which reflects the lack of a causal relationship in this case (note that, for simplicity, the regression lines in the plot are not the same as the model output shown).

Code
# options(scipen = 10)
ggplot(data_fork, aes(x, y, color = d)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  theme_minimal() +
  theme(legend.position = "top")

Code
lm(y ~ x * d, data_fork) |>
  broom::tidy() |>
  gt() |>
  fmt_number(
    columns = estimate:p.value,
    decimals = 4
  )
term estimate std.error statistic p.value
(Intercept) 0.4426 0.0633 6.9879 0.0000
x 0.0526 0.0617 0.8519 0.3947
dNo 1.9350 0.1364 14.1815 0.0000
x:dNo −0.0616 0.0914 −0.6741 0.5006

The Pipe (Bad control)

Code
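## DAG for the pipe: x affects y only through the mediator z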
med <- dagify(z ~ x, y ~ z,
  coords = list(x = c(x = 1, z = 1.5, y = 2), y = c(x = 1, y = 1, z = 1))
) |>
  tidy_dagitty() |>
  mutate(fill = ifelse(name == "z", "Mediator", "variables of interest")) |>
  ggplot(aes(x = x, y = y, xend = xend, yend = yend)) +
  geom_dag_point(size = 7, aes(color = fill)) +
  geom_dag_edges(show.legend = FALSE) +
  geom_dag_text() +
  theme_dag() +
  theme(
    legend.title = element_blank(),
    legend.position = "top"
  )
med

If we have a mediator in our data, the picture looks very similar to the previous one. Taking the mediator into account also has a similar effect: it removes the association between \(X\) and \(y\). In this case, however, that is not what we want, since \(X\) and \(y\) are d-connected: \(X\) causes \(y\) through \(z\) (note that, for simplicity, the regression lines in the second plot are not the same as the model output shown).
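
The d-separation check from the fork section can be repeated here; it shows why the mediator is a bad control even though the picture looks the same. A minimal sketch (med_dag is just an illustrative helper):

Code
med_dag <- dagify(z ~ x, y ~ z)
dseparated(med_dag, "x", "y")       # FALSE: x -> z -> y is open, x does affect y
dseparated(med_dag, "x", "y", "z")  # TRUE: conditioning on z blocks the causal path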

Code
set.seed(11)
x <- 1 * rnorm(n)
z <- rbinom(n, 1, boot::inv.logit(2 * x + rnorm(n)))
y <- 2 * z + rnorm(n)
data_pipe <- data.frame(x, z = factor(z, levels = c(0, 1), labels = c("Yes", "No")), y)
ggplot(data_pipe, aes(x, y)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  theme_minimal()

Code
lm(y ~ x) |>
  broom::tidy() |>
  gt() |>
  fmt_number(
    columns = estimate:p.value,
    decimals = 4
  )
term estimate std.error statistic p.value
(Intercept) 0.9028 0.0586 15.4170 0.0000
x 0.5804 0.0593 9.7944 0.0000
Code
ggplot(data_pipe, aes(x, y, color = z)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  theme_minimal() +
  theme(legend.position = "top")

Code
lm(y ~ x * z) |>
  broom::tidy() |>
  gt() |>
  fmt_number(
    columns = estimate:p.value,
    decimals = 4
  )
term estimate std.error statistic p.value
(Intercept) −0.0393 0.0755 −0.5205 0.6029
x −0.0185 0.0787 −0.2349 0.8143
z 2.0562 0.1137 18.0855 0.0000
x:z −0.0324 0.1146 −0.2826 0.7776

The Collider (Bad control)

Code
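## DAG for the collider: both x and y cause a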
dagify(a ~ x, a ~ y,
  coords = list(x = c(x = 1, y = 2, a = 1.5), y = c(x = 1, y = 0,  a = 0))
) |>
  tidy_dagitty() |>
  mutate(fill = ifelse(name == "a", "Collider", "variables of interest")) |>
  ggplot(aes(x = x, y = y, xend = xend, yend = yend)) +
  geom_dag_point(size = 7, aes(color = fill)) +
  geom_dag_edges(show.legend = FALSE) +
  geom_dag_text() +
  theme_dag() +
  theme(
    legend.title = element_blank(),
    legend.position = "top"
  )

The collider is a special case. There is no association between \(X\) and \(y\) as long as we do not account for the collider in the model. However, by accounting for the collider we implicitly learn about \(y\) as well (once we use \(X\) as a predictor). Since the collider is caused by both \(X\) and \(y\), knowing \(X\) and the collider lets us work out roughly what \(y\) must be, similar to solving a simple equation from high school: in the simulation below, \(a\) is (noisily) determined by the sign of \(9x - 9y\), so given \(a\) and \(x\) we know whether \(y\) tends to lie above or below \(x\).
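
The corresponding d-separation check mirrors this intuition: \(X\) and \(y\) start out independent, and it is conditioning on the collider that opens a path between them. A minimal sketch (col_dag is just an illustrative helper):

Code
col_dag <- dagify(a ~ x, a ~ y)
dseparated(col_dag, "x", "y")       # TRUE: no open path between x and y
dseparated(col_dag, "x", "y", "a")  # FALSE: conditioning on the collider opens one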

Code
set.seed(11)
x <- rnorm(n)
y <- rnorm(n)
a <- rbinom(n, 1, boot::inv.logit(9 * x - 9 * y + rnorm(n)))
data_collider <- data.frame(x, y, a = factor(a, levels = c(0, 1), labels = c("No", "Yes")))
ggplot(data_collider, aes(x, y)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  theme_minimal()

Code
lm(y ~ x) |>
  broom::tidy() |>
  gt() |>
  fmt_number(columns = estimate:p.value, decimals = 4)
term estimate std.error statistic p.value
(Intercept) 0.0203 0.0449 0.4519 0.6515
x 0.0307 0.0455 0.6754 0.4997
Code
data_collider$x_a <- resid(lm(x ~ 0 + a))
data_collider$y_a <- resid(lm(y ~ 0 + a))
ggplot(data_collider, aes(x_a, y_a)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  theme_minimal() +
  labs(x = "x after accounting for a", y = "y after accounting for a") +
  theme(legend.position = "top")

Code
lm(y ~ x + a, data_collider) |>
  broom::tidy() |>
  gt() |>
  fmt_number(columns = estimate:p.value, decimals = 4)
term estimate std.error statistic p.value
(Intercept) 0.8614 0.0538 16.0142 0.0000
x 0.5328 0.0422 12.6293 0.0000
aYes −1.6663 0.0834 −19.9822 0.0000

Appendix: How the sausage is made

The confounder (fork), mediator (pipe), and collider were generated as binary variables to make visualization easier. Binary variables can be drawn from a Bernoulli distribution, which is a special case of the binomial distribution with size = 1. The distribution takes the probability of drawing a \(1\) as input.
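
As a quick sanity check (a minimal sketch), the share of ones in a large Bernoulli sample should be close to the probability passed to rbinom():

Code
## Bernoulli = binomial with size = 1; the third argument is P(draw = 1)
mean(rbinom(100000, size = 1, prob = 0.3))  # share of ones, close to 0.3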

The Fork

Code
## Make code reproducible
set.seed(11)
## Number of observations
n <- 1500
## Random draw from Bernoulli with p(1) = 0.5, p(0) = 0.5
d <- rbinom(n, 1, 0.5)
## X is caused by d
x <- 2 * d + rnorm(n)
## y is caused by d
y <- 0.4 + 2.5 * d + rnorm(n)
fork <- data.frame(x, y, d = factor(d,
  levels = c(0, 1),
  labels = c("No", "Yes")
))
ggplot(fork, aes(x, y, color = d)) +
  geom_point() +
  theme_minimal() +
  theme(legend.position = "top")

The Pipe

Code
## Generate random X
x <- rnorm(n)
## inv.logit ensures that values are between 0 and 1
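## (inv.logit(x) = 1 / (1 + exp(-x)), the standard logistic function)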
ggplot(data.frame()) +
  stat_function(fun = boot::inv.logit, xlim = c(-10, 10)) +
  theme_minimal() +
  labs(title = "Inverse Logit function", x = "x", y = "inv.logit(x)")

Code
## z is caused by X
z <- rbinom(n, 1, boot::inv.logit(2 * x + rnorm(n)))
## y is caused by z
y <- 2 * z + rnorm(n)
pipe <- data.frame(x, y, z = factor(z,
  levels = c(0, 1),
  labels = c("Yes", "No")
))
ggplot(pipe, aes(x, y, color = z)) +
  geom_point() +
  theme_minimal() +
  theme(legend.position = "top")

The Collider

Code
## Generate random x
x <- rnorm(n)
## Generate random y
y <- rnorm(n)
## a is caused by both X and y
a <- rbinom(n, 1, boot::inv.logit(9 * x - 9 * y + rnorm(n)))
collider <- data.frame(x, y, a = factor(a,
  levels = c(0, 1),
  labels = c("No", "Yes")
))
ggplot(collider, aes(x, y)) +
  geom_point() +
  theme_minimal()

In order to get the partial correlation of \(X\) and \(y\) after accounting for \(a\), we first regress both \(X\) and \(y\) on \(a\) and use the unexplained parts (the residuals) in the plot. This is equivalent to a regression that has both \(X\) and \(a\) as explanatory variables (a quick numerical check follows after the next code block).

Code
collider$x_a <- residuals(lm(x ~ 0 + a))
collider$y_a <- residuals(lm(y ~ 0 + a))
ggplot(collider, aes(x_a, y_a)) +
  geom_point() +
  theme_minimal() +
  labs(x = "x after accounting for a", y = "y after accounting for a")
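
This equivalence is the Frisch-Waugh-Lovell theorem: the slope from the residual-on-residual regression is identical to the coefficient on \(x\) in the regression that also includes \(a\). A minimal sketch using the residuals created above:

Code
## Both calls should return (numerically) the same slope for x
coef(lm(y_a ~ x_a, data = collider))["x_a"]
coef(lm(y ~ x + a, data = collider))["x"]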

References

Cinelli, Carlos, Andrew Forney, and Judea Pearl. 2020. “A Crash Course in Good and Bad Controls.” SSRN 3689437.
Westreich, Daniel, and Sander Greenland. 2013. “The Table 2 Fallacy: Presenting and Interpreting Confounder and Modifier Coefficients.” American Journal of Epidemiology 177 (4): 292–98.