Exploration, Inference, and Prediction

Day 7

Felix Nößler

Freie Universität Berlin @ Theoretical Ecology

January 20, 2024

1 Exploration, Prediction, or Inference?

  • be clear about your objectives before you start analysing your dataset:
    • Exploration or description
    • Prediction
    • Inference: association and causal inference
  • you cannot do exploration and inference with the same data set

Z. M. Laubach, E. J. Murray, K. L. Hoke, R. J. Safran, and W. Perng, “A biologist’s guide to model selection and causal inference,” Proceedings of the Royal Society B: Biological Sciences, 2021. doi: 10.1098/rspb.2020.2815.

A. T. Tredennick, G. Hooker, S. P. Ellner, and P. B. Adler, “A practical guide to selecting models for exploration, inference, and prediction in ecology,” Ecology, 2021. doi: 10.1002/ecy.3336.

1.1 Prediction vs Inference

\[ \hat y = m \cdot x + b \]

  • in prediction we care about \(\hat y\) and want to minimize \(\sum_i (\hat y_i - y_i)^2\)
  • in inference we care about correct estimates of \(m\) and \(b\) (see the sketch below)
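
A minimal sketch of the difference, fitting the same line to simulated data (all numbers are made up for illustration):

# simulate data with a known slope m = 2 and intercept b = 1
set.seed(42)
x <- rnorm(100)
y <- 2 * x + 1 + rnorm(100, sd = 0.5)
fit <- lm(y ~ x)

coef(fit)               # inference: estimates of b (intercept) and m (slope)
sum(residuals(fit)^2)   # prediction: the sum of squared errors we want to be small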

1.2 Exploration

  • exploratory data analysis tries to describe and summarize the main characteristics of the data set
  • no hypothesis testing is involved; exploration is used to generate hypotheses
  • complete data set is used
  • methods: mean, median, boxplots, histograms, clustering, principal component analysis, nonmetric multidimensional scaling

Example: analysis of vegetation data from the Alps excursion (a small code sketch with a stand-in data set follows below)
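
A minimal sketch of such an exploratory workflow, using the built-in iris data purely as a stand-in for a vegetation data set:

# describe and summarize the data set; no hypotheses are tested
summary(iris[, 1:4])                          # means, medians, quartiles
boxplot(Sepal.Length ~ Species, data = iris)  # distribution per group
hist(iris$Petal.Length)                       # distribution of one variable

# principal component analysis to look for structure in the data
pca <- prcomp(iris[, 1:4], scale. = TRUE)
biplot(pca)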

1.3 Prediction

  • prediction is the process of estimating the value of a variable of interest based on other variables
  • data set is split into training and test set
  • selection of variables is based on predictive performance
  • the main interest is not to understand how the predictions are made (e.g. by neural networks or model averaging) but to make accurate predictions
  • overfitting is a major problem
  • methods: random forest, neural networks

There is a large overlap between prediction and machine learning:

M. Pichler and F. Hartig, “Machine learning and deep learning—A review for ecologists,” Methods in Ecology and Evolution, 2023. doi: 10.1111/2041-210x.14061.

Example: occupancy model of the Black Kite in Spain under current and future climatic conditions (a generic train/test sketch follows below)
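
A minimal sketch of such a train/test workflow; the random forest and the built-in iris data are only stand-ins for an ecological model and data set:

library(randomForest)   # assumption: the randomForest package is installed

# split the data into a training and a test set
set.seed(1)
train_id <- sample(nrow(iris), 0.7 * nrow(iris))
train <- iris[train_id, ]
test <- iris[-train_id, ]

# fit on the training set, judge predictive performance on the test set
rf <- randomForest(Sepal.Length ~ ., data = train)
pred <- predict(rf, newdata = test)
mean((pred - test$Sepal.Length)^2)   # test mean squared error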

1.4 Inference

  • infer properties of a population by testing hypotheses and deriving estimates; the observed data are regarded as a sample from a larger population
  • complete data set is used
  • be careful with interpreting the estimates in a causal way; for observational data, better write “is associated with” instead of “causes” or “leads to”
  • selection of variables is based on hypotheses
  • methods: (Bayesian) graphical models, structural equation models

1.5 What can we do with (generalized) linear models?

  • linear models are a very flexible tool for exploration, prediction, and inference
  • extra care is needed not to mix all three objectives in one analysis (see the sketch below)
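
A minimal sketch of how the same linear model serves the different objectives (iris is again only a stand-in data set):

fit <- lm(Sepal.Length ~ Sepal.Width + Petal.Length, data = iris)

summary(fit)                          # inference: estimates, standard errors, tests
confint(fit)                          # inference: confidence intervals
predict(fit, newdata = iris[1:5, ])   # prediction: values for new observations
plot(fitted(fit), residuals(fit))     # exploration/diagnostics of the fit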

2 Introduction to Causal Inference

  • when you do inference, you cannot always interpret the results as causal effects
  • first, we will do a theoretical introduction and then a simulation experiment to illustrate the problem

2.1 Causal graphs

  • draw a directed acyclic graph that contains the complete causal structure
  • confounders influence both \(x\) and \(y\)
  • colliders are influenced by both \(x\) and \(y\)
  • mediators are influenced by \(x\) and influence \(y\)
flowchart LR
  A[Variable x] --> M[Mediator] 
  M --> O[Outcome y]
  A ==> O
  C[Confounder] --> O
  C --> A
  A --> D
  O --> D[Collider]
  
style A fill:#f80
style O fill:#f80

see also “Mediators, confounders, colliders – a crash course in causal inference” by Florian Hartig

3 Simulation experiment: Collider and Confounder Bias

  • the true effect size of \(x\) on \(y\) is 0.7; we assume the following causal structure:
flowchart LR
  X -->|0.7| Y
  Confounder -->|0.14| X
  Confounder -->|0.11| Y
  X -->|0.43| Collider
  Y -->|0.21| Collider

3.1 Generate test data

  • we generate data that follows this causal structure; then we fit different models to the data and compare the estimated effect size of \(x\) on \(y\) to the true effect size
library(dplyr)
library(ggplot2)
library(piecewiseSEM)

set.seed(123)
n <- 2000
true_effect_size <- 0.7

# simulate data that follows the causal graph above
confounder_var <- rnorm(n)                                            # exogenous confounder
x_var <- rnorm(n, 0.14*confounder_var, 0.4)                           # x depends on the confounder
y_var <- rnorm(n, true_effect_size*x_var + 0.11*confounder_var, 0.24) # y depends on x and the confounder
collider_var <- rnorm(n, 0.43*x_var + 0.21*y_var, 0.22)               # collider depends on x and y

df <- tibble(
  x = x_var,
  y = y_var,
  confounder = confounder_var,
  collider = collider_var)

3.2 Include only x → y

Our assumed structure is as follows:

flowchart LR
  X --> Y
lm1 <- lm(y ~ x, data = df)
c1 <- coefs(lm1) # is needed later
coef(lm1)
(Intercept)           x 
0.003660183 0.791415440 
  • if not all confounding factors are included, the estimate is biased.

3.3 Include all variables

Our next idea could be to include all variables as predictors (x, confounder, and collider). Our assumed structure:

flowchart LR
  X --> Y
  Confounder --> Y
  Collider --> Y
lm2 <- lm(y ~ x + confounder + collider, data = df)
c2 <- coefs(lm2) # is needed later
coef(lm2)
(Intercept)           x  confounder    collider 
0.001066547 0.578216064 0.101149710 0.225615265 
  • if a collider is included in the model, the estimate of the effect of \(x\) on \(y\) is biased

3.4 Include all variables but collider

Our next idea could be to include all variables as predictors but not the collider. Our assumed structure:

flowchart LR
  X --> Y
  Confounder --> Y
lm3 <- lm(y ~ x + confounder, data = df)
c3 <- coefs(lm3)
coef(lm3)
(Intercept)           x  confounder 
0.000406101 0.708802834 0.106130448 
  • if all confounders but no colliders are included, the estimated effect is close to the true effect size

3.5 Use a structural equation model

Another option is to specify the complete structure. This model is called a structural equation model (SEM).

mod <- psem(
  lm(y ~ confounder + x, data = df),
  lm(x ~ confounder, data = df),
  lm(collider ~ x + y, data = df)
)
c4 <- coefs(mod, standardize = "none")
c4[2, ]
  Response Predictor Estimate Std.Error   DF Crit.Value P.Value    
2        y         x   0.7088    0.0136 1997    52.2153       0 ***
plot(mod, show = "unstd")   
  • There is a complete course on SEMs taught by Oksana Buzhdygan and Felix May in the winter semester.


3.6 Summary

df_summary <- tibble(
  estimate = c(c1$Estimate, c2$Estimate[1], c3$Estimate[1], c4$Estimate[2]),
  SE = c(c1$Std.Error, c2$Std.Error[1], c3$Std.Error[1], c4$Std.Error[2]),
  label = factor(
    c("only x", "all variables", "x + confounder", "SEM"),
    levels = c("only x", "all variables", "x + confounder", "SEM"))
)

df_summary %>%
  ggplot() +
    geom_point(aes(estimate, label), size = 3) +
    geom_vline(aes(xintercept = true_effect_size),
               color = "orange", linewidth = 1.5) +
    annotate("text",
             x = true_effect_size, y = 4.3, label = "true effect size",
             color = "orange", size = 7, hjust = -0.05) +
    geom_errorbar(aes(
      xmin = estimate - SE, xmax = estimate + SE,
      y = label), width = 0.0) +
    labs(y = "predictor variables included", x = "Effect size of x → y (± SE)") +
    theme_classic() +
    theme(text = element_text(size = 18))

Further reading: “Causal Inference: have you been doing science wrong all this time?” implements the same analysis in Python and from a Bayesian perspective

3.7 How to specify the model with R

flowchart LR
  A[Variable x] --> O[Outcome y]
  D[Confounder] --> O
  
style A fill:#f80
style O fill:#f80
  • include your variable of interest and all confounders, but no colliders, on the right-hand side of your model formula: y ~ x + confounder1 + confounder2
  • mediators can be omitted if you are not interested in disentangling the direct and indirect effects of \(x\) on \(y\) (see the sketch below)
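
As a cross-check, the adjustment set can also be read off the causal graph with software; this sketch assumes the dagitty package, which is not otherwise used in this course:

library(dagitty)   # assumption: the dagitty package is installed

# write down the causal graph of the simulation experiment
dag <- dagitty("dag {
  x -> y
  confounder -> x
  confounder -> y
  x -> collider
  y -> collider
}")

# which variables have to be adjusted for to estimate the effect of x on y?
adjustmentSets(dag, exposure = "x", outcome = "y")
# expected: { confounder }, i.e. include the confounder but not the collider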

4 Difference between experimental and observational data

Experiments:

  • we either directly measure confounding variables or we can assume that confounders do not vary between measurements

Observations:

  • it is hardly possible to measure all confounders
  • causal inference is difficult

5 How do we come up with a causal structure?

  • check if there is already a mechanistic understanding of the system
  • ideas can come from exploratory data analysis
  • there is also a field called causal discovery that tries to infer causal structures from data