Exploration, Inference, and Prediction

Day 7

Felix Nößler

Freie Universität Berlin @ Theoretical Ecology

January 20, 2024

1 Exploration, Prediction, or Inference?

  • be clear about your objectives before you start analysing your dataset:
    • Exploration or description
    • Prediction
    • Inference: association and causal inference
  • you cannot do exploration and inference with the same data set

Z. M. Laubach, E. J. Murray, K. L. Hoke, R. J. Safran, and W. Perng, “A biologist’s guide to model selection and causal inference,” Proceedings of the Royal Society B: Biological Sciences, 2021. doi: 10.1098/rspb.2020.2815.

A. T. Tredennick, G. Hooker, S. P. Ellner, and P. B. Adler, “A practical guide to selecting models for exploration, inference, and prediction in ecology,” Ecology, 2021. doi: 10.1002/ecy.3336.

1.1 Prediction vs Inference

\[ \hat y = m \cdot x + b \]

  • in prediction we care about \(\hat y\) and want to minimize \(\sum_i (\hat y_i - y_i)^2\)
  • in inference we care about correct estimates of \(m\) and \(b\) (see the sketch below)
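
A minimal sketch of the difference, fitting the same line to simulated data (all numbers are made up for illustration):

# simulate data with a known slope m = 2 and intercept b = 1
set.seed(42)
x <- rnorm(100)
y <- 2 * x + 1 + rnorm(100, sd = 0.5)
fit <- lm(y ~ x)

coef(fit)               # inference: estimates of b (intercept) and m (slope)
sum(residuals(fit)^2)   # prediction: the sum of squared errors we want to be small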

1.2 Exploration

  • exploratory data analysis tries to describe and summarize the main characteristics of the data set
  • no hypothesis testing is involved; exploration is used to generate hypotheses
  • complete data set is used
  • methods: mean, median, boxplots, histograms, clustering, principal component analysis, nonmetric multidimensional scaling

Example: analysis of vegetation data from the Alps excursion (a small code sketch with a stand-in data set follows below)
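
A minimal sketch of such an exploratory workflow, using the built-in iris data purely as a stand-in for a vegetation data set:

# describe and summarize the data set; no hypotheses are tested
summary(iris[, 1:4])                          # means, medians, quartiles
boxplot(Sepal.Length ~ Species, data = iris)  # distribution per group
hist(iris$Petal.Length)                       # distribution of one variable

# principal component analysis to look for structure in the data
pca <- prcomp(iris[, 1:4], scale. = TRUE)
biplot(pca)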

1.3 Prediction

  • prediction is the process of estimating the value of a variable of interest based on other variables
  • data set is split into training and test set
  • selection of variables is based on predictive performance
  • the main interest is not to understand how the predictions are made (e.g. by neural networks or model averaging) but to make accurate predictions
  • overfitting is a major problem
  • methods: random forest, neural networks

There is a large overlap between prediction and machine learning:

M. Pichler and F. Hartig, “Machine learning and deep learning—A review for ecologists,” Methods in Ecology and Evolution, 2023. doi: 10.1111/2041-210x.14061.

Example: occupancy model of the Black Kite in Spain under current and future climatic conditions (a generic train/test sketch follows below)
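
A minimal sketch of such a train/test workflow; the random forest and the built-in iris data are only stand-ins for an ecological model and data set:

library(randomForest)   # assumption: the randomForest package is installed

# split the data into a training and a test set
set.seed(1)
train_id <- sample(nrow(iris), 0.7 * nrow(iris))
train <- iris[train_id, ]
test <- iris[-train_id, ]

# fit on the training set, judge predictive performance on the test set
rf <- randomForest(Sepal.Length ~ ., data = train)
pred <- predict(rf, newdata = test)
mean((pred - test$Sepal.Length)^2)   # test mean squared error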

1.4 Inference

  • infer properties of a population by testing hypotheses and deriving estimates; the observed data are regarded as a sample from a larger population
  • complete data set is used
  • be careful with interpreting the estimates in a causal way; for observational data, better write “is associated with” instead of “causes” or “leads to”
  • selection of variables is based on hypotheses
  • methods: (Bayesian) graphical models, structural equation models

1.5 What can we do with (generalized) linear models?

  • linear models are a very flexible tool for exploration, prediction, and inference
  • extra care is needed not to mix all three objectives in one analysis (see the sketch below)
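
A minimal sketch of how the same linear model serves the different objectives (iris is again only a stand-in data set):

fit <- lm(Sepal.Length ~ Sepal.Width + Petal.Length, data = iris)

summary(fit)                          # inference: estimates, standard errors, tests
confint(fit)                          # inference: confidence intervals
predict(fit, newdata = iris[1:5, ])   # prediction: values for new observations
plot(fitted(fit), residuals(fit))     # exploration/diagnostics of the fit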

2 Introduction to Causal Inference

  • when you do inference, you cannot always interpret the results as causal effects
  • first, we will do a theoretical introduction and then a simulation experiment to illustrate the problem

2.1 Causal graphs

  • draw a directed acyclic graph that contains the complete causal structure
  • confounders influence both \(x\) and \(y\)
  • colliders are influenced by both \(x\) and \(y\)
  • mediators are influenced by \(x\) and influence \(y\)
flowchart LR
  A[Variable x] --> M[Mediator] 
  M --> O[Outcome y]
  A ==> O
  C[Confounder] --> O
  C --> A
  A --> D
  O --> D[Collider]
  
style A fill:#f80
style O fill:#f80

see also “Mediators, confounders, colliders – a crash course in causal inference” by Florian Hartig

3 Simulation experiment: Collider and Confounder Bias

  • the true effect size of \(x\) on \(y\) is 0.7; we assume the following causal structure:
flowchart LR
  X -->|0.7| Y
  Confounder -->|0.14| X
  Confounder -->|0.11| Y
  X -->|0.43| Collider
  Y -->|0.21| Collider

3.1 Generate test data

  • we generate data that follows this causal structure; then we fit different models to the data and compare the estimated effect size of \(x\) on \(y\) to the true effect size
library(dplyr)
library(ggplot2)
library(piecewiseSEM)

set.seed(123)
n <- 2000
true_effect_size <- 0.7

# simulate data that follows the causal graph above
confounder_var <- rnorm(n)                                            # exogenous confounder
x_var <- rnorm(n, 0.14*confounder_var, 0.4)                           # x depends on the confounder
y_var <- rnorm(n, true_effect_size*x_var + 0.11*confounder_var, 0.24) # y depends on x and the confounder
collider_var <- rnorm(n, 0.43*x_var + 0.21*y_var, 0.22)               # collider depends on x and y

df <- tibble(
  x = x_var,
  y = y_var,
  confounder = confounder_var,
  collider = collider_var)

3.2 Include only x → y

Our assumed structure is as follows:

flowchart LR
  X --> Y
lm1 <- lm(y ~ x, data = df)
c1 <- coefs(lm1) # is needed later
coef(lm1)
(Intercept)           x 
0.003660183 0.791415440 
  • if not all confounding factors are included, the estimate is biased.

3.3 Include all variables

Our next idea could be to include all variables as predictors (x, confounder, and collider). Our assumed structure:

flowchart LR
  X --> Y
  Confounder --> Y
  Collider --> Y
lm2 <- lm(y ~ x + confounder + collider, data = df)
c2 <- coefs(lm2) # is needed later
coef(lm2)
(Intercept)           x  confounder    collider 
0.001066547 0.578216064 0.101149710 0.225615265 
  • if a collider is included in the model, the estimate of the effect of \(x\) on \(y\) is biased

3.4 Include all variables but collider

Our next idea could be to include all variables as predictors but not the collider. Our assumed structure:

flowchart LR
  X --> Y
  Confounder --> Y
lm3 <- lm(y ~ x + confounder, data = df)
c3 <- coefs(lm3)
coef(lm3)
(Intercept)           x  confounder 
0.000406101 0.708802834 0.106130448 
  • if all confounders but no colliders are included, the estimated effect is close to the true effect size

3.5 Use a structural equation model

Another option is to specify the complete structure. This model is called a structural equation model (SEM).

mod <- psem(
  lm(y ~ confounder + x, data = df),
  lm(x ~ confounder, data = df),
  lm(collider ~ x + y, data = df)
)
c4 <- coefs(mod, standardize = "none")
c4[2, ]
  Response Predictor Estimate Std.Error   DF Crit.Value P.Value    
2        y         x   0.7088    0.0136 1997    52.2153       0 ***
plot(mod, show = "unstd")   
  • There is a complete course on SEMs taught by Oksana Buzhdygan and Felix May in the winter semester.


3.6 Summary

df_summary <- tibble(
  estimate = c(c1$Estimate, c2$Estimate[1], c3$Estimate[1], c4$Estimate[2]),
  SE = c(c1$Std.Error, c2$Std.Error[1], c3$Std.Error[1], c4$Std.Error[2]),
  label = factor(
    c("only x", "all variables", "x + confounder", "SEM"),
    levels = c("only x", "all variables", "x + confounder", "SEM"))
)

df_summary %>%
  ggplot() +
    geom_point(aes(estimate, label), size = 3) +
    geom_vline(aes(xintercept = true_effect_size),
               color = "orange", linewidth = 1.5) +
    annotate("text",
             x = true_effect_size, y = 4.3, label = "true effect size",
             color = "orange", size = 7, hjust = -0.05) +
    geom_errorbar(aes(
      xmin = estimate - SE, xmax = estimate + SE,
      y = label), width = 0.0) +
    labs(y = "predictor variables included", x = "Effect size of x → y (± SE)") +
    theme_classic() +
    theme(text = element_text(size = 18))

Further reading: “Causal Inference: have you been doing science wrong all this time?” implements the same analysis in Python and from a Bayesian perspective

3.7 How to specify the model with R

flowchart LR
  A[Variable x] --> O[Outcome y]
  D[Confounder] --> O
  
style A fill:#f80
style O fill:#f80
  • include your variable of interest and all confounders, but no colliders, on the right-hand side of your model formula: y ~ x + confounder1 + confounder2
  • mediators can be omitted if you are not interested in disentangling the direct and indirect effects of \(x\) on \(y\) (see the sketch below)
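
As a cross-check, the adjustment set can also be read off the causal graph with software; this sketch assumes the dagitty package, which is not otherwise used in this course:

library(dagitty)   # assumption: the dagitty package is installed

# write down the causal graph of the simulation experiment
dag <- dagitty("dag {
  x -> y
  confounder -> x
  confounder -> y
  x -> collider
  y -> collider
}")

# which variables have to be adjusted for to estimate the effect of x on y?
adjustmentSets(dag, exposure = "x", outcome = "y")
# expected: { confounder }, i.e. include the confounder but not the collider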

4 Difference between experimental and observational data

Experiments:

  • we either directly measure confounding variables or we can assume that confounders do not vary between measurements

Observations:

  • it is hardly possible to measure all confounders
  • causal inference is difficult

5 How do we come up with a causal structure?

  • check if there is already a mechanistic understanding of the system
  • ideas can come from exploratory data analysis
  • there is also a field called causal discovery that tries to infer causal structures from data