Data transformation for linear models

Day 7

Felix Nößler

Freie Universität Berlin @ Theoretical Ecology

January 23, 2024

Reproduce slides

library(tidyr)
library(dplyr)
library(ggplot2)
library(ggfortify)

## theme for ggplot
theme_set(theme_classic())
theme_update(text = element_text(size = 14))

1 When to transform data?

The relationship between the response and the predictor is non-linear/exponential \(\rightarrow\) log-transform the predictor variable
You want an interpretable intercept \(\rightarrow\) center the predictor variables
You want to compare the coefficients of different predictors \(\rightarrow\) standardize (center and divide by the standard deviation) the predictor variables
You have quantities that span several orders of magnitude \(\rightarrow\) log-transform the variables to be able to compare them
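The transformations listed above can be sketched in a few lines of base R. This is a minimal illustration on simulated data, not part of the original analysis: the data frame `dat` and the columns `temperature` and `body_mass` are hypothetical names chosen for the example.

```r
# Simulated data; all names and values are illustrative assumptions
set.seed(1)
dat <- data.frame(
  temperature = rnorm(50, mean = 15, sd = 4),
  body_mass   = exp(rnorm(50, mean = 3, sd = 1.5))  # spans orders of magnitude
)

# Log-transform a variable that spans several orders of magnitude
dat$log_body_mass <- log10(dat$body_mass)

# Center: subtract the mean, so the intercept refers to the mean temperature
dat$temperature_c <- dat$temperature - mean(dat$temperature)

# Standardize: center and divide by the standard deviation
dat$temperature_z <- (dat$temperature - mean(dat$temperature)) / sd(dat$temperature)

# Centered variables have mean 0, standardized variables also have sd 1
round(c(mean_c = mean(dat$temperature_c), sd_z = sd(dat$temperature_z)), 10)
```

The base R function `scale()` performs the same centering and standardizing in one call; the explicit version is shown here only to make each step visible.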

2 When is it better not to use transformations?

Often it’s better to use a generalized linear model (GLM) instead of transforming the response variable, especially if you want to test a hypothesis.

see, for example: R. B. O’Hara and D. J. Kotze, “Do not log‐transform count data,” Methods in Ecology and Evolution, 2010. doi: 10.1111/j.2041-210x.2010.00021.x.
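To make the contrast concrete, the following sketch fits both approaches to simulated count data. The data, the sample size, and the true coefficients (intercept 0.5, slope 1) are assumptions made up for this illustration; only `lm()` and `glm()` from base R are used.

```r
# Simulated count data; true model: log(lambda) = 0.5 + 1.0 * x (assumed values)
set.seed(42)
n <- 200
x <- runif(n, 0, 2)
counts <- rpois(n, lambda = exp(0.5 + 1.0 * x))

# Option A: linear model on log-transformed counts
# (adding 1 is needed because log(0) is not finite)
fit_lm <- lm(log(counts + 1) ~ x)

# Option B: Poisson GLM with log link, models the counts directly
fit_glm <- glm(counts ~ x, family = poisson(link = "log"))

# Compare the estimated coefficients with the true values 0.5 and 1.0
coef(fit_lm)
coef(fit_glm)
```

The GLM needs no arbitrary constant added to the response, and its coefficients are on the scale of the log-linear model that generated the data.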

3 Transformation for visualisation

Source on wikimedia

4 Workflow

flowchart LR
  A[fit model\nto raw\ndata] --> B[check\nassumptions]
  B -->|if not\nfulfilled| T[transform\ndata]
  T --> J
  J[fit model to\ntransformed\ndata] --> K[check\nassumptions]
  K --> P
  B -->|if fulfilled| P
  P[make\npredictions] --> S[plot raw\ndata and\npredictions]
  
style T fill:#f80
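
The flowchart above could be followed in R roughly as below. This is a sketch on simulated data in which the response depends on the logarithm of the predictor; the variable names, the true coefficients, and the choice of a log-transformed predictor are assumptions for the example. Base graphics are used to keep it self-contained.

```r
# Simulated data: y depends linearly on log(x) (assumed for illustration)
set.seed(7)
n <- 100
x <- exp(runif(n, 0, 5))
dat <- data.frame(x = x, y = 2 + 1.5 * log(x) + rnorm(n, sd = 0.5))

# 1. Fit model to raw data
fit_raw <- lm(y ~ x, data = dat)

# 2. Check assumptions, e.g. residuals vs. fitted values
plot(fitted(fit_raw), resid(fit_raw), xlab = "fitted", ylab = "residuals")

# 3. + 4. Transform the predictor and fit the model again
fit_log <- lm(y ~ log(x), data = dat)

# 5. Check assumptions of the new model
plot(fitted(fit_log), resid(fit_log), xlab = "fitted", ylab = "residuals")

# 6. Make predictions over the observed range of the raw predictor
new_dat <- data.frame(x = seq(min(dat$x), max(dat$x), length.out = 200))
new_dat$y_pred <- predict(fit_log, newdata = new_dat)

# 7. Plot raw data and predictions on the original scale
plot(dat$x, dat$y, xlab = "x", ylab = "y")
lines(new_dat$x, new_dat$y_pred, col = "orange", lwd = 2)
```

Because the transformation is written inside the model formula, `predict()` applies it to `new_dat` automatically, so the predictions can be plotted directly against the raw predictor values.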