library(tidyr)
library(dplyr)
library(ggplot2)
library(readr)
library(forcats)
library(ggfortify)
library(MASS)
## theme for ggplot
theme_set(theme_classic())
theme_update(text = element_text(size = 14))Day 8
Freie Universität Berlin @ Theoretical Ecology
Food for thought
Think about examples for count, proportion, and binary data in your favourite sub-field of Biology!



Linear models not appropriate for count, proportion and binary data!
soil_moisture <- runif(n = 50, min = 10, max = 50)
a <- -2
b <- 0.1
lin_pred <- a + soil_moisture*b
lambda <- exp(lin_pred)
y <- rpois(length(lambda), lambda = lambda)
dat1 <- tibble(soil_moisture,
no_of_seedlings = y)
ggplot(dat1, aes(soil_moisture, no_of_seedlings)) +
geom_point(size = 3, col = "seagreen") +
xlab("Soil moisture") +
ylab("No. of seedlings") +
geom_smooth(method = "glm", method.args = list(family = "poisson"),
color = "red")
no_of_seeds <- rep(seq(10, 100, by = 10), each = 6)
a <- 7
b <- -0.12
lin_pred <- a + no_of_seeds*b
pi <- exp(lin_pred)/(1 + exp(lin_pred))
y <- rbinom(length(pi), size = no_of_seeds, prob = pi)
dat1 <- tibble(no_of_seeds,
no_of_seedlings = y) %>%
mutate(prop_germinated = no_of_seedlings / no_of_seeds)
ggplot(dat1, aes(no_of_seeds, prop_germinated)) +
geom_point(size = 3, col = "skyblue3") +
xlab("No. of seeds") +
ylab("Proportion of germinated seeds") +
ggtitle("Density dependent germination rate") +
geom_smooth(method = "glm", method.args = list(family = "binomial"),
color = "red")
Starting point: Linear models
\[y_i \sim \mathcal{N}(\mu_i, \sigma^2)\]
\[\mu_i = \alpha + \sum \beta_j x_{i,j}\]
GLM: 3 components
Like in LM: Linear combination of the effects of the predictors \(X_j\)
\[\eta_i = \alpha + \sum{\beta_j x_{i,j}}\]
Links the mean response \((\mu_i)\) to the linear predictor \((\eta_i)\): \(g(\mu_i) = \eta_i\)
Inverse link function converts linear predictor to response: \(\mu_i = g^-1(\eta_i)\)
dat <- tibble(x = seq(0,10, by = 0.01),
y = log(x))
ggplot(dat, aes(x,y)) +
geom_line() +
ggtitle("η = log(𝛌)") +
theme(plot.title = element_text(hjust = 0.5)) +
geom_hline(yintercept = 0, linetype = 2) +
geom_vline(xintercept = 0, linetype = 2) +
xlab("𝛌 - mean response") +
ylab("η - linear predictor")
dat <- tibble(x = seq(-2,4, by = 0.01),
y = exp(x))
ggplot(dat, aes(x,y)) +
geom_line() +
ggtitle("𝛌 = exp(η)") +
theme(plot.title = element_text(hjust = 0.5)) +
geom_hline(yintercept = 0, linetype = 2) +
geom_vline(xintercept = 0, linetype = 2) +
xlab("η - linear predictor" ) +
ylab("𝛌 - mean response")
dat <- tibble(p = seq(0,1, by = 0.001),
y = log(p / (1 - p)))
ggplot(dat, aes(p,y)) +
geom_line() +
ggtitle("η = logit(p) = log(p / (1 - p))") +
theme(plot.title = element_text(hjust = 0.5)) +
geom_hline(yintercept = 0, linetype = 2) +
geom_vline(xintercept = 0, linetype = 2) +
xlab("p - mean response") +
ylab("η - linear predictor")
dat <- tibble(y = seq(-5,5, by = 0.01),
p = exp(y)/(1 + exp(y)))
ggplot(dat, aes(y,p)) +
geom_line() +
ggtitle("p = exp(η)/(1 + exp(η))") +
theme(plot.title = element_text(hjust = 0.5)) +
geom_hline(yintercept = 0, linetype = 2) +
geom_vline(xintercept = 0, linetype = 2) +
xlab("η - linear predictor" ) +
ylab("p - mean response")
RDataset plant species diversity and biomass
Species)Poisson GLM
glm() function -> similar to lmfamily argumentfamily = "poisson"Challenge
Call:
glm(formula = Species ~ Biomass * pH, family = "poisson", data = specdat)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.95255 0.08240 35.833 < 2e-16 ***
Biomass -0.26216 0.03803 -6.893 5.47e-12 ***
pHmid 0.48411 0.10723 4.515 6.34e-06 ***
pHhigh 0.81557 0.10284 7.931 2.18e-15 ***
Biomass:pHmid 0.12314 0.04270 2.884 0.003927 **
Biomass:pHhigh 0.15503 0.04003 3.873 0.000108 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 452.346 on 89 degrees of freedom
Residual deviance: 83.201 on 84 degrees of freedom
AIC: 514.39
Number of Fisher Scoring iterations: 4

\[\text{Residual Deviance} = -2 \left[ \log\mathcal{L(\text{fitted model})} - \log\mathcal{L(\text{saturated model}}) \right]\]
\[r_i^D = \text{sign}(y_i - \mu_i)\sqrt{d_i}\]
\[r_i^P = \frac{y_i - \mu_i}{\text{sd}(\mu_i)}\]
RR
residuals(mod1, type = "deviance")residuals(mod1, type = "pearson")?residuals.glm\[ LR = \log\mathcal{L}(mod1) - \log\mathcal{L}(mod2) = \log\left(\frac{\mathcal{L}(mod1)}{\mathcal{L}(mod2)} \right)\]
drop1(mod1, test = "Chi")Single term deletions
Model:
Species ~ Biomass * pH
Df Deviance AIC LRT Pr(>Chi)
<none> 83.201 514.39
Biomass:pH 2 99.242 526.43 16.04 0.0003288 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
<none>Biomass:pHpredict()Important
By default predictions for GLMs are on the link function scale.
Use argument type = "response" for predictions on the scale of the response variable!
| Linear models | Generalized linear models | |
|---|---|---|
| Model fitting | lm() |
|
| Checking model assumptions | autoplot() |
autoplot() |
| Looking at coefficients | summary() |
summary() |
| Model predictions | predict() |
|
| Hypothesis testing | drop1(..., test = "F") |
drop1(..., test = "Chi") |
Tip
For help on specific function use ?autoplot.lm, ?predict.glm, etc.
Generalized linear models