Introduction to Generalized Linear Models (GLMs) – Data types

Day 8

Felix May

Freie Universität Berlin @ Theoretical Ecology

Reproduce slides

library(tidyr)
library(dplyr)
library(ggplot2)
library(forcats)
library(readr)

## theme for ggplot
theme_set(theme_classic())
theme_update(text = element_text(size = 14))

Why Generalized Linear Models?

Example 1: Study of density-dependent seed germination
- 46 pots with 10, 20, … 100 seeds per pot
- 6 replicates for each density
- Observation: No. of seedlings in each pot after 10 days

Question: How does the probability of seed germination vary with the density (= seeds per pot)?

Plotting the germination data

Data generation

no_of_seeds <- rep(seq(10, 100, by = 10), each = 6)

a <- 7
b <- -0.12
lin_pred <- a + no_of_seeds*b
pi <- exp(lin_pred)/(1 + exp(lin_pred))

y <- rbinom(length(pi), size = no_of_seeds, prob = pi)

dat1 <- tibble(no_of_seeds,
               no_of_seedlings = y)

Have a look the data

dat1

# A tibble: 60 × 2
   no_of_seeds no_of_seedlings
         <dbl>           <int>
 1          10              10
 2          10               9
 3          10              10
 4          10              10
 5          10              10
 6          10              10
 7          20              20
 8          20              20
 9          20              20
10          20              20
# ℹ 50 more rows

First plot

ggplot(dat1, aes(no_of_seeds, no_of_seedlings)) +
  geom_point(size = 3, col = "skyblue3") +
  xlab("No. of seeds") +
  ylab("No. of germinated seedlings")

Plotting the germination data

Calculating the proportion of germinated seeds

dat2 <- dat1 %>%
  mutate(prop_germinated = no_of_seedlings/no_of_seeds)

Second plot

ggplot(dat2, aes(no_of_seeds, prop_germinated)) +
  geom_point(size = 3, col = "skyblue3") +
  xlab("No. of seeds") +
  ylab("Proportion of germinated seeds") +
  ggtitle("Density dependent germination rate")

Naive modelling with `lm()`

# Fit model
lm1 <- lm(prop_germinated ~ no_of_seeds, data = dat2)

# Form predictions
df_pred <- tibble(no_of_seeds = 10:100) %>%
  mutate(y_pred = predict(lm1, newdata = .))

What are the problems here?
- Data: Proportions –> in the interval 0 – 1
- Predictions of linear model –> unbounded: <0 and >1
- Data: non-linear relationship between x and y
- Model: Linear relationship

Data and model

ggplot(dat2, aes(no_of_seeds, prop_germinated)) +
  geom_point(size = 3, col = "skyblue3") +
  xlab("No. of seeds") +
  ylab("Proportion of germinated seeds") +
  ggtitle("Density dependent germination rate") +
  geom_line(aes(y = y_pred), data = df_pred, color = "red", linewidth = 2)

From linear to generalized linear models

Linear models:

Normal distribution of residuals –> unbounded \([- \infty, \infty]\)
Linear relationship between X and mean(Y)
Constant variance

Our seed germination example

Bounded data \([0, 1]\)
Non-linear relationship between X and mean(Y)
Low variance close to 0 and 1, higher in the middle

The Binomial distribution

When we sow 10 seeds – what is the probability of observing a certain no. of seedlings?
- Assumption: Germination probability = 0.2 (20%) for each seed
Described by Binomial distribution for proportion data
- Discrete distribution – only defined for integer numbers \(\geq 0\)
- Parameters:
  - N … no. of trials (= seeds)
  - p … “success” probability (germination) in a single trial (= seed)

The probability of getting exactly k successes in N independent Bernoulli trials (with the same success probability p)

Bernoulli trial: Event with two potential outcomes: “success” (p), or “failure” (1 – p)

Explore the Binomial distribution

What is the probability of observing 0, 1, 2, … 10 seedlings?
How does the probability mass function (PMF) of the Binomial distribution look like?
- N = 10, p = 0.2

N <- 10
p <- 0.2

no_of_seeds <- 0:N
pmf <- dbinom(no_of_seeds, size = N, prob = p)

binom1 <- tibble(no_of_seeds, pmf)

Code

ggplot(binom1, aes(no_of_seeds, pmf)) +
  geom_col(fill = "skyblue") +
  scale_x_continuous(breaks = 0:N)

Why Generalized Linear Models?

Example 2: Forest regeneration in beech forests
- 50 sampling plots of 2m x 2m size
- Observation: No. of beech seedlings in each plot

Question: How does the number of beech seedlings vary with soil moisture in the plots?

Exploring the beech seedling data

Data generation

soil_moisture <- runif(n = 50, min = 10, max = 50)

a <- -2
b <- 0.1
lin_pred <- a + soil_moisture*b
lambda <- exp(lin_pred)

y <- rpois(length(lambda), lambda = lambda)

dat1 <- tibble(soil_moisture,
               no_of_seedlings = y)

Have a look the data

dat1

# A tibble: 50 × 2
   soil_moisture no_of_seedlings
           <dbl>           <int>
 1          20.1               4
 2          41.2               7
 3          28.3               3
 4          39.4               4
 5          12.6               1
 6          31.0               2
 7          34.4               5
 8          49.9              22
 9          19.8               0
10          33.0               3
# ℹ 40 more rows

Figure code

ggplot(dat1, aes(soil_moisture, no_of_seedlings)) +
  geom_point(size = 3, col = "seagreen") +
  xlab("Soil moisture") +
  ylab("No. of seedlings")

Naive modelling with `lm()`

# Fit model
lm1 <- lm(no_of_seedlings ~ soil_moisture, data = dat1)

# Form predictions
df_pred <- tibble(soil_moisture = seq(min(dat1$soil_moisture),
                                      max(dat1$soil_moisture),
                                      length = 100)) %>%
  mutate(y_pred = predict(lm1, newdata = .))

What are the problems here?
- Data: Counts –> always \(> 0\)
- Predictions of linear model –> unbounded: \(<0\) and \(>1\)
- Data: non-linear relationship between x and y
- Model: Linear relationship

Data and model

ggplot(dat1, aes(soil_moisture, no_of_seedlings)) +
  geom_point(size = 3, col = "seagreen") +
  xlab("Soil moisture") +
  ylab("No. of seedlings") +
  geom_line(aes(y = y_pred), data = df_pred, color = "red")

The Poisson distribution for count data

When there are 2.5 seedlings on average per plot – what is the probability of observing a 0, 1, 3, … seedlings in a plot?
- Assumption: On average 2.5 seedlings per 4m².
Described by Poisson distribution for count data
- Discrete distribution – only defined for integer numbers \(\geq 0\)
- Parameters:
  - \(\lambda\) … mean rate (= mean no. of seedlings per plot)

The probability of a given number of events (= seedlings) in a fixed interval of time or space if these events occur with a constant mean rate.

Explore the Poisson distribution

What is the probability of observing 0, 1, 2, … n seedlings?
How does the probability mass function (PMF) of the Poisson distribution look like?
- \(\lambda\) = 2.5

lambda <- 2.5
no_of_seedlings <- 0:10

probability <- dpois(no_of_seeds, lambda = lambda)

poisson1 <- tibble(no_of_seedlings, probability)

ggplot(poisson1, aes(no_of_seedlings, probability)) +
  geom_col(fill = "seagreen") +
  scale_x_continuous(breaks = 0:10)

Exercises

Explore the shapes of the Poisson and binomial distribution!

How does the Poisson distribution vary with increasing \(\lambda\)?
How does the Binomial distribution vary with changing p (and constant N)?
How does the Binomial distribution vary with changing N (and constant p)?

Introduction to Generalized Linear Models (GLMs) – Data types

Reproduce slides

Why Generalized Linear Models?

Plotting the germination data

Plotting the germination data

Naive modelling with lm()

From linear to generalized linear models

Linear models:

Our seed germination example

The Binomial distribution

Explore the Binomial distribution

Why Generalized Linear Models?

Exploring the beech seedling data

Naive modelling with lm()

The Poisson distribution for count data

Explore the Poisson distribution

Exercises

Naive modelling with `lm()`

Naive modelling with `lm()`