Introduction to Generalized Linear Models (GLMs) – Data types

Day 8

Felix May

Freie Universität Berlin @ Theoretical Ecology

Reproduce slides

library(tidyr)
library(dplyr)
library(ggplot2)
library(forcats)
library(readr)

## theme for ggplot
theme_set(theme_classic())
theme_update(text = element_text(size = 14))

Why Generalized Linear Models?

  • Example 1: Study of density-dependent seed germination
    • 46 pots with 10, 20, … 100 seeds per pot
    • 6 replicates for each density
    • Observation: No. of seedlings in each pot after 10 days

Question: How does the probability of seed germination vary with the density (= seeds per pot)?

Plotting the germination data

Data generation
no_of_seeds <- rep(seq(10, 100, by = 10), each = 6)

a <- 7
b <- -0.12
lin_pred <- a + no_of_seeds*b
pi <- exp(lin_pred)/(1 + exp(lin_pred))

y <- rbinom(length(pi), size = no_of_seeds, prob = pi)

dat1 <- tibble(no_of_seeds,
               no_of_seedlings = y) 
  • Have a look the data
dat1
# A tibble: 60 × 2
   no_of_seeds no_of_seedlings
         <dbl>           <int>
 1          10              10
 2          10               9
 3          10              10
 4          10              10
 5          10              10
 6          10              10
 7          20              20
 8          20              20
 9          20              20
10          20              20
# ℹ 50 more rows
First plot
ggplot(dat1, aes(no_of_seeds, no_of_seedlings)) +
  geom_point(size = 3, col = "skyblue3") +
  xlab("No. of seeds") +
  ylab("No. of germinated seedlings")

Plotting the germination data

  • Calculating the proportion of germinated seeds
dat2 <- dat1 %>%
  mutate(prop_germinated = no_of_seedlings/no_of_seeds)
Second plot
ggplot(dat2, aes(no_of_seeds, prop_germinated)) +
  geom_point(size = 3, col = "skyblue3") +
  xlab("No. of seeds") +
  ylab("Proportion of germinated seeds") +
  ggtitle("Density dependent germination rate")

Naive modelling with lm()

# Fit model
lm1 <- lm(prop_germinated ~ no_of_seeds, data = dat2)

# Form predictions
df_pred <- tibble(no_of_seeds = 10:100) %>%
  mutate(y_pred = predict(lm1, newdata = .))
  • What are the problems here?
    • Data: Proportions –> in the interval 0 – 1
    • Predictions of linear model –> unbounded: <0 and >1
    • Data: non-linear relationship between x and y
    • Model: Linear relationship
Data and model
ggplot(dat2, aes(no_of_seeds, prop_germinated)) +
  geom_point(size = 3, col = "skyblue3") +
  xlab("No. of seeds") +
  ylab("Proportion of germinated seeds") +
  ggtitle("Density dependent germination rate") +
  geom_line(aes(y = y_pred), data = df_pred, color = "red", linewidth = 2)

From linear to generalized linear models

Linear models:

  • Normal distribution of residuals –> unbounded \([- \infty, \infty]\)
  • Linear relationship between X and mean(Y)
  • Constant variance

Our seed germination example

  • Bounded data \([0, 1]\)
  • Non-linear relationship between X and mean(Y)
  • Low variance close to 0 and 1, higher in the middle

The Binomial distribution

  • When we sow 10 seeds – what is the probability of observing a certain no. of seedlings?
    • Assumption: Germination probability = 0.2 (20%) for each seed
  • Described by Binomial distribution for proportion data
    • Discrete distribution – only defined for integer numbers \(\geq 0\)
    • Parameters:
      • N … no. of trials (= seeds)
      • p … “success” probability (germination) in a single trial (= seed)

The probability of getting exactly k successes in N independent Bernoulli trials (with the same success probability p)

Bernoulli trial: Event with two potential outcomes: “success” (p), or “failure” (1 – p)

Explore the Binomial distribution

  • What is the probability of observing 0, 1, 2, … 10 seedlings?
  • How does the probability mass function (PMF) of the Binomial distribution look like?
    • N = 10, p = 0.2
N <- 10
p <- 0.2
no_of_seeds <- 0:N
pmf <- dbinom(no_of_seeds, size = N, prob = p)

binom1 <- tibble(no_of_seeds, pmf)
Code
ggplot(binom1, aes(no_of_seeds, pmf)) +
  geom_col(fill = "skyblue") +
  scale_x_continuous(breaks = 0:N)

Why Generalized Linear Models?

  • Example 2: Forest regeneration in beech forests
    • 50 sampling plots of 2m x 2m size
    • Observation: No. of beech seedlings in each plot

Question: How does the number of beech seedlings vary with soil moisture in the plots?

Exploring the beech seedling data

Data generation
soil_moisture <- runif(n = 50, min = 10, max = 50)

a <- -2
b <- 0.1
lin_pred <- a + soil_moisture*b
lambda <- exp(lin_pred)

y <- rpois(length(lambda), lambda = lambda)

dat1 <- tibble(soil_moisture,
               no_of_seedlings = y) 
  • Have a look the data
dat1
# A tibble: 50 × 2
   soil_moisture no_of_seedlings
           <dbl>           <int>
 1          20.1               4
 2          41.2               7
 3          28.3               3
 4          39.4               4
 5          12.6               1
 6          31.0               2
 7          34.4               5
 8          49.9              22
 9          19.8               0
10          33.0               3
# ℹ 40 more rows
Figure code
ggplot(dat1, aes(soil_moisture, no_of_seedlings)) +
  geom_point(size = 3, col = "seagreen") +
  xlab("Soil moisture") +
  ylab("No. of seedlings")

Naive modelling with lm()

# Fit model
lm1 <- lm(no_of_seedlings ~ soil_moisture, data = dat1)

# Form predictions
df_pred <- tibble(soil_moisture = seq(min(dat1$soil_moisture),
                                      max(dat1$soil_moisture),
                                      length = 100)) %>%
  mutate(y_pred = predict(lm1, newdata = .))
  • What are the problems here?
    • Data: Counts –> always \(> 0\)
    • Predictions of linear model –> unbounded: \(<0\) and \(>1\)
    • Data: non-linear relationship between x and y
    • Model: Linear relationship
Data and model
ggplot(dat1, aes(soil_moisture, no_of_seedlings)) +
  geom_point(size = 3, col = "seagreen") +
  xlab("Soil moisture") +
  ylab("No. of seedlings") +
  geom_line(aes(y = y_pred), data = df_pred, color = "red")

The Poisson distribution for count data

  • When there are 2.5 seedlings on average per plot – what is the probability of observing a 0, 1, 3, … seedlings in a plot?
    • Assumption: On average 2.5 seedlings per 4m².
  • Described by Poisson distribution for count data
    • Discrete distribution – only defined for integer numbers \(\geq 0\)
    • Parameters:
      • \(\lambda\) … mean rate (= mean no. of seedlings per plot)

The probability of a given number of events (= seedlings) in a fixed interval of time or space if these events occur with a constant mean rate.

Explore the Poisson distribution

  • What is the probability of observing 0, 1, 2, … n seedlings?
  • How does the probability mass function (PMF) of the Poisson distribution look like?
    • \(\lambda\) = 2.5
lambda <- 2.5
no_of_seedlings <- 0:10

probability <- dpois(no_of_seeds, lambda = lambda)

poisson1 <- tibble(no_of_seedlings, probability)
ggplot(poisson1, aes(no_of_seedlings, probability)) +
  geom_col(fill = "seagreen") +
  scale_x_continuous(breaks = 0:10)

Exercises

Explore the shapes of the Poisson and binomial distribution!

  • How does the Poisson distribution vary with increasing \(\lambda\)?
  • How does the Binomial distribution vary with changing p (and constant N)?
  • How does the Binomial distribution vary with changing N (and constant p)?