Linear models - Assumptions

Day 5 and 6

Felix Nößler

Freie Universität Berlin @ Theoretical Ecology

January 19, 2024

1 Reproduce slides

library(dplyr)
library(ggplot2)
library(ggfortify)

## theme for ggplot
theme_set(theme_classic())
theme_update(text = element_text(size = 14))

## set seed for reproducibility (random numbers generation)
set.seed(1) 

Plot fitted model

plot_model <- function(data) {
  ggplot(data, aes(x, y)) +
    geom_point(size = 2) +
    geom_ribbon(aes(ymin = y_pred - y_pred_se, ymax = y_pred + y_pred_se),
                alpha = 0.3) +
    geom_line(aes(y = y_pred), color = "red", linewidth = 1)
}

Compare normal distribution with kernel density estimate

normal_vs_kernel_density <- function(df, sd_residual) {
  residual_df <- tibble(
    x = seq(qnorm(0.005, mean = 0, sd = sd_residual), 
            qnorm(0.995, mean = 0, sd = sd_residual), 
            length.out = 100),
    y = dnorm(x, 
        mean = 0, sd = sd_residual)
  )
  
  ggplot() +
    geom_point(aes(residual, y = 0), data = df,
               alpha = 0.5) +
    geom_density(aes(residual, fill = "Kernel density"), data = df, 
                 linewidth = 0.5, alpha = 0.1) +
    geom_ribbon(aes(x, ymax = y, fill = "Normal distribution"), 
                data = residual_df, ymin = 0, 
                linewidth = 0.5, color = "black", alpha = 0.2) +
    labs(fill = "Distribution",
         y = "density")
}

Residual plot

plot_residuals <- function(df, sd_residual) {
  df <- df %>% arrange(residual) %>%
    mutate(residuals_quantiles = 1:n() / n(), 
           theoretical_quantiles = qnorm(residuals_quantiles, 
                                         mean = 0, 
                                         sd = sd_residual))

  ggplot(df, aes(theoretical_quantiles, residual)) + 
    geom_point(size = 2, alpha = 0.5) + 
    geom_abline(slope = 1, intercept = 0, 
                color = "red", linewidth = 1) + 
    labs(x = "theoretical residual", 
         y = "actual residual") +
    coord_fixed()
}

2 Overview about the assumptions of linear models

  • linear relationship between response and explanatory variable
  • residuals should follow a Normal distribution
  • residuals should have a constant variance and no autocorrelation
  • no strong outliers

3 Why do we (should) care about the assumptions?

  • you can fit linear models even if the assumptions are violated, however …
    • the (uncertainty of the) predictions of the model are not reliable
    • hypothesis tests are not reliable
  • it is best to think about your data when you design your study and before collecting the data

4 Residuals should follow a Normal distribution

The residuals \(\epsilon\) of a linear model should follow a Normal distribution with mean 0 and standard deviation of \(\sigma\):

\[ \begin{align} \mu &= \alpha + \beta \cdot x \quad (\text{or: } \hat y) \\ y &= \mu + \epsilon \\ \epsilon &\sim N(0, \sigma) \end{align} \] you can also write the model in the following way:

\[ y \sim N(\alpha + \beta \cdot x, \sigma) \]

4.1 Everything is fine

Code
df1 <- tibble(x = seq(0, 10, length.out = 100),
             y = 0.8 * x + 1 + rnorm(100, sd = 2))
lm1 <- lm(y ~ x, data = df1)

df1 <- df1 %>% 
  mutate(
    y_pred = predict(lm1),
    y_pred_se = predict(lm1, se.fit = T)$se.fit,
    residual = y - y_pred) 
sd_residual1 <- sd(df1$residual)

plot_model(df1)

normal_vs_kernel_density(df1, sd_residual1)

plot_residuals(df1, sd_residual1)

4.2 Residuals with different distribution I

Code
df_other_dist <- tibble(x = seq(0, 10, length.out = 150),
                        y = 0.8 * x + 1 + rexp(150, rate = 2))
lm1 <- lm(y ~ x, data = df_other_dist)

df_other_dist <- df_other_dist %>% 
  mutate(
    y_pred = predict(lm1),
    y_pred_se = predict(lm1, se.fit = T)$se.fit,
    residual = y - y_pred) 
sd_residual_other_dist <- sd(df_other_dist$residual)

plot_model(df_other_dist)

normal_vs_kernel_density(df_other_dist, sd_residual_other_dist)

plot_residuals(df_other_dist, sd_residual_other_dist)

The noise of the data is generated here with an Exponential distribution. Therefore the distribution of the residuals is strongly right skewed.

4.3 Residuals with different distribution II

Code
df_other_dist2 <- tibble(x = seq(0, 10, length.out = 150),
                        y = 0.8 * x + 1 + rt(150, df = 2))
lm1 <- lm(y ~ x, data = df_other_dist2)

df_other_dist2 <- df_other_dist2 %>% 
  mutate(
    y_pred = predict(lm1),
    y_pred_se = predict(lm1, se.fit = T)$se.fit,
    residual = y - y_pred) 
sd_residual_other_dist2 <- sd(df_other_dist2$residual)

plot_model(df_other_dist2)

normal_vs_kernel_density(df_other_dist2, sd_residual_other_dist2)

plot_residuals(df_other_dist2, sd_residual_other_dist2)

The noise of the data is generated here with a Student’s t-distribution with 2 degrees of freedom. Therefore, unlike most data points, we have a few points that are quite far away from the mean (too heavy tails). The points look like outliers.

4.4 (Too) few data points

Code
df_few <- tibble(x = seq(0, 10, length.out = 10),
             y = 0.8 * x + 1 + rnorm(10, sd = 2))
lm1 <- lm(y ~ x, data = df_few)

df_few <- df_few %>% 
  mutate(
    y_pred = predict(lm1),
    y_pred_se = predict(lm1, se.fit = T)$se.fit,
    residual = y - y_pred) 
sd_residual_few <- sd(df_few$residual)

plot_model(df_few)

normal_vs_kernel_density(df_few, sd_residual_few)

plot_residuals(df_few, sd_residual_few)

Even if the residuals are normally distributed in this example, it is often hard to tell if the residuals are normally distributed if you have only a few data points. You should try to get more data points, when you want to use a linear model.

5 Constant variance and no autocorrelation of residuals

Recall: \[ \begin{align} \mu &= \alpha + \beta \cdot x \quad (\text{or: } \hat y) \\ y &= \mu + \epsilon \\ \epsilon &\sim N(0, \sigma) \end{align} \]\(\sigma\) is not dependent on \(x\) (homoscedasticity).

  • If \(\sigma\) is dependent on \(x\), we have heteroscedasticity.
  • no autocorrelation: the residuals are not dependent on each other.

5.1 Homoscedasticity (constant variance of residuals)

Code
df_homo <- tibble(x = seq(0, 10, length.out = 100),
                    y = 0.8 * x + 1 + rnorm(100, sd = 2))
lm1 <- lm(y ~ x, data = df_homo)

df_homo <- df_homo %>% 
  mutate(
    y_pred = predict(lm1),
    y_pred_se = predict(lm1, se.fit = T)$se.fit,
    residual = y - y_pred) 

plot_model(df_homo)

Code
ggplot(df_homo, aes(x = y_pred, y = residual)) +
  geom_hline(yintercept = 0, color = "red", linewidth = 1) +
  geom_segment(aes(xend = y_pred, yend = 0), color = "black", alpha = 0.5) +
  geom_point() +
  labs(x = "predicted y", y = "residuals")

The variance of the residuals is independent of \(\hat y\), everything is fine.

5.2 Hetereoscedasticity I

Code
df_hetero <- tibble(x = seq(0, 10, length.out = 100),
                    y = 0.8 * x + 1 + rnorm(100, sd = 2 * x))
lm1 <- lm(y ~ x, data = df_hetero)

df_hetero <- df_hetero %>% 
  mutate(
    y_pred = predict(lm1),
    y_pred_se = predict(lm1, se.fit = T)$se.fit,
    residual = y - y_pred) 

plot_model(df_hetero)

Code
ggplot(df_hetero, aes(x = y_pred, y = residual)) +
  geom_hline(yintercept = 0, color = "red", linewidth = 1) +
  geom_segment(aes(xend = y_pred, yend = 0), color = "black", alpha = 0.5) +
  geom_point() +
  labs(x = "predicted y", y = "residuals")

The variance of the residuals is dependent on \(\hat y\). This violates the assumption of homoscedasticity.

5.3 Hetereoscedasticity II

Code
df_hetero2 <- tibble(x = seq(0, 10, length.out = 100),
                    y = rpois(100, 5 * x + 1))
lm1 <- lm(y ~ x, data = df_hetero2)

df_hetero2 <- df_hetero2 %>% 
  mutate(
    y_pred = predict(lm1),
    y_pred_se = predict(lm1, se.fit = T)$se.fit,
    residual = y - y_pred) 

plot_model(df_hetero2)

Code
ggplot(df_hetero2, aes(x = y_pred, y = residual)) +
  geom_hline(yintercept = 0, color = "red", linewidth = 1) +
  geom_segment(aes(xend = y_pred, yend = 0), color = "black", alpha = 0.5) +
  geom_point() +
  labs(x = "predicted y", y = "residuals")

The variance of the residuals is dependent on \(\hat y\). This violates the assumption of homoscedasticity.

To generate this example, we used a Poisson error distribution for \(y\). Therefore, using a GLM with a Poisson error distribution would be a better choice. When we use the Poisson error distribution, it is directly assumed that the variance of the residuals is dependent on \(x\). It is important to model this explicitly, because otherwise the uncertainty of your model predictions are not correct.

5.4 Autocorrelation of residuals

Code
df_auto <- tibble(x = seq(0, 10, length.out = 100),
                  y = rnorm(100, mean = 5 * x, sd = 15))

## we add here some autocorrelation
for (i in 2:nrow(df_auto)) {
  df_auto$y[i] <- df_auto$y[i] + 0.9 * df_auto$y[i - 1]
}


lm1 <- lm(y ~ x, data = df_auto)

df_auto <- df_auto %>% 
  mutate(
    y_pred = predict(lm1),
    y_pred_se = predict(lm1, se.fit = T)$se.fit,
    residual = y - y_pred) 

plot_model(df_auto)

Code
ggplot(df_auto, aes(x = y_pred, y = residual)) +
  geom_hline(yintercept = 0, color = "red", linewidth = 1) +
  geom_segment(aes(xend = y_pred, yend = 0), color = "black", alpha = 0.5) +
  geom_point() +
  labs(x = "predicted y", y = "residuals")

The residuals are autocorrelated (you can see a pattern in the residuals). Here, it is better to use a time series model. A time series model can take into account the autocorrelation of the residuals.

6 No strong outliers

  • easiest option is to check visually for outliers
  • leverage measures how far away the values of the independent variables of one observation are from the values of the other observations

7 Convenience function in R

df <- tibble(x = seq(0, 10, length.out = 100),
             y = rnorm(100, mean = 0.8 * x + 1, sd = 4))
lm1 <- lm(y ~ x, data = df)
autoplot(lm1, which = 1:2)

8 What to do if you notice violations?

  • residuals are not normally distributed:
    • transform the response variable \(y\) (see day 7)
    • use generalized linear models (GLM, see day 8)
    • maybe your measured \(x\) variables cannot explain enough variance in \(y\)?
  • no constant variance of residuals:
    • some problems are already resolved if you do transformation / use a GLM
    • it is possible to model \(\sigma\) dependent on \(x\)
  • residuals are not independent (are autocorrelated):
    • maybe you have to look for a different model, e.g. time series models
  • strong outliers:
    • check if there is a data issue, e.g. a typo
    • it is possible to use instead of the Normal error distribution the Student’s t-distribution (“robust” regression), the Student’s t-distribution has an additional parameter that controls how heavy the tails are