Linear models - Assumptions

Day 5 and 6

Felix Nößler

Freie Universität Berlin @ Theoretical Ecology

January 19, 2024

1 Reproduce slides

library(dplyr)
library(ggplot2)
library(ggfortify)

## theme for ggplot
theme_set(theme_classic())
theme_update(text = element_text(size = 14))

## set seed for reproducibility (random numbers generation)
set.seed(1)

Plot fitted model

plot_model <- function(data) {
  ggplot(data, aes(x, y)) +
    geom_point(size = 2) +
    geom_ribbon(aes(ymin = y_pred - y_pred_se, ymax = y_pred + y_pred_se),
                alpha = 0.3) +
    geom_line(aes(y = y_pred), color = "red", linewidth = 1)
}

Compare normal distribution with kernel density estimate

normal_vs_kernel_density <- function(df, sd_residual) {
  residual_df <- tibble(
    x = seq(qnorm(0.005, mean = 0, sd = sd_residual), 
            qnorm(0.995, mean = 0, sd = sd_residual), 
            length.out = 100),
    y = dnorm(x, 
        mean = 0, sd = sd_residual)
  )
  
  ggplot() +
    geom_point(aes(residual, y = 0), data = df,
               alpha = 0.5) +
    geom_density(aes(residual, fill = "Kernel density"), data = df, 
                 linewidth = 0.5, alpha = 0.1) +
    geom_ribbon(aes(x, ymax = y, fill = "Normal distribution"), 
                data = residual_df, ymin = 0, 
                linewidth = 0.5, color = "black", alpha = 0.2) +
    labs(fill = "Distribution",
         y = "density")
}

Residual plot

plot_residuals <- function(df, sd_residual) {
  df <- df %>% arrange(residual) %>%
    mutate(residuals_quantiles = 1:n() / n(), 
           theoretical_quantiles = qnorm(residuals_quantiles, 
                                         mean = 0, 
                                         sd = sd_residual))

  ggplot(df, aes(theoretical_quantiles, residual)) + 
    geom_point(size = 2, alpha = 0.5) + 
    geom_abline(slope = 1, intercept = 0, 
                color = "red", linewidth = 1) + 
    labs(x = "theoretical residual", 
         y = "actual residual") +
    coord_fixed()
}

2 Overview about the assumptions of linear models

linear relationship between response and explanatory variable
residuals should follow a Normal distribution
residuals should have a constant variance and no autocorrelation
no strong outliers

3 Why do we (should) care about the assumptions?

you can fit linear models even if the assumptions are violated, however …
- the (uncertainty of the) predictions of the model are not reliable
- hypothesis tests are not reliable
it is best to think about your data when you design your study and before collecting the data

4 Residuals should follow a Normal distribution

The residuals \(\epsilon\) of a linear model should follow a Normal distribution with mean 0 and standard deviation of \(\sigma\):

\[ \begin{align} \mu &= \alpha + \beta \cdot x \quad (\text{or: } \hat y) \\ y &= \mu + \epsilon \\ \epsilon &\sim N(0, \sigma) \end{align} \] you can also write the model in the following way:

\[ y \sim N(\alpha + \beta \cdot x, \sigma) \]

4.1 Everything is fine

Fitted model
Kernel density estimate vs Normal distribution
QQ-Plot

Code

df1 <- tibble(x = seq(0, 10, length.out = 100),
             y = 0.8 * x + 1 + rnorm(100, sd = 2))
lm1 <- lm(y ~ x, data = df1)

df1 <- df1 %>% 
  mutate(
    y_pred = predict(lm1),
    y_pred_se = predict(lm1, se.fit = T)$se.fit,
    residual = y - y_pred) 
sd_residual1 <- sd(df1$residual)

plot_model(df1)

normal_vs_kernel_density(df1, sd_residual1)

plot_residuals(df1, sd_residual1)

4.2 Residuals with different distribution I

Fitted model
Kernel density estimate vs Normal distribution
QQ-Plot
Explanation

Code

df_other_dist <- tibble(x = seq(0, 10, length.out = 150),
                        y = 0.8 * x + 1 + rexp(150, rate = 2))
lm1 <- lm(y ~ x, data = df_other_dist)

df_other_dist <- df_other_dist %>% 
  mutate(
    y_pred = predict(lm1),
    y_pred_se = predict(lm1, se.fit = T)$se.fit,
    residual = y - y_pred) 
sd_residual_other_dist <- sd(df_other_dist$residual)

plot_model(df_other_dist)

normal_vs_kernel_density(df_other_dist, sd_residual_other_dist)

plot_residuals(df_other_dist, sd_residual_other_dist)

The noise of the data is generated here with an Exponential distribution. Therefore the distribution of the residuals is strongly right skewed.

4.3 Residuals with different distribution II

Fitted model
Kernel density estimate vs Normal distribution
QQ-Plot
Explanation

Code

df_other_dist2 <- tibble(x = seq(0, 10, length.out = 150),
                        y = 0.8 * x + 1 + rt(150, df = 2))
lm1 <- lm(y ~ x, data = df_other_dist2)

df_other_dist2 <- df_other_dist2 %>% 
  mutate(
    y_pred = predict(lm1),
    y_pred_se = predict(lm1, se.fit = T)$se.fit,
    residual = y - y_pred) 
sd_residual_other_dist2 <- sd(df_other_dist2$residual)

plot_model(df_other_dist2)

normal_vs_kernel_density(df_other_dist2, sd_residual_other_dist2)

plot_residuals(df_other_dist2, sd_residual_other_dist2)

The noise of the data is generated here with a Student’s t-distribution with 2 degrees of freedom. Therefore, unlike most data points, we have a few points that are quite far away from the mean (too heavy tails). The points look like outliers.

4.4 (Too) few data points

Fitted model
Kernel density estimate vs Normal distribution
QQ-Plot
Explanation

Code

df_few <- tibble(x = seq(0, 10, length.out = 10),
             y = 0.8 * x + 1 + rnorm(10, sd = 2))
lm1 <- lm(y ~ x, data = df_few)

df_few <- df_few %>% 
  mutate(
    y_pred = predict(lm1),
    y_pred_se = predict(lm1, se.fit = T)$se.fit,
    residual = y - y_pred) 
sd_residual_few <- sd(df_few$residual)

plot_model(df_few)

normal_vs_kernel_density(df_few, sd_residual_few)

plot_residuals(df_few, sd_residual_few)

Even if the residuals are normally distributed in this example, it is often hard to tell if the residuals are normally distributed if you have only a few data points. You should try to get more data points, when you want to use a linear model.

5 Constant variance and no autocorrelation of residuals

Recall: \[ \begin{align} \mu &= \alpha + \beta \cdot x \quad (\text{or: } \hat y) \\ y &= \mu + \epsilon \\ \epsilon &\sim N(0, \sigma) \end{align} \] → \(\sigma\) is not dependent on \(x\) (homoscedasticity).

If \(\sigma\) is dependent on \(x\), we have heteroscedasticity.
no autocorrelation: the residuals are not dependent on each other.

5.1 Homoscedasticity (constant variance of residuals)

Fitted model
Residuals as a function of \(\hat y\)
Explanation

Code

df_homo <- tibble(x = seq(0, 10, length.out = 100),
                    y = 0.8 * x + 1 + rnorm(100, sd = 2))
lm1 <- lm(y ~ x, data = df_homo)

df_homo <- df_homo %>% 
  mutate(
    y_pred = predict(lm1),
    y_pred_se = predict(lm1, se.fit = T)$se.fit,
    residual = y - y_pred) 

plot_model(df_homo)

Code

ggplot(df_homo, aes(x = y_pred, y = residual)) +
  geom_hline(yintercept = 0, color = "red", linewidth = 1) +
  geom_segment(aes(xend = y_pred, yend = 0), color = "black", alpha = 0.5) +
  geom_point() +
  labs(x = "predicted y", y = "residuals")

The variance of the residuals is independent of \(\hat y\), everything is fine.

Code

df_hetero <- tibble(x = seq(0, 10, length.out = 100),
                    y = 0.8 * x + 1 + rnorm(100, sd = 2 * x))
lm1 <- lm(y ~ x, data = df_hetero)

df_hetero <- df_hetero %>% 
  mutate(
    y_pred = predict(lm1),
    y_pred_se = predict(lm1, se.fit = T)$se.fit,
    residual = y - y_pred) 

plot_model(df_hetero)

Code

ggplot(df_hetero, aes(x = y_pred, y = residual)) +
  geom_hline(yintercept = 0, color = "red", linewidth = 1) +
  geom_segment(aes(xend = y_pred, yend = 0), color = "black", alpha = 0.5) +
  geom_point() +
  labs(x = "predicted y", y = "residuals")

The variance of the residuals is dependent on \(\hat y\). This violates the assumption of homoscedasticity.

5.3 Hetereoscedasticity II

Fitted model
Residuals as a function of \(\hat y\)
Explanation

Code

df_hetero2 <- tibble(x = seq(0, 10, length.out = 100),
                    y = rpois(100, 5 * x + 1))
lm1 <- lm(y ~ x, data = df_hetero2)

df_hetero2 <- df_hetero2 %>% 
  mutate(
    y_pred = predict(lm1),
    y_pred_se = predict(lm1, se.fit = T)$se.fit,
    residual = y - y_pred) 

plot_model(df_hetero2)

Code

ggplot(df_hetero2, aes(x = y_pred, y = residual)) +
  geom_hline(yintercept = 0, color = "red", linewidth = 1) +
  geom_segment(aes(xend = y_pred, yend = 0), color = "black", alpha = 0.5) +
  geom_point() +
  labs(x = "predicted y", y = "residuals")

The variance of the residuals is dependent on \(\hat y\). This violates the assumption of homoscedasticity.

To generate this example, we used a Poisson error distribution for \(y\). Therefore, using a GLM with a Poisson error distribution would be a better choice. When we use the Poisson error distribution, it is directly assumed that the variance of the residuals is dependent on \(x\). It is important to model this explicitly, because otherwise the uncertainty of your model predictions are not correct.

5.4 Autocorrelation of residuals

Fitted model
Residuals as a function of \(\hat y\)
Explanation

Code

df_auto <- tibble(x = seq(0, 10, length.out = 100),
                  y = rnorm(100, mean = 5 * x, sd = 15))

## we add here some autocorrelation
for (i in 2:nrow(df_auto)) {
  df_auto$y[i] <- df_auto$y[i] + 0.9 * df_auto$y[i - 1]
}


lm1 <- lm(y ~ x, data = df_auto)

df_auto <- df_auto %>% 
  mutate(
    y_pred = predict(lm1),
    y_pred_se = predict(lm1, se.fit = T)$se.fit,
    residual = y - y_pred) 

plot_model(df_auto)

Code

ggplot(df_auto, aes(x = y_pred, y = residual)) +
  geom_hline(yintercept = 0, color = "red", linewidth = 1) +
  geom_segment(aes(xend = y_pred, yend = 0), color = "black", alpha = 0.5) +
  geom_point() +
  labs(x = "predicted y", y = "residuals")

The residuals are autocorrelated (you can see a pattern in the residuals). Here, it is better to use a time series model. A time series model can take into account the autocorrelation of the residuals.

6 No strong outliers

easiest option is to check visually for outliers
leverage measures how far away the values of the independent variables of one observation are from the values of the other observations

7 Convenience function in `R`

df <- tibble(x = seq(0, 10, length.out = 100),
             y = rnorm(100, mean = 0.8 * x + 1, sd = 4))
lm1 <- lm(y ~ x, data = df)
autoplot(lm1, which = 1:2)

8 What to do if you notice violations?

residuals are not normally distributed:
- transform the response variable \(y\) (see day 7)
- use generalized linear models (GLM, see day 8)
- maybe your measured \(x\) variables cannot explain enough variance in \(y\)?

no constant variance of residuals:
- some problems are already resolved if you do transformation / use a GLM
- it is possible to model \(\sigma\) dependent on \(x\)

residuals are not independent (are autocorrelated):
- maybe you have to look for a different model, e.g. time series models

strong outliers:
- check if there is a data issue, e.g. a typo
- it is possible to use instead of the Normal error distribution the Student’s t-distribution (“robust” regression), the Student’s t-distribution has an additional parameter that controls how heavy the tails are

Linear models - Assumptions

1 Reproduce slides

Plot fitted model

Compare normal distribution with kernel density estimate

Residual plot

2 Overview about the assumptions of linear models

3 Why do we (should) care about the assumptions?

4 Residuals should follow a Normal distribution

4.1 Everything is fine

4.2 Residuals with different distribution I

4.3 Residuals with different distribution II

4.4 (Too) few data points

5 Constant variance and no autocorrelation of residuals

5.1 Homoscedasticity (constant variance of residuals)

5.2 Hetereoscedasticity I

5.3 Hetereoscedasticity II

5.4 Autocorrelation of residuals

6 No strong outliers

7 Convenience function in R

8 What to do if you notice violations?

7 Convenience function in `R`