library(dplyr)
library(ggplot2)
library(ggfortify)
## theme for ggplot
theme_set(theme_classic())
theme_update(text = element_text(size = 14))
## set seed for reproducibility (random numbers generation)
set.seed(1) Day 5 and 6
Freie Universität Berlin @ Theoretical Ecology
January 19, 2024
library(dplyr)
library(ggplot2)
library(ggfortify)
## theme for ggplot
theme_set(theme_classic())
theme_update(text = element_text(size = 14))
## set seed for reproducibility (random numbers generation)
set.seed(1) plot_model <- function(data) {
ggplot(data, aes(x, y)) +
geom_point(size = 2) +
geom_ribbon(aes(ymin = y_pred - y_pred_se, ymax = y_pred + y_pred_se),
alpha = 0.3) +
geom_line(aes(y = y_pred), color = "red", linewidth = 1)
}normal_vs_kernel_density <- function(df, sd_residual) {
residual_df <- tibble(
x = seq(qnorm(0.005, mean = 0, sd = sd_residual),
qnorm(0.995, mean = 0, sd = sd_residual),
length.out = 100),
y = dnorm(x,
mean = 0, sd = sd_residual)
)
ggplot() +
geom_point(aes(residual, y = 0), data = df,
alpha = 0.5) +
geom_density(aes(residual, fill = "Kernel density"), data = df,
linewidth = 0.5, alpha = 0.1) +
geom_ribbon(aes(x, ymax = y, fill = "Normal distribution"),
data = residual_df, ymin = 0,
linewidth = 0.5, color = "black", alpha = 0.2) +
labs(fill = "Distribution",
y = "density")
}plot_residuals <- function(df, sd_residual) {
df <- df %>% arrange(residual) %>%
mutate(residuals_quantiles = 1:n() / n(),
theoretical_quantiles = qnorm(residuals_quantiles,
mean = 0,
sd = sd_residual))
ggplot(df, aes(theoretical_quantiles, residual)) +
geom_point(size = 2, alpha = 0.5) +
geom_abline(slope = 1, intercept = 0,
color = "red", linewidth = 1) +
labs(x = "theoretical residual",
y = "actual residual") +
coord_fixed()
}The residuals \(\epsilon\) of a linear model should follow a Normal distribution with mean 0 and standard deviation of \(\sigma\):
\[ \begin{align} \mu &= \alpha + \beta \cdot x \quad (\text{or: } \hat y) \\ y &= \mu + \epsilon \\ \epsilon &\sim N(0, \sigma) \end{align} \] you can also write the model in the following way:
\[ y \sim N(\alpha + \beta \cdot x, \sigma) \]
normal_vs_kernel_density(df1, sd_residual1)plot_residuals(df1, sd_residual1)df_other_dist <- tibble(x = seq(0, 10, length.out = 150),
y = 0.8 * x + 1 + rexp(150, rate = 2))
lm1 <- lm(y ~ x, data = df_other_dist)
df_other_dist <- df_other_dist %>%
mutate(
y_pred = predict(lm1),
y_pred_se = predict(lm1, se.fit = T)$se.fit,
residual = y - y_pred)
sd_residual_other_dist <- sd(df_other_dist$residual)
plot_model(df_other_dist)normal_vs_kernel_density(df_other_dist, sd_residual_other_dist)plot_residuals(df_other_dist, sd_residual_other_dist)The noise of the data is generated here with an Exponential distribution. Therefore the distribution of the residuals is strongly right skewed.
df_other_dist2 <- tibble(x = seq(0, 10, length.out = 150),
y = 0.8 * x + 1 + rt(150, df = 2))
lm1 <- lm(y ~ x, data = df_other_dist2)
df_other_dist2 <- df_other_dist2 %>%
mutate(
y_pred = predict(lm1),
y_pred_se = predict(lm1, se.fit = T)$se.fit,
residual = y - y_pred)
sd_residual_other_dist2 <- sd(df_other_dist2$residual)
plot_model(df_other_dist2)normal_vs_kernel_density(df_other_dist2, sd_residual_other_dist2)plot_residuals(df_other_dist2, sd_residual_other_dist2)The noise of the data is generated here with a Student’s t-distribution with 2 degrees of freedom. Therefore, unlike most data points, we have a few points that are quite far away from the mean (too heavy tails). The points look like outliers.
normal_vs_kernel_density(df_few, sd_residual_few)plot_residuals(df_few, sd_residual_few)Even if the residuals are normally distributed in this example, it is often hard to tell if the residuals are normally distributed if you have only a few data points. You should try to get more data points, when you want to use a linear model.
Recall: \[ \begin{align} \mu &= \alpha + \beta \cdot x \quad (\text{or: } \hat y) \\ y &= \mu + \epsilon \\ \epsilon &\sim N(0, \sigma) \end{align} \] → \(\sigma\) is not dependent on \(x\) (homoscedasticity).
ggplot(df_homo, aes(x = y_pred, y = residual)) +
geom_hline(yintercept = 0, color = "red", linewidth = 1) +
geom_segment(aes(xend = y_pred, yend = 0), color = "black", alpha = 0.5) +
geom_point() +
labs(x = "predicted y", y = "residuals")The variance of the residuals is independent of \(\hat y\), everything is fine.
ggplot(df_hetero, aes(x = y_pred, y = residual)) +
geom_hline(yintercept = 0, color = "red", linewidth = 1) +
geom_segment(aes(xend = y_pred, yend = 0), color = "black", alpha = 0.5) +
geom_point() +
labs(x = "predicted y", y = "residuals")The variance of the residuals is dependent on \(\hat y\). This violates the assumption of homoscedasticity.
ggplot(df_hetero2, aes(x = y_pred, y = residual)) +
geom_hline(yintercept = 0, color = "red", linewidth = 1) +
geom_segment(aes(xend = y_pred, yend = 0), color = "black", alpha = 0.5) +
geom_point() +
labs(x = "predicted y", y = "residuals")The variance of the residuals is dependent on \(\hat y\). This violates the assumption of homoscedasticity.
To generate this example, we used a Poisson error distribution for \(y\). Therefore, using a GLM with a Poisson error distribution would be a better choice. When we use the Poisson error distribution, it is directly assumed that the variance of the residuals is dependent on \(x\). It is important to model this explicitly, because otherwise the uncertainty of your model predictions are not correct.
df_auto <- tibble(x = seq(0, 10, length.out = 100),
y = rnorm(100, mean = 5 * x, sd = 15))
## we add here some autocorrelation
for (i in 2:nrow(df_auto)) {
df_auto$y[i] <- df_auto$y[i] + 0.9 * df_auto$y[i - 1]
}
lm1 <- lm(y ~ x, data = df_auto)
df_auto <- df_auto %>%
mutate(
y_pred = predict(lm1),
y_pred_se = predict(lm1, se.fit = T)$se.fit,
residual = y - y_pred)
plot_model(df_auto)ggplot(df_auto, aes(x = y_pred, y = residual)) +
geom_hline(yintercept = 0, color = "red", linewidth = 1) +
geom_segment(aes(xend = y_pred, yend = 0), color = "black", alpha = 0.5) +
geom_point() +
labs(x = "predicted y", y = "residuals")The residuals are autocorrelated (you can see a pattern in the residuals). Here, it is better to use a time series model. A time series model can take into account the autocorrelation of the residuals.
R
Assumptions of linear models