library(dplyr)
library(ggplot2)
library(tidyr)
library(glmnetUtils)
## theme for ggplot
theme_set(theme_classic())
theme_update(text = element_text(size = 14))
## set seed for reproducibility (random numbers generation)
set.seed(1)

Model selection
Day 7
Freie Universität Berlin @ Theoretical Ecology
January 19, 2024
\[ AICc = AIC + \frac{2 \cdot k \cdot (k + 1)}{n - k - 1} \]

with \(n\) being the sample size and \(k\) the number of estimated parameters
K. P. Burnham, D. R. Anderson, and K. P. Huyvaert, “AIC model selection and multimodel inference in behavioral ecology: some background, observations, and comparisons,” 2010. doi: 10.1007/s00265-010-1029-6.
for example:
| Model | \(AICc\) | \(\Delta AICc\) |
|---|---|---|
| 1 | 100 | 0 |
| 2 | 102 | 2 |
| 3 | 110 | 10 |
| 4 | 140 | 40 |
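\(\Delta AICc\) values like those in the table can be computed directly from fitted models. The following is a minimal sketch on simulated data; the data frame sim and the models m1–m3 are hypothetical, not from the slides:

## sketch: AICc and Delta-AICc for a set of candidate lm() fits
sim <- tibble(x1 = rnorm(30), x2 = rnorm(30), y = rnorm(30, mean = 2 * x2))
aicc <- function(fit) {
  k <- attr(logLik(fit), "df")  # number of estimated parameters
  n <- nobs(fit)                # sample size
  AIC(fit) + 2 * k * (k + 1) / (n - k - 1)
}
models <- list(m1 = lm(y ~ x1 + x2, data = sim),
               m2 = lm(y ~ x2, data = sim),
               m3 = lm(y ~ x1, data = sim))
aicc_values <- sapply(models, aicc)
aicc_values - min(aicc_values)  # Delta-AICc relative to the best model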
\[ \lambda_{LR} = -2 \cdot \log\left( \frac{\mathcal{L}(y|\hat\theta_{0})} {\mathcal{L}(y|\hat\theta_{1})} \right) = -2 \cdot \left[ \log(\mathcal{L}(y|\hat\theta_{0})) - \log(\mathcal{L}(y|\hat\theta_{1}))\right] \]
\(\lambda_{LR}\) asymptotically follows a \(\chi^2\) distribution with \(k_1 - k_0\) degrees of freedom
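A minimal sketch of the test in R, using the simulated sim data from above (the nested models fit0 and fit1 are hypothetical):

## likelihood-ratio test for two nested models
fit0 <- lm(y ~ x2, data = sim)       # restricted model (k0 parameters)
fit1 <- lm(y ~ x1 + x2, data = sim)  # full model (k1 parameters)
lambda_lr <- -2 * as.numeric(logLik(fit0) - logLik(fit1))
pchisq(lambda_lr, df = 1, lower.tail = FALSE)  # k1 - k0 = 1 here
## equivalently: anova(fit0, fit1, test = "Chisq")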
\[ F = \frac{(RSS_0 - RSS_1) / (k_1-k_0)}{RSS_1 / (n - k_1)} \]
with (\(k_1-k_0\), \(n-k_1\)) degrees of freedom
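For linear models, anova() on the same nested pair computes exactly this \(F\) statistic:

## F-test for the nested models from the sketch above
anova(fit0, fit1)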
\[ \begin{align} RSS_{ridge} &= \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 \\ \hat{\beta}_{ridge} &= \underset{\beta}{\operatorname{argmin}} \left\{ RSS_{ridge} \right\} \end{align} \]
\[ \begin{align} RSS_{lasso} &= \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{p} |\beta_j| \\ \hat{\beta}_{lasso} &= \underset{\beta}{\operatorname{argmin}} \left\{ RSS_{lasso} \right\} \end{align} \]
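To see the shrinkage at work, one option is to inspect coefficients along the \(\lambda\) path; this sketch uses the formula interface from glmnetUtils and the simulated sim data from above:

## lasso coefficient path: larger lambda pulls coefficients toward zero
fit_path <- glmnet(y ~ x1 + x2, data = sim, alpha = 1)
coef(fit_path, s = c(0.01, 0.1, 1))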
## densities of a Laplace prior (lasso) and a Normal prior (ridge) with scale s
s <- 2
df <- tibble(
  x = seq(-10, 10, length.out = 400),
  ## Laplace density: a two-sided exponential, built here via abs(x)
  Laplace = dexp(abs(x), rate = 1 / s) / 2,
  Normal = dnorm(x, mean = 0, sd = s)) %>%
  pivot_longer(cols = c(Laplace, Normal),
               names_to = "Distribution", values_to = "y")
ggplot(df, aes(x, y, color = Distribution)) +
  geom_line(linewidth = 1)

## simulate data where y depends on x2 but not on x1
df <- tibble(
x1 = seq(2, 10, length.out = 50),
x2 = rnorm(50, mean = 0, sd = 1),
y = rnorm(length(x1), mean = 2 * x2))
## cross-validated lasso (alpha = 1) and ridge (alpha = 0) fits
fit_lasso <- cv.glmnet(y ~ x1 + x2, data = df, alpha = 1)
fit_ridge <- cv.glmnet(y ~ x1 + x2, data = df, alpha = 0)
## ordinary least squares for comparison
fit_lm <- lm(y ~ x1 + x2, data = df)
## collect predictions from all three models for plotting
df %>%
mutate(
Lasso = predict(fit_lasso, newdata = df, s = "lambda.1se")[, 1],
Ridge = predict(fit_ridge, newdata = df, s = "lambda.1se")[, 1],
LM = predict(fit_lm, newdata = df)) %>%
select(-y) %>%
pivot_longer(c(Lasso, Ridge, LM), names_to = "method") %>%
mutate(method = factor(method, levels = c("LM", "Ridge", "Lasso"))) -> df_pred
ggplot(df, aes(x2, y)) +
geom_point(size = 5, alpha = 0.7) +
geom_line(aes(x2, value, color = method), data = df_pred,
            linewidth = 1)

\[ \begin{align} RSS_{elastic} &= \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda \left( \alpha \sum_{j=1}^{p} |\beta_j| + (1 - \alpha) \sum_{j=1}^{p} \beta_j^2 \right) \\ \hat{\beta}_{elastic} &= \underset{\beta}{\operatorname{argmin}} \left\{ RSS_{elastic} \right\} \end{align} \]
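An elastic-net fit works the same way as the lasso and ridge fits above (a sketch; note that glmnet's internal parameterization of the mixed penalty differs slightly from the formula shown here):

## elastic net: alpha between 0 (ridge) and 1 (lasso)
fit_elastic <- cv.glmnet(y ~ x1 + x2, data = df, alpha = 0.5)
coef(fit_elastic, s = "lambda.1se")
## glmnetUtils::cva.glmnet() additionally cross-validates over alpha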
| Situation | Goals | What to do? |
|---|---|---|
| you have a clear hypothesis and two competing nested models | Inference | use a likelihood-ratio test (\(F\)- or \(\chi^2\)-test) |
| you have several candidate models representing different hypotheses | Inference | compare models with \(AICc\) |
| you have many variables and want to explore their association with \(y\) | Exploration, Prediction | use regularization |