Model selection

Day 7

Felix Nößler

Freie Universität Berlin @ Theoretical Ecology

January 19, 2024

Reproduce slides

library(dplyr)
library(ggplot2)
library(tidyr)
library(glmnetUtils)

## theme for ggplot
theme_set(theme_classic())
theme_update(text = element_text(size = 14))

## set seed for reproducibility (random numbers generation)
set.seed(1) 

1 Overview

  • Model selection can be done by three different approaches:
    • Information criteria
    • Likelihood-ratio test
    • Regularization
  • all of them combine a measure of goodness-of-fit with a penalty for model complexity

1.1 Principle of parsimony (Occam’s razor)

  • when two models explain the data equally well, the simpler model is preferred
  • the simpler model is the one with fewer parameters

2 Information criteria

  • any number of models can be compared (they do not need to be nested)
  • the model with the lowest information criterion is the best
  • in ecology the AIC is most commonly used: \[ AIC = -2 \cdot \log(\mathcal{L}(y|\hat\theta)) + 2 \cdot k \] with \(\mathcal{L}(y|\hat\theta)\) being the likelihood of the model, \(\hat\theta\) the fitted parameters, \(y\) the data, and \(k\) the number of parameters
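As a quick sanity check, the AIC can be computed by hand from the log-likelihood and compared with R’s built-in AIC(); the model formula below (on the built-in mtcars data) is purely illustrative:

```r
## illustrative: compute the AIC by hand and compare with AIC()
fit <- lm(mpg ~ wt, data = mtcars)   # arbitrary example model
ll <- logLik(fit)
k <- attr(ll, "df")                  # coefficients plus the residual variance
aic_manual <- -2 * as.numeric(ll) + 2 * k
all.equal(aic_manual, AIC(fit))      # both agree
```

Note that for an LM the residual variance counts as a parameter, which is why \(k\) is taken from the log-likelihood’s df attribute rather than from the number of coefficients.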

2.1 AIC corrected for small sample sizes

\[ AICc = AIC + \frac{2 \cdot k \cdot (k + 1)}{n - k - 1} \] with \(n\) being the sample size

  • use it when \(\frac{n}{k} < 40\)
  • for larger sample sizes the \(AICc\) converges to the \(AIC\)

K. P. Burnham, D. R. Anderson, and K. P. Huyvaert, “AIC model selection and multimodel inference in behavioral ecology: some background, observations, and comparisons,” 2010. doi: 10.1007/s00265-010-1029-6.
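The correction is easy to apply by hand; nobs() and the log-likelihood’s df attribute give \(n\) and \(k\) (mtcars example, purely illustrative):

```r
## illustrative: AICc for a small data set
fit <- lm(mpg ~ wt, data = mtcars)
n <- nobs(fit)                  # sample size
k <- attr(logLik(fit), "df")    # number of parameters
aicc <- AIC(fit) + (2 * k * (k + 1)) / (n - k - 1)
n / k < 40                      # TRUE: the correction is advisable here
```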

2.2 How to interpret the AIC(c)

  • rank the models by their AIC(c) values, start with the lowest
  • compare the relative differences of the information criteria of the models
  • models with \(\Delta \leq 2\) are equally good

for example:

Model   \(AICc\)   \(\Delta AICc\)
1       100        0
2       102        2
3       110        10
4       140        40
  • model 1 is the best model, model 2 is equally good, model 3 has little support, and model 4 has close to zero support
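The same ranking can be produced for real candidate models; the formulas below (mtcars) are arbitrary examples:

```r
## illustrative: rank candidate models by AIC and Delta AIC
fits <- list(
  wt_hp = lm(mpg ~ wt + hp, data = mtcars),
  wt    = lm(mpg ~ wt, data = mtcars),
  cyl   = lm(mpg ~ cyl, data = mtcars))
aic <- sort(sapply(fits, AIC))   # ranked, lowest first
aic - min(aic)                   # Delta AIC, best model has Delta = 0
```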

3 Likelihood-ratio test

\[ \lambda_{LR} = -2 \cdot \log\left( \frac{\mathcal{L}(y|\hat\theta_{0})} {\mathcal{L}(y|\hat\theta_{1})} \right) = -2 \cdot \left[ \log(\mathcal{L}(y|\hat\theta_{0})) - \log(\mathcal{L}(y|\hat\theta_{1}))\right] \]

the test statistic asymptotically follows a \(\chi^2\) distribution with \(k_1 - k_0\) degrees of freedom

  • \(\mathcal{L}(y|\hat\theta_{0})\) is the likelihood of the reduced model
  • \(\mathcal{L}(y|\hat\theta_{1})\) is the likelihood of the full model
  • \(k_0\) is the number of parameters of the reduced model
  • \(k_1\) is the number of parameters of the full model
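In R the test statistic can be computed directly from the two log-likelihoods and compared against a \(\chi^2\) distribution; anova() with test = "LRT" gives the same result. The nested GLMs below (mtcars) are only an illustration:

```r
## illustrative: likelihood-ratio test for two nested GLMs
fit0 <- glm(am ~ wt, data = mtcars, family = binomial)       # reduced model
fit1 <- glm(am ~ wt + hp, data = mtcars, family = binomial)  # full model
lambda_lr <- -2 * (as.numeric(logLik(fit0)) - as.numeric(logLik(fit1)))
df <- attr(logLik(fit1), "df") - attr(logLik(fit0), "df")
pchisq(lambda_lr, df = df, lower.tail = FALSE)               # p-value
```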

3.1 F-Test

  • the F-test is a special case of the likelihood-ratio test for linear models
  • it compares a full model with a nested, reduced model
  • the residual sum of squares (RSS) is used as the measure of goodness-of-fit

\[ F = \frac{(RSS_0 - RSS_1) / (k_1-k_0)}{RSS_1 / (n - k_1)} \]

with (\(k_1-k_0\), \(n-k_1\)) degrees of freedom

  • \(RSS_0\) and \(RSS_1\) are the RSS of the reduced and the full model
  • \(k_0\) and \(k_1\) are the number of parameters of the reduced and the full model
  • \(n\) is the sample size
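The formula can be checked by hand against R’s anova(); the nested linear models below (mtcars) are arbitrary examples:

```r
## illustrative: the F statistic by hand for nested linear models
fit0 <- lm(mpg ~ wt, data = mtcars)        # reduced model
fit1 <- lm(mpg ~ wt + hp, data = mtcars)   # full model
rss0 <- sum(resid(fit0)^2)
rss1 <- sum(resid(fit1)^2)
n  <- nobs(fit1)
k0 <- length(coef(fit0))
k1 <- length(coef(fit1))
f_stat <- ((rss0 - rss1) / (k1 - k0)) / (rss1 / (n - k1))
all.equal(f_stat, anova(fit0, fit1)$F[2])  # matches anova()
```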

4 Regularization

  • we will cover three different regularization methods:
    • Ridge regression
    • Lasso regression
    • Elastic net
  • the residual sum of squares (RSS, for LM) or deviance (for GLM) is used as a measure of goodness-of-fit
  • the regularization term (penalty) is added to the RSS

4.1 Ridge regression

  • the RSS is penalized by the sum of the squared coefficients

\[ \begin{align} RSS_{ridge} &= \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 \\ \hat{\beta}_{ridge} &= \underset{\beta}{\operatorname{argmin}} \left\{ RSS_{ridge} \right\} \end{align} \]
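For a linear model the ridge estimate has the closed form \(\hat\beta_{ridge} = (X^TX + \lambda I)^{-1} X^T y\), which makes the shrinkage easy to see on simulated data (everything below is an illustrative sketch):

```r
## illustrative: ridge shrinkage via the closed-form solution
set.seed(1)
X <- scale(matrix(rnorm(200), ncol = 2))   # two standardized predictors
y <- X %*% c(2, 0) + rnorm(100)            # only the first predictor matters
lambda <- 10
beta_ols   <- solve(t(X) %*% X,                    t(X) %*% y)
beta_ridge <- solve(t(X) %*% X + lambda * diag(2), t(X) %*% y)
sum(beta_ridge^2) < sum(beta_ols^2)        # TRUE: coefficients are shrunk
```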

4.2 Lasso regression

  • the RSS is penalized by the sum of the absolute coefficients

\[ \begin{align} RSS_{lasso} &= \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{p} |\beta_j| \\ \hat{\beta}_{lasso} &= \underset{\beta}{\operatorname{argmin}} \left\{ RSS_{lasso} \right\} \end{align} \]
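Why the lasso can set coefficients exactly to zero is easiest to see in the orthonormal case, where the solution is soft-thresholding of the least-squares estimates (a sketch of that special case, not the general fitting algorithm):

```r
## illustrative: soft-thresholding, the lasso solution for an orthonormal design
soft_threshold <- function(z, lambda) sign(z) * pmax(abs(z) - lambda, 0)
soft_threshold(c(-3, -0.5, 0.2, 2), lambda = 1)   # -2, 0, 0, 1
```

Estimates whose absolute value falls below \(\lambda\) are set exactly to zero; larger estimates are moved towards zero by \(\lambda\).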

4.3 Comparison of ridge and lasso regression

  • ridge regression shrinks the coefficients towards zero
  • lasso regression shrinks the coefficients towards zero and sets some coefficients to zero
Code
s <- 2
df <- tibble(
  x = seq(-10, 10, length.out = 400),
  Laplace = dexp(abs(x), rate = 1/s) / 2,  # Laplace (double exponential) density
  Normal = dnorm(x, mean = 0, sd = s)) %>% 
  pivot_longer(cols = c(Laplace, Normal), 
               names_to = "Distribution", values_to = "y")

ggplot(df, aes(x, y, color = Distribution)) +
  geom_line(linewidth = 1)  

  • the lasso regression acts like a Bayesian model with a Laplace prior
  • the ridge regression acts like a Bayesian model with a Normal prior

4.4 Comparison of ridge and lasso regression

Code
df <- tibble(
  x1 = seq(2, 10, length.out = 50),
  x2 = rnorm(50, mean = 0, sd = 1),
  y = rnorm(length(x1), mean = 2 * x2))
  
fit_lasso <- cv.glmnet(y ~ x1 + x2, data = df, alpha = 1)  
fit_ridge <- cv.glmnet(y ~ x1 + x2, data = df, alpha = 0)
fit_lm <- lm(y ~ x1 + x2, data = df)

df %>%
  mutate(
    Lasso = predict(fit_lasso, newdata = df, s = "lambda.1se")[, 1],
    Ridge = predict(fit_ridge, newdata = df, s = "lambda.1se")[, 1],
    LM = predict(fit_lm, newdata = df)) %>%
  select(-y) %>% 
  pivot_longer(c(Lasso, Ridge, LM), names_to = "method") %>%
  mutate(method = factor(method, levels = c("LM", "Ridge", "Lasso"))) -> df_pred

ggplot(df, aes(x2, y)) +
  geom_point(size = 5, alpha = 0.7) +
  geom_line(aes(x2, value, color = method), data = df_pred, 
            linewidth = 1)

tibble(
  LM = coef(fit_lm)[-1],
  Ridge = coef(fit_ridge, s = "lambda.1se")[-1],
  Lasso = coef(fit_lasso, s = "lambda.1se")[-1])
# A tibble: 2 × 3
       LM   Ridge Lasso
    <dbl>   <dbl> <dbl>
1 -0.0308 -0.0238  0   
2  1.95    1.42    1.42

4.5 Elastic net

  • the RSS is penalized by a combination of the sum of the squared coefficients and the sum of the absolute coefficients

\[ \begin{align} RSS_{elastic} &= \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda \left( \alpha \sum_{j=1}^{p} |\beta_j| + (1 - \alpha) \sum_{j=1}^{p} \beta_j^2 \right) \\ \hat{\beta}_{elastic} &= \underset{\beta}{\operatorname{argmin}} \left\{ RSS_{elastic} \right\} \end{align} \]

  • if \(\alpha = 0\) the elastic net is equivalent to ridge regression
  • if \(\alpha = 1\) the elastic net is equivalent to lasso regression

4.6 How to choose \(\lambda\) and \(\alpha\)

  • \(\lambda\) and \(\alpha\) are called hyperparameters
  • they are chosen by cross-validation:
    • the (train) dataset is split several times into a training and a validation data set
    • for each split and each combination of \(\lambda\) and \(\alpha\) the model is fitted on the training data set and the prediction error is calculated on the validation data set
    • the average prediction error is calculated for each combination of \(\lambda\) and \(\alpha\)
    • the \(\lambda\) and \(\alpha\) values with the lowest average prediction error are used to fit the model on the whole (train) dataset
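The procedure above can be sketched in base R for ridge regression with a fixed \(\alpha\); in practice cv.glmnet() automates this. The data and the \(\lambda\) grid below are illustrative:

```r
## illustrative: 5-fold cross-validation over lambda for closed-form ridge
set.seed(1)
X <- scale(matrix(rnorm(300), ncol = 3))
y <- as.numeric(X %*% c(2, 0, 0) + rnorm(100))
ridge_fit <- function(X, y, l) solve(t(X) %*% X + l * diag(ncol(X)), t(X) %*% y)
folds <- sample(rep(1:5, length.out = nrow(X)))     # random fold assignment
lambdas <- c(0.01, 0.1, 1, 10, 100)
cv_error <- sapply(lambdas, function(l)
  mean(sapply(1:5, function(f) {
    ## fit on the training folds, evaluate on the validation fold
    beta <- ridge_fit(X[folds != f, , drop = FALSE], y[folds != f], l)
    mean((y[folds == f] - X[folds == f, , drop = FALSE] %*% beta)^2)
  })))
lambdas[which.min(cv_error)]   # lambda with the lowest average validation error
```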

5 Guidance

Situation Goals What to do?
you have a clear hypothesis and two competing nested models Inference use a likelihood-ratio test (\(F\)- or \(\chi^2\)-test)
you have many variables and want to explore their association with \(y\) Exploration, Prediction use regularization

5.1 Should I split the data into training and test data sets?

  • if you have enough data and want to make predictions: yes!
  • if you have little data or you want to do exploration or inference: no.