Model selection

Day 7

Felix Nößler

Freie Universität Berlin @ Theoretical Ecology

January 19, 2024

Reproduce slides

library(dplyr)
library(ggplot2)
library(tidyr)
library(glmnetUtils)

## theme for ggplot
theme_set(theme_classic())
theme_update(text = element_text(size = 14))

## set seed for reproducibility (random numbers generation)
set.seed(1) 

1 Overview

  • Model selection can be done by three different approaches:
    • Information criteria
    • Likelihood-ratio test
    • Regularization
  • all of them combine a measure of goodness-of-fit with a penalty for model complexity

1.1 Principle of parsimony (Occam’s razor)

  • when two models explain the data equally well, the simpler model is preferred
  • the simpler model is the one with fewer parameters

2 Information criteria

  • any number of models can be compared (they do not need to be nested)
  • the model with the lowest information criterion is the best
  • in ecology the AIC is most commonly used: \[ AIC = -2 \cdot \log(\mathcal{L}(y|\hat\theta)) + 2 \cdot k \] with \(\mathcal{L}(y|\hat\theta)\) being the likelihood of the model, \(\hat\theta\) the fitted parameters, \(y\) the data, and \(k\) the number of parameters
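As a quick sanity check, the AIC can be computed by hand from the log-likelihood and compared with R’s built-in AIC(); the model formula below (on the built-in mtcars data) is purely illustrative:

```r
## illustrative: compute the AIC by hand and compare with AIC()
fit <- lm(mpg ~ wt, data = mtcars)   # arbitrary example model
ll <- logLik(fit)
k <- attr(ll, "df")                  # coefficients plus the residual variance
aic_manual <- -2 * as.numeric(ll) + 2 * k
all.equal(aic_manual, AIC(fit))      # both agree
```

Note that for an LM the residual variance counts as a parameter, which is why \(k\) is taken from the log-likelihood’s df attribute rather than from the number of coefficients.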

2.1 AIC corrected for small sample sizes

\[ AICc = AIC + \frac{2 \cdot k \cdot (k + 1)}{n - k - 1} \] with \(n\) being the sample size

  • use it when \(\frac{n}{k} < 40\)
  • for larger sample sizes the \(AICc\) converges to the \(AIC\)

K. P. Burnham, D. R. Anderson, and K. P. Huyvaert, “AIC model selection and multimodel inference in behavioral ecology: some background, observations, and comparisons,” 2010. doi: 10.1007/s00265-010-1029-6.
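The correction is easy to apply by hand; nobs() and the log-likelihood’s df attribute give \(n\) and \(k\) (mtcars example, purely illustrative):

```r
## illustrative: AICc for a small data set
fit <- lm(mpg ~ wt, data = mtcars)
n <- nobs(fit)                  # sample size
k <- attr(logLik(fit), "df")    # number of parameters
aicc <- AIC(fit) + (2 * k * (k + 1)) / (n - k - 1)
n / k < 40                      # TRUE: the correction is advisable here
```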

2.2 How to interpret the AIC(c)

  • rank the models by their AIC(c) values, start with the lowest
  • compare the relative differences of the information criteria of the models
  • models with \(\Delta \leq 2\) are equally good

for example:

Model   \(AICc\)   \(\Delta AICc\)
1       100        0
2       102        2
3       110        10
4       140        40
  • model 1 is the best model, model 2 is equally good, model 3 has little support, and model 4 has close to zero support
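The same ranking can be produced for real candidate models; the formulas below (mtcars) are arbitrary examples:

```r
## illustrative: rank candidate models by AIC and Delta AIC
fits <- list(
  wt_hp = lm(mpg ~ wt + hp, data = mtcars),
  wt    = lm(mpg ~ wt, data = mtcars),
  cyl   = lm(mpg ~ cyl, data = mtcars))
aic <- sort(sapply(fits, AIC))   # ranked, lowest first
aic - min(aic)                   # Delta AIC, best model has Delta = 0
```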

3 Likelihood-ratio test

\[ \lambda_{LR} = -2 \cdot \log\left( \frac{\mathcal{L}(y|\hat\theta_{0})} {\mathcal{L}(y|\hat\theta_{1})} \right) = -2 \cdot \left[ \log(\mathcal{L}(y|\hat\theta_{0})) - \log(\mathcal{L}(y|\hat\theta_{1}))\right] \]

the test statistic asymptotically follows a \(\chi^2\) distribution with \(k_1 - k_0\) degrees of freedom

  • \(\mathcal{L}(y|\hat\theta_{0})\) is the likelihood of the reduced model
  • \(\mathcal{L}(y|\hat\theta_{1})\) is the likelihood of the full model
  • \(k_0\) is the number of parameters of the reduced model
  • \(k_1\) is the number of parameters of the full model
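In R the test statistic can be computed directly from the two log-likelihoods and compared against a \(\chi^2\) distribution; anova() with test = "LRT" gives the same result. The nested GLMs below (mtcars) are only an illustration:

```r
## illustrative: likelihood-ratio test for two nested GLMs
fit0 <- glm(am ~ wt, data = mtcars, family = binomial)       # reduced model
fit1 <- glm(am ~ wt + hp, data = mtcars, family = binomial)  # full model
lambda_lr <- -2 * (as.numeric(logLik(fit0)) - as.numeric(logLik(fit1)))
df <- attr(logLik(fit1), "df") - attr(logLik(fit0), "df")
pchisq(lambda_lr, df = df, lower.tail = FALSE)               # p-value
```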

3.1 F-Test

  • the F-test is a special case of the likelihood-ratio test for linear models
  • it compares a full model with a nested, reduced model
  • the residual sum of squares (RSS) is used as the measure of goodness-of-fit

\[ F = \frac{(RSS_0 - RSS_1) / (k_1-k_0)}{RSS_1 / (n - k_1)} \]

with (\(k_1-k_0\), \(n-k_1\)) degrees of freedom

  • \(RSS_0\) and \(RSS_1\) are the RSS of the reduced and the full model
  • \(k_0\) and \(k_1\) are the number of parameters of the reduced and the full model
  • \(n\) is the sample size
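The formula can be checked by hand against R’s anova(); the nested linear models below (mtcars) are arbitrary examples:

```r
## illustrative: the F statistic by hand for nested linear models
fit0 <- lm(mpg ~ wt, data = mtcars)        # reduced model
fit1 <- lm(mpg ~ wt + hp, data = mtcars)   # full model
rss0 <- sum(resid(fit0)^2)
rss1 <- sum(resid(fit1)^2)
n  <- nobs(fit1)
k0 <- length(coef(fit0))
k1 <- length(coef(fit1))
f_stat <- ((rss0 - rss1) / (k1 - k0)) / (rss1 / (n - k1))
all.equal(f_stat, anova(fit0, fit1)$F[2])  # matches anova()
```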

4 Regularization

  • we will cover three different regularization methods:
    • Ridge regression
    • Lasso regression
    • Elastic net
  • the residual sum of squares (RSS, for LM) or deviance (for GLM) is used as a measure of goodness-of-fit
  • the regularization term (penalty) is added to the RSS

4.1 Ridge regression

  • the RSS is penalized by the sum of the squared coefficients

\[ \begin{align} RSS_{ridge} &= \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 \\ \hat{\beta}_{ridge} &= \underset{\beta}{\operatorname{argmin}} \left\{ RSS_{ridge} \right\} \end{align} \]
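For a linear model the ridge estimate has the closed form \(\hat\beta_{ridge} = (X^TX + \lambda I)^{-1} X^T y\), which makes the shrinkage easy to see on simulated data (everything below is an illustrative sketch):

```r
## illustrative: ridge shrinkage via the closed-form solution
set.seed(1)
X <- scale(matrix(rnorm(200), ncol = 2))   # two standardized predictors
y <- X %*% c(2, 0) + rnorm(100)            # only the first predictor matters
lambda <- 10
beta_ols   <- solve(t(X) %*% X,                    t(X) %*% y)
beta_ridge <- solve(t(X) %*% X + lambda * diag(2), t(X) %*% y)
sum(beta_ridge^2) < sum(beta_ols^2)        # TRUE: coefficients are shrunk
```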

4.2 Lasso regression

  • the RSS is penalized by the sum of the absolute coefficients

\[ \begin{align} RSS_{lasso} &= \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{p} |\beta_j| \\ \hat{\beta}_{lasso} &= \underset{\beta}{\operatorname{argmin}} \left\{ RSS_{lasso} \right\} \end{align} \]
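Why the lasso can set coefficients exactly to zero is easiest to see in the orthonormal case, where the solution is soft-thresholding of the least-squares estimates (a sketch of that special case, not the general fitting algorithm):

```r
## illustrative: soft-thresholding, the lasso solution for an orthonormal design
soft_threshold <- function(z, lambda) sign(z) * pmax(abs(z) - lambda, 0)
soft_threshold(c(-3, -0.5, 0.2, 2), lambda = 1)   # -2, 0, 0, 1
```

Estimates whose absolute value falls below \(\lambda\) are set exactly to zero; larger estimates are moved towards zero by \(\lambda\).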

4.3 Comparison of ridge and lasso regression

  • ridge regression shrinks the coefficients towards zero
  • lasso regression shrinks the coefficients towards zero and sets some coefficients to zero
Code
s <- 2
df <- tibble(
  x = seq(-10, 10, length.out = 400),
  Laplace = dexp(abs(x), rate = 1/s) / 2,  # Laplace (double exponential) density
  Normal = dnorm(x, mean = 0, sd = s)) %>% 
  pivot_longer(cols = c(Laplace, Normal), 
               names_to = "Distribution", values_to = "y")

ggplot(df, aes(x, y, color = Distribution)) +
  geom_line(linewidth = 1)  

  • the lasso regression acts like a Bayesian model with a Laplace prior
  • the ridge regression acts like a Bayesian model with a Normal prior

4.4 Comparison of ridge and lasso regression

Code
df <- tibble(
  x1 = seq(2, 10, length.out = 50),
  x2 = rnorm(50, mean = 0, sd = 1),
  y = rnorm(length(x1), mean = 2 * x2))
  
fit_lasso <- cv.glmnet(y ~ x1 + x2, data = df, alpha = 1)  
fit_ridge <- cv.glmnet(y ~ x1 + x2, data = df, alpha = 0)
fit_lm <- lm(y ~ x1 + x2, data = df)

df %>%
  mutate(
    Lasso = predict(fit_lasso, newdata = df, s = "lambda.1se")[, 1],
    Ridge = predict(fit_ridge, newdata = df, s = "lambda.1se")[, 1],
    LM = predict(fit_lm, newdata = df)) %>%
  select(-y) %>% 
  pivot_longer(c(Lasso, Ridge, LM), names_to = "method") %>%
  mutate(method = factor(method, levels = c("LM", "Ridge", "Lasso"))) -> df_pred

ggplot(df, aes(x2, y)) +
  geom_point(size = 5, alpha = 0.7) +
  geom_line(aes(x2, value, color = method), data = df_pred, 
            linewidth = 1)

tibble(
  LM = coef(fit_lm)[-1],
  Ridge = coef(fit_ridge, s = "lambda.1se")[-1],
  Lasso = coef(fit_lasso, s = "lambda.1se")[-1])
# A tibble: 2 × 3
       LM   Ridge Lasso
    <dbl>   <dbl> <dbl>
1 -0.0308 -0.0238  0   
2  1.95    1.42    1.42

4.5 Elastic net

  • the RSS is penalized by a combination of the sum of the squared coefficients and the sum of the absolute coefficients

\[ \begin{align} RSS_{elastic} &= \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda \left( \alpha \sum_{j=1}^{p} |\beta_j| + (1 - \alpha) \sum_{j=1}^{p} \beta_j^2 \right) \\ \hat{\beta}_{elastic} &= \underset{\beta}{\operatorname{argmin}} \left\{ RSS_{elastic} \right\} \end{align} \]

  • if \(\alpha = 0\) the elastic net is equivalent to ridge regression
  • if \(\alpha = 1\) the elastic net is equivalent to lasso regression

4.6 How to choose \(\lambda\) and \(\alpha\)

  • \(\lambda\) and \(\alpha\) are called hyperparameters
  • they are chosen by cross-validation:
    • the (train) dataset is split several times into a training and a validation data set
    • for each split and each combination of \(\lambda\) and \(\alpha\) the model is fitted on the training data set and the prediction error is calculated on the validation data set
    • the average prediction error is calculated for each combination of \(\lambda\) and \(\alpha\)
    • the \(\lambda\) and \(\alpha\) values with the lowest average prediction error are used to fit the model on the whole (train) dataset
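The procedure above can be sketched in base R for ridge regression with a fixed \(\alpha\); in practice cv.glmnet() automates this. The data and the \(\lambda\) grid below are illustrative:

```r
## illustrative: 5-fold cross-validation over lambda for closed-form ridge
set.seed(1)
X <- scale(matrix(rnorm(300), ncol = 3))
y <- as.numeric(X %*% c(2, 0, 0) + rnorm(100))
ridge_fit <- function(X, y, l) solve(t(X) %*% X + l * diag(ncol(X)), t(X) %*% y)
folds <- sample(rep(1:5, length.out = nrow(X)))     # random fold assignment
lambdas <- c(0.01, 0.1, 1, 10, 100)
cv_error <- sapply(lambdas, function(l)
  mean(sapply(1:5, function(f) {
    ## fit on the training folds, evaluate on the validation fold
    beta <- ridge_fit(X[folds != f, , drop = FALSE], y[folds != f], l)
    mean((y[folds == f] - X[folds == f, , drop = FALSE] %*% beta)^2)
  })))
lambdas[which.min(cv_error)]   # lambda with the lowest average validation error
```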

5 Guidance

Situation Goals What to do?
you have a clear hypothesis and two competing nested models Inference use a likelihood-ratio test (\(F\)- or \(\chi^2\)-test)
you have many variables and want to explore their association with \(y\) Exploration, Prediction use regularization

5.1 Should I split the data into training and test data sets?

  • if you have enough data and want to make predictions: yes!
  • if you have little data or you want to do exploration or inference: no.