Model selection and hypothesis testing

1 Lecture

1.1 Technical overview

1.2 Hypothesis testing for linear models in R

1.3 Regularization for linear models in R

2 Exercise

  • the first exercise deals with hypothesis testing with categorical predictor variables
  • the second exercise deals with hypothesis testing with interaction terms
  • the third exercise deals with regularization with numerical predictor variables
  • the fourth exercise deals with regularization with categorical predictor variables and the difficulty of comparing models with different parametrisations (contrast coding)

2.1 Likelihood-ratio test with the maples dataset

We again use the maples dataset from the previous exercise (06_maples.csv). This code gets you started (it was already part of the solution to the last exercise):

## Load required packages
library(dplyr)
library(readr)
library(ggplot2)
library(ggfortify)

## Load dataset
df_maples_prep <- read_csv("data/06_maples.csv")

## Apply transformation
df_maples <- df_maples_prep %>%
  mutate(calcium_addition = factor(watershed, 
                                   levels = c("Reference", "W1"),
                                   labels = c("Control", "Ca addition"))) %>%
  select(stem_length, elevation, calcium_addition)
  
## Fit linear model
lm_maples <- lm(stem_length ~ elevation + calcium_addition, 
                data = df_maples)

## Check assumption
autoplot(lm_maples, which = 1:2)

  1. Test the following hypothesis with an F-test:

\[
\begin{aligned}
H_0: \text{stemlength} &= \alpha_{\text{elevation}} + \epsilon \\
H_1: \text{stemlength} &= \alpha_{\text{elevation}} + \alpha_{\text{calcium-treatment}} + \epsilon
\end{aligned}
\]

  2. Use drop1 to get the F value, the degrees of freedom and the p-value; use summary to get the effect size and the standard error (a minimal sketch follows the list).
  3. Bonus: Add lines representing the effect size to the plot from the solution to the last exercise.
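
A minimal sketch of step 2, assuming the lm_maples object fitted above (drop1 with test = "F" compares the full model against the model with the respective term removed):

## F-test: does dropping calcium_addition significantly worsen the model?
drop1(lm_maples, test = "F")

## Effect size and standard error of the calcium treatment
summary(lm_maples)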

Solution for exercise 1

2.2 Likelihood-ratio test with the bacteria dataset

We again use the bacteria dataset from the previous exercise (06_bacteria.csv). This code gets you started (it was already part of the solution to the last exercise):

library(dplyr)
library(readr)
library(ggplot2)
library(ggfortify)

## theme for ggplot
theme_set(theme_classic())
theme_update(text = element_text(size = 14))

df_bacteria <- read_csv("data/06_bacteria.csv")

## fit model with interaction
lm_bacteria <- lm(density ~ species * temperature, data = df_bacteria)

## check assumptions
autoplot(lm_bacteria, which = 1:2)

  1. Test the hypothesis that the effect of temperature (the slope) on bacterial growth is species-dependent (a minimal sketch follows the hypothesis):

\[
\begin{aligned}
H_0: \text{density} &= \alpha_{\text{species}} + \beta_{\text{temperature}} \cdot \text{temperature} + \epsilon \\
H_1: \text{density} &= \alpha_{\text{species}} + (\beta_{\text{temperature}} + \beta_{\text{temperature:species}}) \cdot \text{temperature} + \epsilon
\end{aligned}
\]
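
A minimal sketch, assuming the lm_bacteria object fitted above; the interaction term can be tested with drop1 or, equivalently, with an explicit comparison of the two nested models:

## F-test for the interaction term (H1 vs. H0)
drop1(lm_bacteria, test = "F")

## Equivalent explicit comparison of the two nested models
lm_bacteria_null <- lm(density ~ species + temperature, data = df_bacteria)
anova(lm_bacteria_null, lm_bacteria)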

Solution for exercise 2

2.3 Regularization with the crab dataset

We will reuse the crab dataset from the previous exercise (06_crab_size.csv).

  1. Load the dataset and fit a linear model with size as the response variable, all the numerical variables (latitude, air_temp, air_temp_sd, water_temp, water_temp_sd) as predictors, and the interactions of latitude with air_temp and of latitude with water_temp.
  2. Load glmnetUtils and fit an elastic net model with cross-validation (a sketch of steps 1-3 follows below):
    • we use the function cva.glmnet that does cross validation for different values of alpha and lambda
    • use the same formula that you used to fit the ordinary least squares linear model
    • use the following alpha values: c(0, 0.1, 0.5, 0.7, 0.9, 0.95, 0.99, 1)
    • you can keep the lambda values as default
    • use 20-fold cross-validation (nfolds = 20)
  3. Plot the resulting object. Which alpha value performs best? If it’s hard to see, you can use the function minlossplot to get a better view of the minimal loss for each alpha value. What do alpha values of 0 and 1 mean?
  4. Plot your cross-validation error against the lambda values. You can use the same object that you created with cva.glmnet, but you have to subset its model list to the alpha value that you selected. Adapt elasticnet_obj, selected_alpha and alpha_vals to your object names in the following code:
plot(elasticnet_obj$modlist[[which(selected_alpha == alpha_vals)]])
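
A hedged sketch of steps 1-3, assuming the crab data have been read into a tibble named df_crab (the other object names are placeholders as well):

library(glmnetUtils)

## Candidate alpha values (alpha = 0 is ridge, alpha = 1 is lasso)
alpha_vals <- c(0, 0.1, 0.5, 0.7, 0.9, 0.95, 0.99, 1)

## Ordinary least squares fit with all numerical predictors and the
## two interactions with latitude
lm_crab <- lm(size ~ latitude + air_temp + air_temp_sd + water_temp +
                water_temp_sd + latitude:air_temp + latitude:water_temp,
              data = df_crab)

## Elastic net with 20-fold cross-validation over alpha and lambda,
## reusing the OLS formula
elasticnet_obj <- cva.glmnet(formula(lm_crab), data = df_crab,
                             alpha = alpha_vals, nfolds = 20)

## Compare the cross-validation loss across the alpha values
plot(elasticnet_obj)
minlossplot(elasticnet_obj)
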
  5. What do the first and the second dashed vertical lines mean? You can check the help page of cv.glmnet. What are lambda.1se and lambda.min?
  6. Visualize the effect of lambda on the coefficients of the predictor variables. Fit an elastic net model again, but this time use the function glmnet directly and set alpha to the value selected by the cross-validation. You can then run plot(elnet_obj, xvar = "lambda", label = TRUE), changing elnet_obj to your object name.
  7. Extract the fitted coefficients from the cross-validated object. You can use the coef function; you have to specify lambda (the argument is unfortunately named s) and alpha. You can request the lambda that is one standard error away from the minimal loss with s = "lambda.1se". Do you have coefficients that are zero? How many?
  8. Compare the coefficient values with the ones from the ordinary least squares model. Do they differ? Why?
  9. Now we want to plot marginal effects. This means that we set all but one predictor variable to its mean and vary the remaining variable between the minimum and the maximum of the original data. Start by calculating this for air_temp (bonus: for more, or all, variables; a condensed sketch follows the list):
    • Start by calculating the mean of each predictor variable and store the result in a tibble.
    • Use select(your_tibble, -your_variable_of_interest) to remove the variable of interest from the tibble.
    • Use expand_grid and combine your variable of interest (either use c() or seq()) with the tibble without that variable to create all combinations.
    • Add the predictions. You can use:
    mutate(size_pred = predict(your_elasticnet_cv_model, 
                            newdata = .,
                            s = "lambda.1se", 
                            alpha = your_selected_alpha)[,1])
    to add the size_pred column to your tibble.
    • Use this tibble to make a plot with ggplot. You can add the data points to the plot.
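
A condensed sketch of steps 7 and 9 under the same assumptions as above (elasticnet_obj, selected_alpha and df_crab are placeholder names; the means of the non-focal variables are computed directly instead of via the select() route):

library(tidyr)  # for expand_grid()

## Coefficients at lambda.1se; some may be exactly zero
coef(elasticnet_obj, s = "lambda.1se", alpha = selected_alpha)

## Marginal effect of air_temp: all other predictors fixed at their mean
df_means <- df_crab %>%
  summarise(across(c(latitude, air_temp_sd, water_temp, water_temp_sd), mean))

df_marginal <- expand_grid(
  air_temp = seq(min(df_crab$air_temp), max(df_crab$air_temp),
                 length.out = 100),
  df_means
) %>%
  mutate(size_pred = predict(elasticnet_obj, newdata = .,
                             s = "lambda.1se",
                             alpha = selected_alpha)[, 1])

## Marginal-effect line on top of the raw data
ggplot(df_marginal, aes(x = air_temp, y = size_pred)) +
  geom_line() +
  geom_point(data = df_crab, aes(x = air_temp, y = size))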

Solution for exercise 3

2.4 Bonus: Regularization with the penguins dataset

We will reuse the penguins dataset from the previous exercise (05_penguins.csv).

  1. Drop all rows that contain missing (NA) values; you can use drop_na().
  2. Create a column that combines the species and the sex column and name it species_sex. You can use unite(). Convert the resulting column to a factor.
  3. Center the year column by subtracting the mean.
  4. Fit an ordinary least squares linear model with body_mass_g as the response and the centered year and species_sex as the predictor variables. Check that the model assumptions are fulfilled.
  5. Similar to the last exercise, we fit an elastic net model and use cross-validation to find the best alpha and lambda values. Use the same alpha values as in the last exercise and tune alpha and lambda with 15-fold cross-validation (a sketch of steps 1-5 follows the list).
  6. Find a good alpha value.
  7. Plot the cross validation error against lambda values.
  8. Visualize the effect of lambda on the values of the predictor variables (see last exercise).
  9. Make predictions (bonus: compare elastic net predictions with OLS predictions) for all combinations of the centered year and the species/sex column. You can use expand_grid and predict to do this.
  10. Bonus: Compare the coefficient values of the OLS and the elastic net model. First, you have to use the same parametrisation: change the species_sex column to sum contrasts (the default is treatment contrasts) and fit the models again. Additionally, you can directly specify alpha in cv.glmnet and use the argument use.model.frame = TRUE to make use of the changed contrasts. What is the difference between sum and treatment contrasts? Why can’t we use the default parametrisation of glmnet in lm?
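
A hedged sketch of steps 1-5, assuming the packages from the previous exercises are loaded and the column names match 05_penguins.csv (object names are placeholders):

library(tidyr)  # for drop_na() and unite()

## Prepare the data: drop missing values, combine species and sex,
## and center the year
df_penguins <- read_csv("data/05_penguins.csv") %>%
  drop_na() %>%
  unite("species_sex", species, sex) %>%
  mutate(species_sex = factor(species_sex),
         year_c = year - mean(year))

## Ordinary least squares model and assumption checks
lm_penguins <- lm(body_mass_g ~ year_c + species_sex, data = df_penguins)
autoplot(lm_penguins, which = 1:2)

## Elastic net with 15-fold cross-validation over alpha and lambda
cva_penguins <- cva.glmnet(body_mass_g ~ year_c + species_sex,
                           data = df_penguins,
                           alpha = c(0, 0.1, 0.5, 0.7, 0.9, 0.95, 0.99, 1),
                           nfolds = 15)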

You can use the following code to change the contrasts of the species_sex column to sum contrasts:

contr_species_sex <- contr.sum(6)
colnames(contr_species_sex) <- levels(df_penguins$species_sex)[1:5] 
contrasts(df_penguins$species_sex) <- contr_species_sex
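
A sketch of how the comparison in step 10 could then look, assuming the objects from the sketch above (selected_alpha stands for the alpha value you chose):

## Refit the OLS model; lm() now picks up the sum contrasts set above
lm_penguins_sum <- lm(body_mass_g ~ year_c + species_sex, data = df_penguins)

## Elastic net at the selected alpha; use.model.frame = TRUE builds the
## design matrix via model.frame(), so the custom contrasts are respected
## (glmnet's default coding gives every factor level its own column,
## which would not be identifiable in an unpenalized lm)
cv_penguins_sum <- cv.glmnet(body_mass_g ~ year_c + species_sex,
                             data = df_penguins,
                             alpha = selected_alpha,
                             use.model.frame = TRUE)

## Compare the coefficients at lambda.1se with the OLS estimates
cbind(ols = coef(lm_penguins_sum),
      elnet = as.numeric(coef(cv_penguins_sum, s = "lambda.1se")))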

Solution for exercise 4