Model selection and hypothesis testing
1 Lecture
1.1 Technical overview
1.2 Hypothesis testing for linear models in R
1.3 Regularization for linear models in R
2 Exercise
- the first exercise deals with hypothesis testing with categorical predictor variables
- the second exercise deals with hypothesis testing with interaction terms
- the third exercise deals with regularization with numerical predictor variables
- the fourth exercise deals with regularization with categorical predictor variables and the difficulty of comparing models with different parametrisations (contrast coding)
2.1 Likelihood-ratio test with the maples dataset
We again use the maples dataset from the previous exercise (06_maples.csv). This code gets you started (it was already part of the solution of the last exercise):
## Load dataset
df_maples_prep <- read_csv("data/06_maples.csv")
## Apply transformation
df_maples <- df_maples_prep %>%
mutate(calcium_addition = factor(watershed,
levels = c("Reference", "W1"),
labels = c("Control", "Ca addition"))) %>%
select(stem_length, elevation, calcium_addition)
## Fit linear model
lm_maples <- lm(stem_length ~ elevation + calcium_addition,
data = df_maples)
## Check assumptions
autoplot(lm_maples, which = 1:2)
- Test the following hypothesis with an F-test:
\[ \begin{align} H_0: \text{stemlength} &= \alpha_{\text{elevation}} + \epsilon \\ H_1: \text{stemlength} &= \alpha_{\text{elevation}} + \alpha_{\text{calcium-treatment}} + \epsilon \end{align} \]
- Use `drop1` to get the F value, the degrees of freedom and the p-value; use `summary` to get the effect size and the standard error.
- Bonus: Add lines that represent the effect size to the plot from the solution of the last exercise.
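For example, the `drop1`/`summary` step could look like this. The data here are simulated stand-ins for the maples dataset (column names follow the exercise; all numbers are made up):

```r
## Simulated stand-in for the maples data (made-up effect sizes)
set.seed(1)
df_sim <- data.frame(
  elevation = runif(60, 200, 800),
  calcium_addition = factor(rep(c("Control", "Ca addition"), each = 30),
                            levels = c("Control", "Ca addition"))
)
df_sim$stem_length <- 5 + 0.01 * df_sim$elevation +
  2 * (df_sim$calcium_addition == "Ca addition") + rnorm(60)

fit <- lm(stem_length ~ elevation + calcium_addition, data = df_sim)
drop1(fit, test = "F")  # F value, degrees of freedom and p-value per dropped term
summary(fit)            # coefficient estimates (effect sizes) and standard errors
```

The row `calcium_addition` in the `drop1` table is exactly the comparison of H0 against H1 above.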
2.2 Likelihood-ratio test with the bacteria dataset
We again use the bacteria dataset from the previous exercise (06_bacteria.csv). This code gets you started (it was already part of the solution of the last exercise):
library(dplyr)
library(readr)
library(ggplot2)
library(ggfortify)
## theme for ggplot
theme_set(theme_classic())
theme_update(text = element_text(size = 14))
df_bacteria <- read_csv("data/06_bacteria.csv")
## fit model with interaction
lm_bacteria <- lm(density ~ species * temperature, data = df_bacteria)
## check assumptions
autoplot(lm_bacteria, which = 1:2)
- Test the following hypothesis, namely that the effect of temperature (the slope) on the bacterial growth is species dependent:
\[ \begin{align} H_0: \text{density} &= \alpha_{\text{species}} + \beta_{\text{temperature}} \cdot \text{temperature} + \epsilon \\ H_1: \text{density} &= \alpha_{\text{species}} + (\beta_{\text{temperature}} + \beta_{\text{temperature:species}}) \cdot \text{temperature}+ \epsilon \end{align} \]
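This comparison is again a single `drop1` (or an explicit `anova`) call on the interaction model. A sketch with simulated data standing in for the bacteria dataset (names follow the exercise, the slopes are made up):

```r
## Simulated stand-in for the bacteria data (made-up slopes)
set.seed(1)
df_sim <- data.frame(
  species = factor(rep(c("A", "B"), each = 40)),
  temperature = runif(80, 10, 35)
)
## species B responds more strongly to temperature
df_sim$density <- 1 + 0.5 * df_sim$temperature +
  0.4 * df_sim$temperature * (df_sim$species == "B") + rnorm(80)

fit_full <- lm(density ~ species * temperature, data = df_sim)
drop1(fit_full, test = "F")  # F-test for dropping the interaction (H1 vs H0)

## the same comparison as an explicit model comparison
fit_null <- lm(density ~ species + temperature, data = df_sim)
anova(fit_null, fit_full)
```

Only the interaction term `species:temperature` appears in the `drop1` table, because the main effects are not droppable while the interaction is in the model.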
2.3 Regularization with the crab dataset
We will reuse the crab dataset from the previous exercise (06_crab_size.csv).
- Load the dataset and fit a linear model with all the numerical variables (`latitude`, `air_temp`, `air_temp_sd`, `water_temp`, `water_temp_sd`), the interactions of `latitude` with `air_temp` and of `latitude` with `water_temp`, and `size` as the response variable.
- Load `glmnetUtils` and fit an elastic net model with cross-validation:
  - we use the function `cva.glmnet`, which runs cross-validation for different values of alpha and lambda
  - use the same formula that you used to fit the ordinary least squares linear model
  - use the following alpha values: `c(0, 0.1, 0.5, 0.7, 0.9, 0.95, 0.99, 1)`
  - you can keep the lambda values at their defaults
  - use 20-fold cross-validation (`nfolds = 20`)
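A sketch of this fit, using a simulated stand-in for the crab data (the column names follow the exercise, the relationships are made up; `cva.glmnet` and `minlossplot` come from `glmnetUtils`):

```r
library(glmnetUtils)  # formula interface to glmnet; cross-validates over alpha

## simulated stand-in for the crab dataset (made-up relationships)
set.seed(1)
n <- 200
df_crab <- data.frame(
  latitude      = runif(n, 25, 45),
  air_temp      = rnorm(n, 15, 5),
  air_temp_sd   = runif(n, 0, 3),
  water_temp    = rnorm(n, 12, 4),
  water_temp_sd = runif(n, 0, 2)
)
df_crab$size <- with(df_crab,
  10 + 0.3 * latitude + 0.5 * air_temp + 0.02 * latitude * water_temp + rnorm(n))

alpha_vals <- c(0, 0.1, 0.5, 0.7, 0.9, 0.95, 0.99, 1)
elasticnet_obj <- cva.glmnet(
  size ~ latitude + air_temp + air_temp_sd + water_temp + water_temp_sd +
    latitude:air_temp + latitude:water_temp,
  data = df_crab, alpha = alpha_vals, nfolds = 20)

plot(elasticnet_obj)         # one cross-validation curve per alpha
minlossplot(elasticnet_obj)  # minimal CV loss per alpha, easier to compare
```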
- Plot the resulting object. Which alpha values are best? If it is hard to see, you can use the function `minlossplot` to get a better view of the minimal loss for each alpha value. What do alpha values of 0 and 1 mean?
- Plot your cross-validation error against the lambda values. We can use the object that you already created with `cva.glmnet`, but we have to subset its list to the alpha value that you selected. Adapt `elasticnet_obj`, `selected_alpha` and `alpha_vals` to your object names in the following code:
elasticnet_obj$modlist[[which(selected_alpha == alpha_vals)]]
- What do the first and the second dashed vertical lines mean? You can check the help page of `cv.glmnet`. What are `lambda.1se` and `lambda.min`?
- Visualize the effect of lambda on the coefficients of the predictor variables. Fit an elastic net model again, but this time use the function `glmnet` directly and set `alpha` to the value that was selected by the cross-validation. You can then run `plot(elnet_obj, xvar = "lambda", label = TRUE)`; change `elnet_obj` to your object name.
- Extract the fitted coefficients from the cross-validated object. You can use the `coef` function; you have to specify `lambda` (the argument is unfortunately named `s`) and `alpha`. You can request the `lambda` that is one standard error away from the minimal loss with `s = "lambda.1se"`. Do you have coefficients that are zero? How many?
- Compare the coefficient values with the ones from the ordinary least squares model. Do they differ? Why?
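The subsetting, path-plotting and coefficient-extraction steps above can be sketched on a small toy dataset (simulated; the object names `elasticnet_obj`, `selected_alpha` and `alpha_vals` follow this exercise and are not fixed by the package):

```r
library(glmnetUtils)

## toy data: only x1 is related to the response (made-up numbers)
set.seed(1)
df_sim <- data.frame(x1 = rnorm(100), x2 = rnorm(100), x3 = rnorm(100))
df_sim$y <- 1 + 2 * df_sim$x1 + rnorm(100)

alpha_vals <- c(0, 0.5, 1)
elasticnet_obj <- cva.glmnet(y ~ ., data = df_sim,
                             alpha = alpha_vals, nfolds = 5)
selected_alpha <- 1  # pretend the cross-validation favoured the lasso

## CV error against log(lambda) for the selected alpha;
## the dashed vertical lines mark lambda.min and lambda.1se
cv_fit <- elasticnet_obj$modlist[[which(selected_alpha == alpha_vals)]]
plot(cv_fit)

## coefficient paths: how the coefficients shrink as lambda grows
elnet_obj <- glmnet(y ~ ., data = df_sim, alpha = selected_alpha)
plot(elnet_obj, xvar = "lambda", label = TRUE)

## coefficients at lambda.1se (indexing into modlist is equivalent to
## passing s and alpha to coef() on the cva.glmnet object)
coefs_en <- coef(cv_fit, s = "lambda.1se")
sum(coefs_en[, 1] == 0)  # number of coefficients shrunk to exactly zero

## compare with the ordinary least squares coefficients
coef(lm(y ~ ., data = df_sim))
```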
- Now we want to plot marginal effects. This means that we set all but one predictor variable to their means, while the remaining predictor variable is varied between the minimum and the maximum of the original data. Start by calculating this for `air_temp` (bonus: for more, or all, variables):
  - Start by calculating the mean of each predictor variable and store the result in a `tibble`.
  - Use `select(your_tibble, -your_variable_of_interest)` to remove the variable of interest from the tibble.
  - Use `expand_grid` and add your variable (either use `c()` or `seq()`) and the tibble without the variable to create all combinations.
  - Add the predictions. You can use `mutate(size_pred = predict(your_elasticnet_cv_model, newdata = ., s = "lambda.1se", alpha = your_selected_alpha)[,1])` to add the `size_pred` column to your tibble.
  - Use this tibble to make a plot with ggplot. You can add the data points to the plot.
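The marginal-effect steps above can be sketched as follows, on a simulated two-predictor stand-in for the crab data (names follow the exercise, values are made up):

```r
library(dplyr)
library(tidyr)    # expand_grid
library(ggplot2)
library(glmnetUtils)

## simulated stand-in with two predictors (made-up relationship)
set.seed(1)
df_sim <- tibble(air_temp = rnorm(100, 15, 5),
                 water_temp = rnorm(100, 12, 4))
df_sim <- mutate(df_sim,
                 size = 10 + 0.5 * air_temp + 0.2 * water_temp + rnorm(100))

cv_model <- cva.glmnet(size ~ air_temp + water_temp, data = df_sim,
                       alpha = c(0.5, 1), nfolds = 5)
selected_alpha <- 1

## means of all predictors as a one-row tibble; drop the variable of
## interest and cross the rest with a sequence over its observed range
pred_means <- summarise(df_sim, across(c(air_temp, water_temp), mean))
newdata <- expand_grid(
  air_temp = seq(min(df_sim$air_temp), max(df_sim$air_temp), length.out = 50),
  select(pred_means, -air_temp)
)

## add predictions at lambda.1se for the selected alpha
newdata <- newdata %>%
  mutate(size_pred = predict(cv_model, newdata = ., s = "lambda.1se",
                             alpha = selected_alpha)[, 1])

## marginal-effect line with the raw data points on top
ggplot(newdata, aes(air_temp, size_pred)) +
  geom_line() +
  geom_point(data = df_sim, aes(air_temp, size))
```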
2.4 Bonus: Regularization with the penguins dataset
We will reuse the penguins dataset from the previous exercise (05_penguins.csv).
- Drop all rows that contain NA values; you can use `drop_na()`.
- Create a column that combines the `species` and the `sex` columns and name it `species_sex`. You can use `unite()`. Convert the resulting column to a factor.
- Center the `year` column by subtracting its mean.
- Fit an ordinary least squares linear model with `body_mass_g` as the response and the centered `year` and `species_sex` as the predictor variables. Check that the model assumptions are fulfilled.
- Similar to the last exercise, we will fit an elastic net model and use cross-validation to find the best alpha and lambda values. Use the same alpha values as in the last exercise and fit alpha and lambda with 15-fold cross-validation.
- Find a good alpha value.
- Plot the cross-validation error against the lambda values.
- Visualize the effect of lambda on the coefficients of the predictor variables (see last exercise).
- Make predictions (bonus: compare the elastic net predictions with the OLS predictions) for all combinations of the centered year and the species/sex column. You can use `expand_grid` and `predict` to do this.
- Bonus: Compare the coefficient values of the OLS and the elastic net model. First, you have to use the same parametrisation: change the `species_sex` column to sum contrasts (the default is treatment contrasts) and fit the models again. Additionally, you can directly specify alpha in `cv.glmnet` and use the argument `use.model.frame = TRUE` to use the changed contrasts. What is the difference between sum and treatment contrasts? Why can't we use the default parametrisation of `glmnet` in `lm`?
You can use the following code to change the contrasts of the `species_sex` column to sum contrasts:
contr_species_sex <- contr.sum(6)
colnames(contr_species_sex) <- levels(df_penguins$species_sex)[1:5]
contrasts(df_penguins$species_sex) <- contr_species_sex
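To see what the two parametrisations mean, compare the contrast matrices for a small factor (base R only; a three-level factor stands in for the six-level `species_sex`):

```r
## Treatment contrasts (R's default): each column compares one level to the
## reference level (the first level), whose row is all zeros.
contr.treatment(3)
##   2 3
## 1 0 0
## 2 1 0
## 3 0 1

## Sum contrasts: coefficients are deviations from the grand mean; the last
## level is the negative sum of the others, so every column sums to zero.
contr.sum(3)
##   [,1] [,2]
## 1    1    0
## 2    0    1
## 3   -1   -1
```

Because a penalized model shrinks coefficients towards zero, what "zero" means depends on this parametrisation, which is why OLS and elastic net coefficients are only comparable under the same contrasts.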