More exercises with linear models

1 Exercises

  • the first exercise has one numerical and one categorical predictor
  • the second exercise has two categorical predictors
  • the third exercise has two numerical predictors
  • in the fourth exercise you have to create your own dataset

1.1 Bacteria density

We will use the artificial 06_bacteria.csv dataset. It contains the density (µgC/ml) of three different bacteria species and the temperature (°C) of the water. A code sketch for the main steps follows the list below.

  1. Fit a linear model without interaction.
  2. Write down the equation of the model.
  3. Check model assumptions.
  4. Plot predictions along with the data.
  5. Fit a linear model with interaction.
  6. Write down the equation of the model.
  7. Check model assumptions.
  8. Plot predictions along with the data. Try out different colours or facets for the different species. Which model fits the data better? Why? What does the interaction mean in this model?
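
A minimal sketch of steps 1, 5 and 8, assuming the data are in long format with columns named density, temperature and species (these names are an assumption; check them with names(bacteria) and adapt the code):

library(tidyverse)

bacteria <- read_csv("06_bacteria.csv")

# without interaction: one common temperature slope, species-specific intercepts
fit_add <- lm(density ~ temperature + species, data = bacteria)

# with interaction: species-specific slopes and intercepts
fit_int <- lm(density ~ temperature * species, data = bacteria)

# check the assumptions (residuals vs fitted, QQ plot, scale-location, leverage)
par(mfrow = c(2, 2))
plot(fit_int)

# predictions along with the data, one colour per species
bacteria |>
  mutate(pred = predict(fit_int)) |>
  ggplot(aes(x = temperature, colour = species)) +
  geom_point(aes(y = density)) +
  geom_line(aes(y = pred))

Comparing fit_add and fit_int with anova(fit_add, fit_int) or AIC() is one way to judge which model fits the data better.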

Solution for exercise 1

1.2 Maple health in response to calcium addition and elevation

We will use the 06_maples.csv dataset on the health of maples after calcium addition. The dataset contains the stem length of the seedlings (mm, stem_length), the elevation (low/high) and the calcium addition (watershed; W1 with calcium addition, reference without calcium addition). A sketch of the data preparation and model fits follows the list below.

  1. Rename the watershed variable and rename the factor levels of the calcium addition variable.
  2. Select only the columns that you need for your analysis.
  3. Fit a linear model with elevation and calcium addition as predictors and the stem length as dependent variable.
  4. Write down the equation of the model.
  5. Check model assumptions.
  6. Plot predictions along with the data (bonus: add prediction interval and confidence interval).
  7. Fit a model with interaction of elevation and calcium addition.
  8. Write down the equation of the model.
  9. Check which model fits the data better and explain why.
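
A possible sketch of the data preparation and the two models; the new variable name calcium and the factor levels "Reference" and "W1" are assumptions, so check the actual values in the file before recoding:

library(tidyverse)

maples <- read_csv("06_maples.csv")

maples <- maples |>
  # rename the watershed variable to something more descriptive
  rename(calcium = watershed) |>
  # rename the factor levels (assumed original levels: "Reference" and "W1")
  mutate(calcium = factor(calcium,
                          levels = c("Reference", "W1"),
                          labels = c("no addition", "addition"))) |>
  # keep only the columns needed for the analysis
  select(stem_length, elevation, calcium)

# model without and with interaction of elevation and calcium addition
fit_add <- lm(stem_length ~ elevation + calcium, data = maples)
fit_int <- lm(stem_length ~ elevation * calcium, data = maples)

# compare the two models
anova(fit_add, fit_int)
AIC(fit_add, fit_int)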

Solution for exercise 2

1.3 Bonus: Crab size along a latitudinal gradient

We will use the 06_crab_size.csv dataset. You can find the documentation here. The dataset contains the carapace width (mm) of crabs, the mean annual air and water temperature (°C) and the latitude (degrees) of the sites where the crabs were sampled. A sketch of the model fit and the marginal-effect predictions follows the list below.

  1. Plot the water temperature against the latitude.
  2. Fit a linear model with the water temperature and latitude as predictors and the carapace width as dependent variable.
  3. Write down the equation of the model.
  4. Check model assumptions.
  5. Plot the marginal effects of the two variables: make predictions for one variable at a time while the other variable is held constant at its mean value.
  6. Did you notice any violations of the model assumptions? What could you do to improve the model?
  7. Do you have an explanation for the coefficient value of the slope of the water temperature?
  8. Bonus: plot the effect of both predictors in one plot with geom_contour or geom_raster.
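
A sketch for the model with two numerical predictors and the marginal-effect plot; the column names carapace_width, water_temp and latitude are assumptions and need to be adapted to the actual file:

library(tidyverse)

crabs <- read_csv("06_crab_size.csv")

# linear model with two numerical predictors
fit <- lm(carapace_width ~ water_temp + latitude, data = crabs)

# marginal effect of latitude: vary latitude, hold water temperature at its mean
newdat_lat <- tibble(
  latitude = seq(min(crabs$latitude), max(crabs$latitude), length.out = 100),
  water_temp = mean(crabs$water_temp))
newdat_lat$pred <- predict(fit, newdata = newdat_lat)

ggplot(crabs, aes(x = latitude, y = carapace_width)) +
  geom_point() +
  geom_line(data = newdat_lat, aes(y = pred))

# bonus: predictions on a grid of both predictors
grid <- expand_grid(
  water_temp = seq(min(crabs$water_temp), max(crabs$water_temp), length.out = 50),
  latitude = seq(min(crabs$latitude), max(crabs$latitude), length.out = 50))
grid$pred <- predict(fit, newdata = grid)

ggplot(grid, aes(x = water_temp, y = latitude, fill = pred)) +
  geom_raster()

The same approach with water_temp varied and latitude held at its mean gives the second marginal-effect plot.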

Solution for exercise 3

1.4 Bonus: Fit a linear model to an artificial dataset

Later, you will fit complex models to data that have a complicated structure. It is often a good idea to first create a toy dataset, which has the same structure as your real data. This allows you to test your code and to see if you can recover the parameters of the model.

  1. Create a dataset that comes from a linear model with one numerical predictor variable. You have to specify values (just make up some values) for the intercept, the slope, and the standard deviation of the residuals. Moreover, you need some x values of the numerical predictor. You can use the function rnorm to create random numbers from a normal distribution.
  2. Fit a linear model to the data.
  3. Check model assumptions.
  4. Check if you can recover the parameters of the model.
  5. Plot your data along with the regression line (bonus: prediction interval and confidence interval).
  6. Bonus: vary the standard deviation of the residuals and the sample size. What happens to the parameter estimates and the prediction interval?

This gets you started with creating the dataset:

library(tidyverse)

alpha <- 200       # intercept
beta <- -3         # slope
sigma_val <- 10    # standard deviation of the residuals
nvalues <- 100     # sample size

df <- tibble(
    x = seq(0, 20, length.out = nvalues),
    y = alpha + beta * x + rnorm(nvalues, mean = 0, sd = sigma_val))
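
Once the toy data exist, a minimal sketch of the model fit and the parameter check could look like this (the estimates should be close to the values of alpha, beta and sigma_val chosen above):

fit <- lm(y ~ x, data = df)

# compare the estimates to the true values used in the simulation
coef(fit)    # intercept and slope, should be close to alpha and beta
sigma(fit)   # residual standard deviation, should be close to sigma_val

# data with regression line and confidence interval
ggplot(df, aes(x = x, y = y)) +
  geom_point() +
  geom_smooth(method = "lm")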

Solution for exercise 4