Data transformation for linear models

1 Lecture

Slides in full screen

2 Exercise

2.1 Diatoms

We will use an artificial dataset (07_diatoms.csv) to practice data transformation for linear models. The dataset contains the concentration of two diatom species over time. The diatoms were grown at two different pH levels.

  1. Create a subset that contains only species 1 at low pH.
  2. Fit a model of the diatom concentration (conc) over time (day).
  3. Apply a suitable transformation and fit the model again.
  4. Follow the protocol on data transformation for linear models.
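
The steps above can be sketched in R as follows. This is only a minimal sketch on simulated data, not the official solution: the column names `species` and `pH` and their values are assumptions (only `conc` and `day` are named in the exercise), and the log transformation is just one plausible choice for growth data. The diagnostic plots are one common way to compare the models, not necessarily the full protocol from the lecture.

```r
# Simulated stand-in for 07_diatoms.csv.
# `species` and `pH` are assumed column names; `conc` and `day` come from the exercise.
set.seed(1)
diatoms <- expand.grid(day = 0:10, species = c(1, 2), pH = c("low", "high"))
diatoms$conc <- 50 * exp(0.3 * diatoms$day) * exp(rnorm(nrow(diatoms), sd = 0.1))

# 1. Subset: species 1 at low pH only
sub <- subset(diatoms, species == 1 & pH == "low")

# 2. Linear model on the raw scale
m_raw <- lm(conc ~ day, data = sub)

# 3. Same model after log-transforming the response
m_log <- lm(log(conc) ~ day, data = sub)

# 4. Compare the diagnostic plots of both models
par(mfrow = c(2, 2))
plot(m_raw)
plot(m_log)
```

With real data, replace the simulated data frame with `read.csv("07_diatoms.csv")` and adapt the column names and levels to the file.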

Solution for exercise 1

2.2 Bonus: Covid-19

We will use a dataset about Covid-19 cases in Germany (07_bl_infektionen.csv). The column bl_inz contains the reported Covid-19 cases of the last 7 days. The dataset was downloaded from https://www.corona-daten-deutschland.de/dataset/infektionen_bundeslaender.

  1. Subset the data for the state of Berlin.
  2. Plot the reported Covid-19 cases over time.
  3. Do you see an exponential growth of the number of cases at some point?
  4. Subset the data to specific timepoints (for example: 11.03.2020 - 22.03.2020 and 15.10.2021 - 08.11.2021) and try to fit an exponential model to the data.
  5. Check the assumptions for a linear and an exponential model.
  6. Plot the raw data and the fitted model.
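
Steps 4 to 6 can be sketched like this. The data are simulated for one of the example time windows, so the numbers are not real case counts, and the column name `date` is an assumption (only `bl_inz` is named in the exercise). The exponential model is fitted as a linear model on the log scale and back-transformed with `exp()` for plotting.

```r
# Simulated stand-in for one growth window of the Berlin subset.
# `date` is an assumed column name; `bl_inz` is the column named in the exercise.
set.seed(42)
berlin <- data.frame(
  date = seq(as.Date("2020-03-11"), as.Date("2020-03-22"), by = "day")
)
berlin$t <- as.numeric(berlin$date - min(berlin$date))  # days since window start
berlin$bl_inz <- 2 * exp(0.25 * berlin$t) * exp(rnorm(nrow(berlin), sd = 0.05))

m_lin <- lm(bl_inz ~ t, data = berlin)       # linear model
m_exp <- lm(log(bl_inz) ~ t, data = berlin)  # exponential model, fitted on the log scale

# Raw data with both fitted models
plot(berlin$date, berlin$bl_inz, xlab = "Date", ylab = "bl_inz")
lines(berlin$date, fitted(m_lin), lty = 2)   # straight line
lines(berlin$date, exp(fitted(m_exp)))       # back-transformed exponential curve
```

Calling `plot(m_lin)` and `plot(m_exp)` afterwards shows the residual diagnostics for checking the model assumptions.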

Solution for exercise 2

2.3 Bonus: Population growth

Use the dataset 07_population-and-demography.csv. It contains the population of all countries over time and was downloaded from https://ourworldindata.org/population-growth.

  1. Calculate the world population for each year.
  2. Plot the world population over time.
  3. Fit a linear model and an exponential model to the world population over time.
  4. Which model fits better? Why?
  5. Calculate the population growth rate for each year and plot it over time.

Use the group_by and summarise functions to calculate the world population for each year.

Alternatively, use summarise(df, Population = sum(Population), .by = "Year"). The .by argument requires dplyr 1.1.0 or later.
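
Both variants can be tried on a toy data frame. The values below are made up, and the column names `Country`, `Year` and `Population` are assumptions that follow the hints; the real file may name them differently.

```r
library(dplyr)

# Toy stand-in for 07_population-and-demography.csv (made-up values).
df <- tibble(
  Country    = c("A", "A", "B", "B"),
  Year       = c(2000, 2001, 2000, 2001),
  Population = c(10, 12, 20, 21)
)

# Variant 1: group_by() followed by summarise()
world <- df %>%
  group_by(Year) %>%
  summarise(Population = sum(Population))

# Variant 2: per-operation grouping with .by (dplyr >= 1.1.0)
world_by <- summarise(df, Population = sum(Population), .by = Year)
```

Both calls return one row per year with the summed population.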

If you assume exponential growth, the logarithm of the population is a linear function of time, so you can fit the model with the lm function: lm(log(Population) ~ Year, data = df).
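
A minimal sketch of this fit on simulated yearly totals (the numbers are invented, not the real world population). The slope of the log-scale model estimates the growth rate per year, and predictions have to be back-transformed with `exp()`.

```r
# Simulated yearly totals; `Year` and `Population` follow the names used in the hints.
set.seed(7)
df <- data.frame(Year = 1950:2020)
df$Population <- 2.5e9 * exp(0.017 * (df$Year - 1950)) *
  exp(rnorm(nrow(df), sd = 0.01))

m_lin <- lm(Population ~ Year, data = df)       # linear model
m_exp <- lm(log(Population) ~ Year, data = df)  # exponential model on the log scale

# Estimated growth rate per year (slope on the log scale)
r <- unname(coef(m_exp)["Year"])

# Prediction on the original scale
pred_2021 <- exp(predict(m_exp, newdata = data.frame(Year = 2021)))
```

Comparing the residual plots of `m_lin` and `m_exp` is one way to judge which model describes the data better.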

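Use the lag function to calculate the population growth rate for each year. The growth rate is the change relative to the previous year: (Population - lag(Population)) / lag(Population), which is the same as Population / lag(Population) - 1.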
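
As a small illustration with made-up numbers, the relative growth rate compares each year with the previous one. The data frame `world` and its values are hypothetical; `lag()` is the dplyr function, so the rows must be sorted by year first.

```r
library(dplyr)

# Hypothetical yearly totals (made-up values).
world <- tibble(Year = 2000:2003, Population = c(100, 102, 105, 105))

world <- world %>%
  arrange(Year) %>%  # lag() depends on row order
  mutate(growth_rate = (Population - lag(Population)) / lag(Population))
```

The first year has no predecessor, so its growth rate is `NA`; a year with unchanged population has a growth rate of 0.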

Solution for exercise 3