Fit linear models in R

Day 5 and 6

Felix Nößler

Freie Universität Berlin @ Theoretical Ecology

January 19, 2024

Reproduce slides

1 Workflow

flowchart TD
    A[Design your study] --> B[Collect data]
    A -.->|already think of| C
    B --> C[Fit a model]
    C --> D[Access the model fit]
    D -.->|if it is not good| C
    D --> E[Inference / Hypothesis testing]
    D --> F[Prediction]

2 Example: Size of crab along a latitudinal gradient

The study wanted to analyse the Bergmann’s rule, that states that the body size of animals increases with latitude. They measured the size of the carapace at various locations along the eastern US coast. The data set contains the following variables:

  • latitude: a number denoting the latitude of salt marsh (degree)
  • size: a number denoting carapace width of a crab (millimeter)

the data set is taken from lterdatasampler package, see documentation.

Johnson, D. 2019. Fiddler crab body size in salt marshes from Florida to Massachusetts, USA at PIE and VCR LTER and NOAA NERR sites during summer 2016. ver 1. Environmental Data Initiative. doi:10.6073/pasta/4c27d2e778d3325d3830a5142e3839bb

2.1 Load data

df <- read_csv("data/06_crab_size.csv")
df
# A tibble: 392 × 9
   date       latitude site   size air_temp air_temp_sd water_temp water_temp_sd
   <date>        <dbl> <chr> <dbl>    <dbl>       <dbl>      <dbl>         <dbl>
 1 2016-07-24       30 GTM    12.4     21.8        6.39       24.5          6.12
 2 2016-07-24       30 GTM    14.2     21.8        6.39       24.5          6.12
 3 2016-07-24       30 GTM    14.5     21.8        6.39       24.5          6.12
 4 2016-07-24       30 GTM    12.9     21.8        6.39       24.5          6.12
 5 2016-07-24       30 GTM    12.4     21.8        6.39       24.5          6.12
 6 2016-07-24       30 GTM    13.0     21.8        6.39       24.5          6.12
 7 2016-07-24       30 GTM    10.3     21.8        6.39       24.5          6.12
 8 2016-07-24       30 GTM    11.2     21.8        6.39       24.5          6.12
 9 2016-07-24       30 GTM    12.7     21.8        6.39       24.5          6.12
10 2016-07-24       30 GTM    14.6     21.8        6.39       24.5          6.12
# ℹ 382 more rows
# ℹ 1 more variable: name <chr>

2.2 Fit a linear model

lm1 <- lm(size ~ latitude, data = df)
lm1

Call:
lm(formula = size ~ latitude, data = df)

Coefficients:
(Intercept)     latitude  
    -3.6244       0.4851  
summary(lm1)

Call:
lm(formula = size ~ latitude, data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-7.8376 -1.8797  0.1144  1.9484  6.9280 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -3.62442    1.27405  -2.845  0.00468 ** 
latitude     0.48512    0.03359  14.441  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.832 on 390 degrees of freedom
Multiple R-squared:  0.3484,    Adjusted R-squared:  0.3467 
F-statistic: 208.5 on 1 and 390 DF,  p-value: < 2.2e-16
  • write down the model equation: \[ \begin{align} y &= \alpha + \beta \cdot x + \epsilon \\ \text{size} &= 0.4851 \cdot \text{latitude} -3.6244 + \epsilon \\ \epsilon &\sim \mathcal{N}(0, 2.832) \end{align} \]
  • get the coefficient of determination \(R^2\):
TSS <- sum((df$size - mean(df$size))^2)
RSS <- sum(lm1$residuals^2)

1 - RSS / TSS
[1] 0.3484207

2.3 Analyse the model fit

alpha <- lm1$coefficients[1]
beta <- lm1$coefficients[2]

ggplot(df, aes(latitude, size)) +
  geom_point() +
  geom_abline(intercept = alpha, slope = beta,
              color = "red", linewidth = 1.5)

2.4 Check assumptions

autoplot(lm1, which = 1:2)

2.5 Next steps

you can do predictions, test hypotheses, …

3 Different ways to make predictions

We created a linear model:

df <- tibble(x = seq(1, 20, length = 50),
             y = rnorm(50, 2.0 * x + 5.0, 4.0))
lm1 <- lm(y ~ x, data = df)

3.1 For same data or manually

  • predictions for the same values of the predictor variables:
predict(lm1)  # or: fitted(lm1)
  • make predictions manually for new data (do it only for simple models, but it is good to understand the model):
alpha <- coef(lm1)[1]
beta <- coef(lm1)[2]
x <- c(1, 20)
alpha + beta * x

3.2 Predict for new data

  • predictions for new data with one predictor variable:
df_pred <- tibble(x = seq(10, 30, length = 20)) %>% 
    mutate(y_pred = predict(lm1, newdata = .))
  • predictions for new data with two or more predictor variables:
df_pred <- 
    expand_grid( 
        x = seq(10, 30, length = 20), 
        x2 = seq(2, 4, length = 20),
        x3 = c("site1", "site2", "site3")) %>% 
    mutate(y_pred = predict(lm1, newdata = .))

3.3 Add confidence and prediction intervals

  • notice that we use bind_cols to add several columns to the table
  • we rename the columns because the predict function always uses the same names
df_pred <- tibble(x = seq(10, 30, length = 20)) %>% 
  bind_cols(predict(lm1, newdata = ., interval = "confidence")) %>%
  select(-fit) %>%
  rename(upr_conf = upr, lwr_conf = lwr) %>%
  bind_cols(predict(lm1, newdata = ., interval = "prediction")) %>%
  rename(y_pred = fit, upr_pred = upr, lwr_pred = lwr) 
  
head(df_pred)
# A tibble: 6 × 6
      x lwr_conf upr_conf y_pred lwr_pred upr_pred
  <dbl>    <dbl>    <dbl>  <dbl>    <dbl>    <dbl>
1  10       23.4     25.6   24.5     16.7     32.3
2  11.1     25.5     27.7   26.6     18.8     34.4
3  12.1     27.5     29.7   28.6     20.8     36.4
4  13.2     29.5     31.9   30.7     22.8     38.5
5  14.2     31.4     34.0   32.7     24.9     40.5
6  15.3     33.3     36.2   34.8     26.9     42.6