Hypothesis testing for linear models in `R`

Day 7

Felix Nößler

Freie Universität Berlin @ Theoretical Ecology

January 19, 2024

Reproduce slides

library(dplyr)
library(tidyr)
library(ggplot2)
library(ggfortify)

## theme for ggplot
theme_set(theme_classic())
theme_update(text = element_text(size = 14))

1 Introduction to the example

We want to analyse whether the species Quercus robur and Fagus sylvatica differ in their response to warming in spring.

We designed the following experiment:

30 pots of each species were grown outside
at the first of March we put all pots inside in a greenhouse in two chambers with different temperatures
the first chamber had a temperature similiar to outdoors, the other chamber was three degrees warmer
we measured the opening of the first leaf in spring in Growing degree days

\[ GDD = \sum_{t=t_0}^{t_1} T_{mean, t} - T_{base} \]

2 Hypothesis

Nullhypothesis: Quercus robur and Fagus sylvatica do not differ in their response to warming in spring.

\[ H_0: \text{GDD} = \beta_{\text{species}} + \beta_{\text{treatment}} + \epsilon \]

Alternative hypothesis: Quercus robur and Fagus sylvatica differ in their response to warming in spring.

\[ H_1: \text{GDD} = \beta_{\text{species}} + \beta_{\text{treatment}} + \beta_{\text{species-treament}} + \epsilon \]

3 Hypothesis as a graph

Null hypothesis

flowchart TB
    S --> R
    T --> R

    S[species]
    T[treatment]
    R[response]

Alternative hypothesis

flowchart TB
    S --> R
    T --> ST
    S --> ST
    ST --> R
    T --> R
    
    S[species]
    T[treatment]
    R[response]  
    ST[species-treatment]

4 Load data

This is a simplified version of the data from:

I. Beil, J. Kreyling, C. Meyer, N. Lemcke, and A. V. Malyshev, “Late to bed, late to rise—Warmer autumn temperatures delay spring phenology by delaying dormancy,” Global Change Biology, 2021. doi: 10.1111/gcb.15858.

Code

set.seed(123)
n_data <- 15
df_data <- expand_grid(
  species = factor(c("beech", "oak")), 
  treatment = factor(c("control", "warming"))) %>%
  slice(rep(1:n(), each = n_data)) %>%
  mutate(GDD = 
         ifelse(
           species == "beech", 
            ifelse(treatment == "control", 
                   rnorm(n_data, mean = 263, sd = 42), 
                   rnorm(n_data, mean = 197, sd = 20)), 
            ifelse(treatment == "control", 
                   rnorm(n_data, mean = 166, sd = 45), 
                   rnorm(n_data, mean = 116, sd = 44))))

5 Quick look at the data

df_data

# A tibble: 60 × 3
   species treatment   GDD
   <fct>   <fct>     <dbl>
 1 beech   control    239.
 2 beech   control    253.
 3 beech   control    328.
 4 beech   control    266.
 5 beech   control    268.
 6 beech   control    335.
 7 beech   control    282.
 8 beech   control    210.
 9 beech   control    234.
10 beech   control    244.
# ℹ 50 more rows

summarise the data:

summarise(df_data, GDD_mean = mean(GDD), GDD_sd = sd(GDD), .by = c(species, treatment))

# A tibble: 4 × 4
  species treatment GDD_mean GDD_sd
  <fct>   <fct>        <dbl>  <dbl>
1 beech   control       269.   35.5
2 beech   warming       192.   21.9
3 oak     control       179.   39.0
4 oak     warming       119.   35.8

6 Visualize data

ggplot(df_data, aes(x = treatment, y = GDD)) +
  geom_jitter(alpha = 0.5, size = 3, width = 0.1, height = 0) +
  facet_wrap(~species) +
  theme_light() +
  theme(text = element_text(size = 14))

7 Fit the model and model equation

lm1 <- lm(GDD ~ species + treatment + species:treatment, data = df_data)
lm1


Call:
lm(formula = GDD ~ species + treatment + species:treatment, data = df_data)

Coefficients:
                (Intercept)                   speciesoak  
                     269.40                       -90.11  
           treatmentwarming  speciesoak:treatmentwarming  
                     -77.33                        16.74

\[ \text{GDD} = \left\{ \begin{array}{ll} 269.40 +\epsilon & \text{if control, beech} \\ 269.40 -90.11 +\epsilon & \text{if control, oak} \\ 269.40 -77.33 +\epsilon& \text{if warming, beech} \\ 269.40 -90.11 -77.33 + 16.74 +\epsilon& \text{if warming, oak} \\ \end{array}\right. \]

8 Model equation without interaction

lm(GDD ~ species + treatment, data = df_data)


Call:
lm(formula = GDD ~ species + treatment, data = df_data)

Coefficients:
     (Intercept)        speciesoak  treatmentwarming  
          265.21            -81.74            -68.96

\[ \text{GDD} = \left\{ \begin{array}{ll} 265.21 +\epsilon & \text{if control, beech} \\ 265.21 -81.74 +\epsilon & \text{if control, oak} \\ 265.21 -68.96 +\epsilon& \text{if warming, beech} \\ 265.21 -81.74 -68.96 +\epsilon& \text{if warming, oak} \\ \end{array}\right. \]

9 Check assumptions

autoplot(lm1, which = 1:2)

10 Do the hypothesis test

drop1(lm1, test = "F")

Single term deletions

Model:
GDD ~ species + treatment + species:treatment
                  Df Sum of Sq   RSS    AIC F value Pr(>F)
<none>                         63632 425.99               
species:treatment  1    1051.5 64683 424.97  0.9254 0.3402

how to read the output:

the first row (<none> \(\rightarrow\) nothing is removed) is the full model
the second row is the reduced model without the species:treatment interaction
df: change of the degrees of freedom from the full model, i.e. the number of parameters that were dropped
Sum of Sq: change of the sum of squares from the full model to the reduced model
RSS: residual sum of squares

\(\rightarrow\) we cannot reject the null hypothesis that the species do not differ in their response to warming in spring (F = 0.92, df = 1, p = 0.34).

11 Hypothesis test by hand

lm_full <- lm1
lm_reduced <- lm(GDD ~ species + treatment, data = df_data)

# number of observations
n <- nrow(df_data) 

# sum of squares
RSS_full <- sum(resid(lm_full)^2)
RSS_reduced <- sum(resid(lm_reduced)^2)
sum_sq <- RSS_reduced - RSS_full

# number of parameters
p_reduced <- length(coef(lm_reduced))
p_full <- length(coef(lm_full))

# F statistic
F <- (sum_sq / (p_full - p_reduced ))  / (RSS_full / (n - p_full)) 

# degrees of freedom
df1 <- p_full - p_reduced
df2 <- n - p_full

# p-value
pf(F, df1 = df1, df2 = df2, lower.tail = FALSE)

[1] 0.3402041

12 Notes

if you have several hypothesis tests, you should state that you do an exploratory analysis and correct for multiple testing
for exploratory analysis it is more likely that you find a significant result by chance (spurious relationship)
if you have observational data you should be careful with hypothesis testing and causal reasoning, because you cannot control for all confounding variables

Hypothesis testing for linear models in R