Contrast coding for linear models

Day 5

Felix Nößler

Freie Universität Berlin @ Theoretical Ecology

January 19, 2024

Reproduce slides

library(dplyr)
library(forcats)
library(ggplot2)

## theme for ggplot
theme_set(theme_classic())
theme_update(text = element_text(size = 16))

1 What is contrast coding?

categorical variables have to be transformed into numerical variables for linear models
there are different ways to code discrete categorical variables into numerical variables → contrast coding
we cover three types of contrast coding:
- treatment contrasts
- sum contrasts
- cell mean parametrization

1.1 Default: treatment contrasts

create a simple dataset with a categorical variable and three factors:

Code

# let's create a simple dataset with a categorical variable
df_treatment <- tibble(
    f = factor(rep(
        c("Control", "Treatment1", "Treatment2"), 
        each = 30)),
    y = rep(c(10, 20, 22), each = 30) + 
        rnorm(90, mean = 0, sd = 4)
)

lm1 <- lm(y ~ f, data = df_treatment)

ggplot(df_treatment, aes(f, y)) +
    geom_dotplot(binaxis = "y", 
                 stackdir = "center", 
                 dotsize = 0.5) +
    geom_hline(yintercept = coef(lm1)[1], color = "grey") +
    annotate("segment", x = 2.2, xend = 2.2, 
             y = coef(lm1)[1], yend = coef(lm1)[1] + coef(lm1)[2], 
             arrow = arrow(type = "closed", length = unit(0.02, "npc")),
             color = "blue", linetype = "dashed") +
    annotate("segment", x = 3.2, xend = 3.2, 
            y = coef(lm1)[1], yend = coef(lm1)[1] + coef(lm1)[3], 
            arrow = arrow(type = "closed", length = unit(0.02, "npc")),
            color = "red", linetype = "dashed") +
    labs(x = "")

fit a linear model:

lm(y ~ f, data = df_treatment) %>% coef

(Intercept) fTreatment1 fTreatment2 
  10.581249    9.934633   10.945073

the values for Treatment1 and Treatment2 are the differences to the reference level Control and can be interpreted as an treatment effect \[ \begin{align} \text{Control:} \quad & \mu_{\text{control}} = \alpha \\ \text{Treatment 1:} \quad & \mu_{\text{treat1}} = \alpha + \beta_{\text{treat1}} \\ \text{Treatment 2:} \quad & \mu_{\text{treat2}} = \alpha + \beta_{\text{treat2}} \\ \end{align} \]
Hypothesis: Are the means of the treatment groups different from the control group?

1.2 Sum contrasts

df_treatment <- mutate(df_treatment, f = factor(f, levels = c("Treatment1", "Treatment2", "Control")))

f_sum_contrast <- contr.sum(3)
colnames(f_sum_contrast) <- c("Treatment1", "Treatment2")

lm_sum_contr <- lm(y ~ f, data = df_treatment,
   contrasts = list(f = f_sum_contrast)) 

lm_sum_contr %>% coef

(Intercept) fTreatment1 fTreatment2 
  17.541151    2.974731    3.985171

the intercept \(\alpha\) is the grand mean of all groups
the coefficients \(\beta\) are the differences to the grand mean \[ \begin{align} \text{Control:} \quad & \mu_{\text{control}} = \alpha + -\beta_{\text{treat1}} -\beta_{\text{treat2}} \\ \text{Treatment 1:} \quad & \mu_{\text{treat1}} = \alpha + \beta_{\text{treat1}} \\ \text{Treatment 2:} \quad & \mu_{\text{treat2}} = \alpha + \beta_{\text{treat2}} \\ \end{align} \]
Hypothesis: Are the mean from the treatment groups different from the grand mean?

Code

ggplot(df_treatment, aes(f, y)) +
    geom_dotplot(binaxis = "y", 
                 stackdir = "center", 
                 dotsize = 0.5) +
    geom_hline(yintercept = coef(lm_sum_contr)[1], color = "grey") +
    annotate("segment", x = 1.2, xend = 1.2, 
             y = coef(lm_sum_contr)[1], 
             yend = coef(lm_sum_contr)[1] + coef(lm_sum_contr)[2], 
             arrow = arrow(type = "closed", length = unit(0.02, "npc")),
             color = "blue", linetype = "dashed") +
    annotate("segment", x = 2.2, xend = 2.2, 
            y = coef(lm_sum_contr)[1], 
            yend = coef(lm_sum_contr)[1] + coef(lm_sum_contr)[3], 
            arrow = arrow(type = "closed", length = unit(0.02, "npc")),
            color = "red", linetype = "dashed") +
    labs(x = "")

1.3 Cell mean parametrization

lm(y ~ -1 + f, data = df_treatment) %>% coef

fTreatment1 fTreatment2    fControl 
   20.51588    21.52632    10.58125

the intercepts \(\alpha\) are the mean values of the groups:

\[ \begin{align} \text{Control:} \quad & \mu_{\text{control}} = \alpha_{\text{control}} \\ \text{Treatment 1:} \quad & \mu_{\text{treat1}} = \alpha_{\text{treat1}} \\ \text{Treatment 2:} \quad & \mu_{\text{treat2}} = \alpha_{\text{treat2}} \\ \end{align} \]

is not useful for hypothesis testing

1.4 When does it matter?

the model predictions are the same for all three contrast coding schemes for standard linear models
the interpretation of the coefficients is different → this matters for hypothesis testing
for regularization (shrinkage of the model coefficients) the contrast coding scheme matters, we will cover this later