Contrast coding for linear models

Day 5

Felix Nößler

Freie Universität Berlin @ Theoretical Ecology

January 19, 2024


1 What is contrast coding?

  • categorical variables have to be transformed into numerical variables for linear models

  • there are different ways to code discrete categorical variables into numerical variables → contrast coding

  • we cover three types of contrast coding:

    • treatment contrasts
    • sum contrasts
    • cell mean parametrization
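The three codings can be inspected directly as matrices. The following is a minimal sketch in base R; the factor `f` and all object names are hypothetical and only serve as illustration.

```r
# Hypothetical three-level factor, used only for illustration
f <- factor(c("Control", "Treatment1", "Treatment2"))

# Treatment contrasts (R's default for unordered factors):
# the first level is the reference and gets a row of zeros
contr_trt <- contr.treatment(levels(f))

# Sum contrasts: the last level gets -1 in every column,
# so every column sums to zero
contr_sum <- contr.sum(levels(f))

# Design matrices that lm() would build under each coding
X_trt  <- model.matrix(~ f)
X_sum  <- model.matrix(~ f, contrasts.arg = list(f = "contr.sum"))
X_cell <- model.matrix(~ 0 + f)  # cell means: one indicator column per level

print(contr_trt)
print(contr_sum)
print(X_cell)
```

Each row of a contrast matrix shows how one level of the factor is translated into the numerical columns of the design matrix.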

1.1 Default: treatment contrasts

  • create a simple dataset with one categorical variable (a factor with three levels):
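# load tibble, dplyr, ggplot2 and the %>% pipe used throughout
library(tidyverse)

# create a simple dataset with a categorical variable
# (no seed is set, so the simulated values change on every run)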
df_treatment <- tibble(
    f = factor(rep(
        c("Control", "Treatment1", "Treatment2"), 
        each = 30)),
    y = rep(c(10, 20, 22), each = 30) + 
        rnorm(90, mean = 0, sd = 4)
)

lm1 <- lm(y ~ f, data = df_treatment)

ggplot(df_treatment, aes(f, y)) +
    geom_dotplot(binaxis = "y", 
                 stackdir = "center", 
                 dotsize = 0.5) +
    geom_hline(yintercept = coef(lm1)[1], color = "grey") +
    annotate("segment", x = 2.2, xend = 2.2, 
             y = coef(lm1)[1], yend = coef(lm1)[1] + coef(lm1)[2], 
             arrow = arrow(type = "closed", length = unit(0.02, "npc")),
             color = "blue", linetype = "dashed") +
    annotate("segment", x = 3.2, xend = 3.2, 
            y = coef(lm1)[1], yend = coef(lm1)[1] + coef(lm1)[3], 
            arrow = arrow(type = "closed", length = unit(0.02, "npc")),
            color = "red", linetype = "dashed") +
    labs(x = "")

  • fit a linear model:
lm(y ~ f, data = df_treatment) %>% coef
(Intercept) fTreatment1 fTreatment2 
  10.581249    9.934633   10.945073 
  • the coefficients for Treatment1 and Treatment2 are the differences from the reference level Control and can be interpreted as treatment effects:

\[ \begin{align} \text{Control:} \quad & \mu_{\text{control}} = \alpha \\ \text{Treatment 1:} \quad & \mu_{\text{treat1}} = \alpha + \beta_{\text{treat1}} \\ \text{Treatment 2:} \quad & \mu_{\text{treat2}} = \alpha + \beta_{\text{treat2}} \\ \end{align} \]

  • Hypothesis: Are the means of the treatment groups different from the mean of the control group?
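The equations for treatment contrasts can be checked numerically. This is a sketch in base R with freshly simulated data; the seed and the object names are arbitrary assumptions, so the numbers will not match the output printed on the slide.

```r
set.seed(1)  # arbitrary seed, chosen only to make this sketch reproducible

# Simulated data with the same structure as on the slide
d <- data.frame(
    f = factor(rep(c("Control", "Treatment1", "Treatment2"), each = 30)),
    y = rep(c(10, 20, 22), each = 30) + rnorm(90, mean = 0, sd = 4)
)

fit_trt <- lm(y ~ f, data = d)
group_means <- tapply(d$y, d$f, mean)

# intercept = mean of the reference level (Control)
# other coefficients = treatment mean - control mean
manual <- c(group_means["Control"],
            group_means["Treatment1"] - group_means["Control"],
            group_means["Treatment2"] - group_means["Control"])

print(all.equal(unname(coef(fit_trt)), unname(manual)))

# t-tests of each coefficient against zero; for the two treatment
# coefficients this is the test of "treatment differs from control"
print(summary(fit_trt)$coefficients)
```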

1.2 Sum contrasts

df_treatment <- mutate(df_treatment, f = factor(f, levels = c("Treatment1", "Treatment2", "Control")))

f_sum_contrast <- contr.sum(3)
colnames(f_sum_contrast) <- c("Treatment1", "Treatment2")

lm_sum_contr <- lm(y ~ f, data = df_treatment,
   contrasts = list(f = f_sum_contrast)) 

lm_sum_contr %>% coef
(Intercept) fTreatment1 fTreatment2 
  17.541151    2.974731    3.985171 
  • the intercept \(\alpha\) is the grand mean, i.e. the unweighted mean of the three group means

  • the coefficients \(\beta\) are the deviations of the treatment group means from the grand mean; the deviation of Control is not estimated directly but follows as \(-\beta_{\text{treat1}} - \beta_{\text{treat2}}\):

\[ \begin{align} \text{Control:} \quad & \mu_{\text{control}} = \alpha - \beta_{\text{treat1}} - \beta_{\text{treat2}} \\ \text{Treatment 1:} \quad & \mu_{\text{treat1}} = \alpha + \beta_{\text{treat1}} \\ \text{Treatment 2:} \quad & \mu_{\text{treat2}} = \alpha + \beta_{\text{treat2}} \\ \end{align} \]

  • Hypothesis: Are the means of the treatment groups different from the grand mean?
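The sum-contrast equations can be verified in the same way. This is a sketch in base R on freshly simulated data (arbitrary seed, hypothetical object names), so the numbers differ from the slide output.

```r
set.seed(1)  # arbitrary seed, chosen only to make this sketch reproducible

# Same structure as on the slide, with Control as the last level
d <- data.frame(
    f = factor(rep(c("Control", "Treatment1", "Treatment2"), each = 30),
               levels = c("Treatment1", "Treatment2", "Control")),
    y = rep(c(10, 20, 22), each = 30) + rnorm(90, mean = 0, sd = 4)
)

fit_sum <- lm(y ~ f, data = d, contrasts = list(f = "contr.sum"))
b <- unname(coef(fit_sum))  # b[1] = alpha, b[2] = beta_treat1, b[3] = beta_treat2

group_means <- tapply(d$y, d$f, mean)
grand_mean  <- mean(group_means)  # unweighted mean of the group means

# deviation of Control from the grand mean, implied by the sum-to-zero constraint
control_dev <- -(b[2] + b[3])

print(all.equal(b[1], grand_mean))
print(all.equal(b[1] + b[2], unname(group_means["Treatment1"])))
print(all.equal(b[1] + b[3], unname(group_means["Treatment2"])))
print(all.equal(b[1] + control_dev, unname(group_means["Control"])))
```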

ggplot(df_treatment, aes(f, y)) +
    geom_dotplot(binaxis = "y", 
                 stackdir = "center", 
                 dotsize = 0.5) +
    geom_hline(yintercept = coef(lm_sum_contr)[1], color = "grey") +
    annotate("segment", x = 1.2, xend = 1.2, 
             y = coef(lm_sum_contr)[1], 
             yend = coef(lm_sum_contr)[1] + coef(lm_sum_contr)[2], 
             arrow = arrow(type = "closed", length = unit(0.02, "npc")),
             color = "blue", linetype = "dashed") +
    annotate("segment", x = 2.2, xend = 2.2, 
            y = coef(lm_sum_contr)[1], 
            yend = coef(lm_sum_contr)[1] + coef(lm_sum_contr)[3], 
            arrow = arrow(type = "closed", length = unit(0.02, "npc")),
            color = "red", linetype = "dashed") +
    labs(x = "")

1.3 Cell mean parametrization

lm(y ~ -1 + f, data = df_treatment) %>% coef
fTreatment1 fTreatment2    fControl 
   20.51588    21.52632    10.58125 
  • the model has no common intercept; each coefficient \(\alpha\) is the mean of one group:

\[ \begin{align} \text{Control:} \quad & \mu_{\text{control}} = \alpha_{\text{control}} \\ \text{Treatment 1:} \quad & \mu_{\text{treat1}} = \alpha_{\text{treat1}} \\ \text{Treatment 2:} \quad & \mu_{\text{treat2}} = \alpha_{\text{treat2}} \\ \end{align} \]

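  • the default coefficient tests compare each group mean with zero, which is rarely a hypothesis of interest, so this coding is not useful for hypothesis testing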
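A short sketch in base R of what the cell mean parametrization returns, again on freshly simulated data with an arbitrary seed and hypothetical object names. The coefficients are the group means, which makes this coding convenient for reporting group means with confidence intervals.

```r
set.seed(1)  # arbitrary seed, chosen only to make this sketch reproducible

d <- data.frame(
    f = factor(rep(c("Control", "Treatment1", "Treatment2"), each = 30)),
    y = rep(c(10, 20, 22), each = 30) + rnorm(90, mean = 0, sd = 4)
)

# "-1" (or "0 +") removes the common intercept
fit_cell <- lm(y ~ -1 + f, data = d)
group_means <- tapply(d$y, d$f, mean)

# each coefficient equals the mean of its group
print(all.equal(unname(coef(fit_cell)), as.numeric(group_means)))

# confidence intervals for the group means
print(confint(fit_cell))

# the default t-tests compare each group mean with zero
print(summary(fit_cell)$coefficients)
```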

1.4 When does it matter?

  • the model predictions are the same for all three contrast coding schemes for standard linear models
  • the interpretation of the coefficients is different → this matters for hypothesis testing
  • for regularization (shrinkage of the model coefficients) the contrast coding scheme does matter; we will cover this later
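The first two points can be illustrated with a small comparison. This is a sketch in base R on freshly simulated data (arbitrary seed, hypothetical object names): the three models give identical fitted values, while their coefficients differ.

```r
set.seed(1)  # arbitrary seed, chosen only to make this sketch reproducible

d <- data.frame(
    f = factor(rep(c("Control", "Treatment1", "Treatment2"), each = 30)),
    y = rep(c(10, 20, 22), each = 30) + rnorm(90, mean = 0, sd = 4)
)

fit_trt  <- lm(y ~ f, data = d)                                     # treatment contrasts
fit_sum  <- lm(y ~ f, data = d, contrasts = list(f = "contr.sum"))  # sum contrasts
fit_cell <- lm(y ~ 0 + f, data = d)                                 # cell means

# identical predictions under all three codings
same_fit <- isTRUE(all.equal(fitted(fit_trt), fitted(fit_sum))) &&
    isTRUE(all.equal(fitted(fit_trt), fitted(fit_cell)))
print(same_fit)

# different coefficients, hence different hypotheses being tested
print(coef(fit_trt))
print(coef(fit_sum))
print(coef(fit_cell))
```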