library(dplyr)
library(forcats)
library(ggplot2)
## theme for ggplot
theme_set(theme_classic())
theme_update(text = element_text(size = 16))Day 5
Freie Universität Berlin @ Theoretical Ecology
January 19, 2024
library(dplyr)
library(forcats)
library(ggplot2)
## theme for ggplot
theme_set(theme_classic())
theme_update(text = element_text(size = 16))categorical variables have to be transformed into numerical variables for linear models
there are different ways to code discrete categorical variables into numerical variables → contrast coding
we cover three types of contrast coding:
# let's create a simple dataset with a categorical variable
df_treatment <- tibble(
f = factor(rep(
c("Control", "Treatment1", "Treatment2"),
each = 30)),
y = rep(c(10, 20, 22), each = 30) +
rnorm(90, mean = 0, sd = 4)
)
lm1 <- lm(y ~ f, data = df_treatment)
ggplot(df_treatment, aes(f, y)) +
geom_dotplot(binaxis = "y",
stackdir = "center",
dotsize = 0.5) +
geom_hline(yintercept = coef(lm1)[1], color = "grey") +
annotate("segment", x = 2.2, xend = 2.2,
y = coef(lm1)[1], yend = coef(lm1)[1] + coef(lm1)[2],
arrow = arrow(type = "closed", length = unit(0.02, "npc")),
color = "blue", linetype = "dashed") +
annotate("segment", x = 3.2, xend = 3.2,
y = coef(lm1)[1], yend = coef(lm1)[1] + coef(lm1)[3],
arrow = arrow(type = "closed", length = unit(0.02, "npc")),
color = "red", linetype = "dashed") +
labs(x = "")the values for Treatment1 and Treatment2 are the differences to the reference level Control and can be interpreted as an treatment effect \[
\begin{align}
\text{Control:} \quad & \mu_{\text{control}} = \alpha \\
\text{Treatment 1:} \quad & \mu_{\text{treat1}} = \alpha + \beta_{\text{treat1}} \\
\text{Treatment 2:} \quad & \mu_{\text{treat2}} = \alpha + \beta_{\text{treat2}} \\
\end{align}
\]
Hypothesis: Are the means of the treatment groups different from the control group?
df_treatment <- mutate(df_treatment, f = factor(f, levels = c("Treatment1", "Treatment2", "Control")))
f_sum_contrast <- contr.sum(3)
colnames(f_sum_contrast) <- c("Treatment1", "Treatment2")
lm_sum_contr <- lm(y ~ f, data = df_treatment,
contrasts = list(f = f_sum_contrast))
lm_sum_contr %>% coef(Intercept) fTreatment1 fTreatment2
17.541151 2.974731 3.985171
the intercept \(\alpha\) is the grand mean of all groups
the coefficients \(\beta\) are the differences to the grand mean \[ \begin{align} \text{Control:} \quad & \mu_{\text{control}} = \alpha + -\beta_{\text{treat1}} -\beta_{\text{treat2}} \\ \text{Treatment 1:} \quad & \mu_{\text{treat1}} = \alpha + \beta_{\text{treat1}} \\ \text{Treatment 2:} \quad & \mu_{\text{treat2}} = \alpha + \beta_{\text{treat2}} \\ \end{align} \]
Hypothesis: Are the mean from the treatment groups different from the grand mean?
ggplot(df_treatment, aes(f, y)) +
geom_dotplot(binaxis = "y",
stackdir = "center",
dotsize = 0.5) +
geom_hline(yintercept = coef(lm_sum_contr)[1], color = "grey") +
annotate("segment", x = 1.2, xend = 1.2,
y = coef(lm_sum_contr)[1],
yend = coef(lm_sum_contr)[1] + coef(lm_sum_contr)[2],
arrow = arrow(type = "closed", length = unit(0.02, "npc")),
color = "blue", linetype = "dashed") +
annotate("segment", x = 2.2, xend = 2.2,
y = coef(lm_sum_contr)[1],
yend = coef(lm_sum_contr)[1] + coef(lm_sum_contr)[3],
arrow = arrow(type = "closed", length = unit(0.02, "npc")),
color = "red", linetype = "dashed") +
labs(x = "")fTreatment1 fTreatment2 fControl
20.51588 21.52632 10.58125
\[ \begin{align} \text{Control:} \quad & \mu_{\text{control}} = \alpha_{\text{control}} \\ \text{Treatment 1:} \quad & \mu_{\text{treat1}} = \alpha_{\text{treat1}} \\ \text{Treatment 2:} \quad & \mu_{\text{treat2}} = \alpha_{\text{treat2}} \\ \end{align} \]
Contrast coding for linear models