Day 3
Freie Universität Berlin @ Theoretical Ecology
Idea of Tidy Data
Standardized data structure for easy and efficient data cleaning, visualization and analysis
Happy families are all alike; every unhappy family is unhappy in its own way — Leo Tolstoy
Illustration from the Openscapes blog Tidy Data for reproducibility, efficiency, and collaboration by Julia Lowndes and Allison Horst
# A tibble: 4 × 4
name quiz1 quiz2 test1
<chr> <chr> <chr> <chr>
1 Billy <NA> D C
2 Suzy F <NA> <NA>
3 Lionel B C B
4 Jenny A A B
# A tibble: 4 × 4
name quiz1 quiz2 test1
<chr> <chr> <chr> <chr>
1 Billy <NA> D C
2 Suzy F <NA> <NA>
3 Lionel B C B
4 Jenny A A B
# A tibble: 3 × 5
assessment Billy Suzy Lionel Jenny
<chr> <chr> <chr> <chr> <chr>
1 quiz1 <NA> F B A
2 quiz2 D <NA> C A
3 test1 C <NA> B B
# A tibble: 4 × 4
name quiz1 quiz2 test1
<chr> <chr> <chr> <chr>
1 Billy <NA> D C
2 Suzy F <NA> <NA>
3 Lionel B C B
4 Jenny A A B
Illustration from the Openscapes blog Tidy Data for reproducibility, efficiency, and collaboration by Julia Lowndes and Allison Horst
# A tibble: 12 × 3
name assessment grade
<chr> <chr> <chr>
1 Billy quiz1 <NA>
2 Billy quiz2 D
3 Billy test1 C
4 Jenny quiz1 A
5 Jenny quiz2 A
6 Jenny test1 B
7 Lionel quiz1 B
8 Lionel quiz2 C
9 Lionel test1 B
10 Suzy quiz1 F
11 Suzy quiz2 <NA>
12 Suzy test1 <NA>
Caution
There is no unique answer what is a variable and observation. It depends on your questions.
# A tibble: 4 × 4
name quiz1 quiz2 test1
<chr> <chr> <chr> <chr>
1 Billy <NA> D C
2 Suzy F <NA> <NA>
3 Lionel B C B
4 Jenny A A B
tidyr::pivot_longer
# A tibble: 4 × 4
name quiz1 quiz2 test1
<chr> <chr> <chr> <chr>
1 Billy <NA> D C
2 Suzy F <NA> <NA>
3 Lionel B C B
4 Jenny A A B

classroom2 <- classroom %>%
pivot_longer(quiz1:test1,
names_to = "assessment",
values_to = "grade") %>%
arrange(name, assessment)
classroom2# A tibble: 12 × 3
name assessment grade
<chr> <chr> <chr>
1 Billy quiz1 <NA>
2 Billy quiz2 D
3 Billy test1 C
4 Jenny quiz1 A
5 Jenny quiz2 A
6 Jenny test1 B
7 Lionel quiz1 B
8 Lionel quiz2 C
9 Lionel test1 B
10 Suzy quiz1 F
11 Suzy quiz2 <NA>
12 Suzy test1 <NA>
The most common problems with messy datasets
cities_v <- c("Istanbul", "Moscow", "London", "Saint Petersburg", "Berlin", "Madrid", "Kyiv", "Rome", "Bucharest", "Paris")
population <- c(15.1e6, 12.5e6, 9e6, 5.4e6, 3.8e6, 3.2e6, 3e6, 2.8e6, 2.2e6, 2.1e6)
area_km2 <- c(2576, 2561, 1572, 1439,891,604, 839, 1285, 228, 105 )
country <- c("Turkey", "Russia", "UK", "Russia", "Germany", "Spain",
"Ukraine", "Italy", "Romania", "France")
# tidy dataset
cities_tidy <- tibble(city = cities_v,
population = population,
area_km2 = area_km2,
country = country)
# make untidy
cities_untidy <- cities_tidy %>%
unite(col = "location", c("country", "city")) %>%
pivot_longer(c("population", "area_km2"), names_to = "type") %>%
pivot_wider(names_from = "location", values_from = "value")
# print in nice format
cities_untidy %>% knitr::kable()| type | Turkey_Istanbul | Russia_Moscow | UK_London | Russia_Saint Petersburg | Germany_Berlin | Spain_Madrid | Ukraine_Kyiv | Italy_Rome | Romania_Bucharest | France_Paris |
|---|---|---|---|---|---|---|---|---|---|---|
| population | 15100000 | 12500000 | 9000000 | 5400000 | 3800000 | 3200000 | 3000000 | 2800000 | 2200000 | 2100000 |
| area_km2 | 2576 | 2561 | 1572 | 1439 | 891 | 604 | 839 | 1285 | 228 | 105 |
# A tibble: 2 × 11
type Turkey_Istanbul Russia_Moscow UK_London `Russia_Saint Petersburg`
<chr> <dbl> <dbl> <dbl> <dbl>
1 population 15100000 12500000 9000000 5400000
2 area_km2 2576 2561 1572 1439
# ℹ 6 more variables: Germany_Berlin <dbl>, Spain_Madrid <dbl>,
# Ukraine_Kyiv <dbl>, Italy_Rome <dbl>, Romania_Bucharest <dbl>,
# France_Paris <dbl>
# A tibble: 2 × 11
type Turkey_Istanbul Russia_Moscow UK_London `Russia_Saint Petersburg`
<chr> <dbl> <dbl> <dbl> <dbl>
1 population 15100000 12500000 9000000 5400000
2 area_km2 2576 2561 1572 1439
# ℹ 6 more variables: Germany_Berlin <dbl>, Spain_Madrid <dbl>,
# Ukraine_Kyiv <dbl>, Italy_Rome <dbl>, Romania_Bucharest <dbl>,
# France_Paris <dbl>
What is not tidy here?
Tip
Let’s tidy this data using functions from the tidyr package!
pivot_longer()One variable split into multiple columns can be solved with pivot_longer
step1 <- pivot_longer(
cities_untidy, # the tibble
cols = Turkey_Istanbul:France_Paris, # the columns to pivot from:to
names_to = "location", # name of the new column
values_to = "value") # name of the value column# A tibble: 20 × 3
type location value
<chr> <chr> <dbl>
1 population Turkey_Istanbul 15100000
2 population Russia_Moscow 12500000
3 population UK_London 9000000
4 population Russia_Saint Petersburg 5400000
# ℹ 16 more rows
Another way to select the columns to pivot:
separate_wider_delim()Multiple variable values that are united into one can be separated using separate_wider_delim
step2 <- separate_wider_delim(
step1, # the tibble
location, # the column to separate
delim = "_", # the separator
names = c("country", "city")) # names of new columns# A tibble: 20 × 4
type country city value
<chr> <chr> <chr> <dbl>
1 population Turkey Istanbul 15100000
2 population Russia Moscow 12500000
# ℹ 18 more rows
The opposite function exists as well and is called unite. Check out ?unite for details.
pivot_wider()One observation split into multiple rows can solved with pivot_wider
summary(step2) type country city value
area_km2 :10 Russia :4 Berlin :2 Min. : 105
population:10 France :2 Bucharest:2 1st Qu.: 1186
Germany:2 Istanbul :2 Median : 1051288
Italy :2 Kyiv :2 Mean : 2955605
Romania:2 London :2 3rd Qu.: 3350000
Spain :2 Madrid :2 Max. :15100000
(Other):6 (Other) :8
step3 <- pivot_wider(
step2, # the tibble
names_from = type, # the variables
values_from = value) # the values# A tibble: 10 × 4
country city population area_km2
<fct> <fct> <dbl> <dbl>
1 Turkey Istanbul 15100000 2576
2 Russia Moscow 12500000 2561
3 UK London 9000000 1572
4 Russia Saint Petersburg 5400000 1439
5 Germany Berlin 3800000 891
# ℹ 5 more rows
Now, we have a tidy data set!!!
We can also use a pipe to do all these steps in one:
cities_tidy <- cities_untidy %>%
pivot_longer(
Turkey_Istanbul:France_Paris,
names_to = "location",
values_to = "values"
) %>%
separate_wider_delim(
location,
delim = "_",
names = c("country", "city")
) %>%
pivot_wider(
names_from = type,
values_from = values
)Now it is your turn!
Tidy Data with tidyr