The release of dplyr v0.8.0 brought some new group functionality - described in
the release notes.
Since then, Romain François has reworked the group operations in pull request #4251, and things have changed a bit.
The group operations that have been modified since 0.8.0, and are ready for release whenever v0.8.1 comes out, are:
- group_split() - split a data.frame by the grouping columns into a list
- group_data() - a data.frame of just the grouping variables and the corresponding row indices for the members of each group in the original data.frame
- group_keys() - identical to group_data() except lacking the .rows (row indices) data
- group_map() - map a function over each group and return a list
- group_modify() - map a function over each group and return a single data.frame
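To make the difference between these helpers concrete, here is a minimal sketch on a hypothetical toy data.frame (the names df, g and v are made up for illustration and are not part of the post's test data). It assumes a dplyr version that has these functions (0.8.1 or later).

```r
library(dplyr)

# Hypothetical toy data, only for illustration
df <- tibble(g = c("x", "x", "y"), v = c(1, 2, 10))
grouped <- group_by(df, g)

# group_data(): the grouping columns plus a `.rows` list-column of row indices
gd <- group_data(grouped)
names(gd)

# group_keys(): the grouping columns only
gk <- group_keys(grouped)
names(gk)

# group_modify(): apply a function to each group and get one data.frame back
totals <- group_modify(grouped, ~ tibble(total = sum(.x$v)))
totals
```

group_map() is the list-returning counterpart of group_modify(), and is what the rest of the post focuses on.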
This is a short post to convince myself I understand what group_map() does.
group_map() diagram
At its heart, group_map() is doing the equivalent of a purrr::map2().
The two lists that it is mapping over are:
- the list of data.frames split by the grouping variables
- the list of all combinations of the grouping variables
The following diagram hopefully illustrates how I think of group_map()
In the rest of this post I’ll break down the process step-by-step.
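As a reminder of what map2() does on its own, here is a minimal sketch with two hypothetical plain lists standing in for "the split data" and "the group keys". The names pieces and keys are made up for this illustration.

```r
library(purrr)

# Hypothetical stand-ins: one element per group in each list
pieces <- list(c(1, 2), c(3, 4, 5))
keys <- list("a", "b")

# map2() walks both lists in parallel: the i-th element of `pieces` is
# passed as .x and the i-th element of `keys` is passed as .y
result <- map2(pieces, keys, function(.x, .y) paste0(.y, ": ", length(.x)))
result
```

The two lists must line up element by element, which is why the split data and the group keys need to be in the same group order.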
Test Data
A very simple data.frame to use as the test data. Note: There is an empty factor
level in the type variable!
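library(dplyr)  # tibble(), group_by(), group_split(), group_keys(), group_map()
library(purrr)  # pmap(), map2()
library(glue)   # glue()

test_df <- tibble(
  type  = factor(c('a', 'a', 'b', 'b', 'b'), levels = c('a', 'b', 'c')),
  value = c(1, 2, 3, 4, 5)
)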
test_df
# A tibble: 5 x 2
type value
<fct> <dbl>
1 a 1
2 a 2
3 b 3
4 b 4
5 b 5
group_split()
group_split() splits the data into a list of data.frames
- one data.frame for each combination of grouping variables (as specified by group_by())
- use .drop = FALSE in the call to group_by() if you want to keep empty factor levels
- because keep = FALSE was specified in group_split(), the grouping column is not kept in the split data.frames
data_list <- test_df %>%
  group_by(type, .drop = FALSE) %>%
  group_split(keep = FALSE)
data_list
[[1]]
# A tibble: 2 x 1
value
<dbl>
1 1
2 2
[[2]]
# A tibble: 3 x 1
value
<dbl>
1 3
2 4
3 5
[[3]]
# A tibble: 0 x 1
# … with 1 variable: value <dbl>
attr(,"ptype")
# A tibble: 0 x 1
# … with 1 variable: value <dbl>
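To see the effect of .drop on its own, here is a small sketch that counts the groups with and without it. It rebuilds the same test data so that it runs standalone; n_default and n_keep_empty are names chosen just for this illustration.

```r
library(dplyr)

test_df <- tibble(
  type  = factor(c("a", "a", "b", "b", "b"), levels = c("a", "b", "c")),
  value = c(1, 2, 3, 4, 5)
)

# Default grouping drops the empty factor level 'c'
n_default <- test_df %>% group_by(type) %>% n_groups()

# .drop = FALSE keeps a (zero-row) group for the empty level
n_keep_empty <- test_df %>% group_by(type, .drop = FALSE) %>% n_groups()

c(default = n_default, keep_empty = n_keep_empty)
```

This is why the split above has a third, zero-row data.frame.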
group_keys()
group_keys() returns a data.frame with just the grouping columns - keeping only one row
for each combination of grouping variables.
Transpose the result to get a list of values - one list entry for each row.
Note: I used to use a transpose() here, but because it drops factor levels,
I’ve changed this to a pmap(list) instead which preserves factor levels. (thanks to jennybryan)
groups_list <- test_df %>%
  group_by(type, .drop = FALSE) %>%
  group_keys() %>%
  pmap(list)
groups_list
[[1]]
[[1]]$type
[1] a
Levels: a b c
[[2]]
[[2]]$type
[1] b
Levels: a b c
[[3]]
[[3]]$type
[1] c
Levels: a b c
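A quick way to confirm that pmap(list) kept the factor levels is to inspect one of the entries directly. The sketch below rebuilds the keys so that it runs standalone; the variable names keys and third are just for illustration.

```r
library(dplyr)
library(purrr)

test_df <- tibble(
  type  = factor(c("a", "a", "b", "b", "b"), levels = c("a", "b", "c")),
  value = c(1, 2, 3, 4, 5)
)

# One row per group, including the empty level because of .drop = FALSE
keys <- test_df %>%
  group_by(type, .drop = FALSE) %>%
  group_keys()

# One list entry per row of `keys`, each holding a named `type` element
groups_list <- pmap(keys, list)

third <- groups_list[[3]]$type
is.factor(third)     # still a factor
levels(third)        # all three levels are retained
as.character(third)  # the value for the empty group
```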
Calling map2 over the split data.frames and the group information for each split data.frame
my_func <- function(.x, .y) {
  glue("Group '{.y$type}' has {nrow(.x)} rows")
}
map2(data_list, groups_list, my_func)
[[1]]
Group 'a' has 2 rows
[[2]]
Group 'b' has 3 rows
[[3]]
Group 'c' has 0 rows
group_map() is a shortcut for all the above!
group_map() does all this for you - at a high level you can think of it as doing the following:
- splits the data.frame by the grouping variables
- creates a list of all combinations of grouping variables
- calls map2() over these two lists
test_df %>%
  group_by(type, .drop = FALSE) %>%
  group_map(my_func)
[[1]]
Group 'a' has 2 rows
[[2]]
Group 'b' has 3 rows
[[3]]
Group 'c' has 0 rows
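As a sanity check on this mental model, the sketch below runs both routes and compares the results. It is only a check of the "split, keys, map2" picture described in this post, not a claim about how group_map() is implemented internally. As far as I understand, group_map() passes .y as a one-row tibble rather than a plain list, which works the same way for .y$type. The grouping column is dropped here with select() rather than through group_split()'s keep argument, purely to keep the sketch independent of that argument.

```r
library(dplyr)
library(purrr)
library(glue)

test_df <- tibble(
  type  = factor(c("a", "a", "b", "b", "b"), levels = c("a", "b", "c")),
  value = c(1, 2, 3, 4, 5)
)

my_func <- function(.x, .y) {
  glue("Group '{.y$type}' has {nrow(.x)} rows")
}

grouped <- group_by(test_df, type, .drop = FALSE)

# Manual route: split the data, build the list of keys, then map2()
data_list <- grouped %>% group_split() %>% map(~ select(.x, -type))
groups_list <- grouped %>% group_keys() %>% pmap(list)
manual <- map2(data_list, groups_list, my_func)

# Shortcut route
shortcut <- group_map(grouped, my_func)

# Compare as plain character vectors
manual_chr <- vapply(manual, as.character, character(1))
shortcut_chr <- vapply(shortcut, as.character, character(1))
identical(manual_chr, shortcut_chr)
```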