The release of dplyr v0.8.0 brought some new group functionality - described in the release notes.
Since then, Romain François has worked on some changes to the group operations in this pull request #4251, and things have changed a bit.
The group operations that have been modified since 0.8.0, and are ready for release whenever v0.8.1 comes out, are:
- group_split() - split a data.frame by the grouping columns into a list
- group_data() - a data.frame of just the grouping variables and the corresponding row indices for the members of the group in the original data.frame
- group_keys() - identical to group_data() except lacking the .rows (row indices) data
- group_map() - map a function over each group and return a list
- group_modify() - map a function over each group and return a single data.frame
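To make the difference between group_data() and group_keys() concrete, here is a minimal sketch. The data.frame `df` and its columns `g` and `v` are made up for illustration and are not part of the original post.

```r
library(dplyr)

# Hypothetical example data: two groups of unequal size
df <- tibble(
  g = c("x", "x", "y"),
  v = c(10, 20, 30)
)

grouped <- group_by(df, g)

# group_data(): one row per group, plus a list-column `.rows`
# holding the row indices of each group's members in `df`
gd <- group_data(grouped)

# group_keys(): the same grouping columns, but without `.rows`
gk <- group_keys(grouped)

names(gd)      # "g" ".rows"
names(gk)      # "g"
gd$.rows[[1]]  # 1 2 (the rows of `df` belonging to group "x")
```

So group_keys() is what you reach for when you only care about which groups exist, and group_data() when you also need to know where their rows live.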
This is a short post to convince myself I understand what group_map() does.
group_map() diagram
At its heart, group_map() is doing the equivalent of a purrr::map2().
The two lists that it is mapping over are:
- the list of data.frames split by the grouping variables
- the list of all combinations of the grouping variables
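As a reminder of what map2() itself does, here is a toy sketch with two hand-written lists standing in for the split data and the group keys. The names `data_pieces` and `key_pieces` are placeholders I made up; they are not anything dplyr creates.

```r
library(purrr)

# Hypothetical stand-ins for the two lists
data_pieces <- list(c(1, 2), c(3, 4, 5))  # "split data", one element per group
key_pieces  <- list("a", "b")             # "group keys", one element per group

# map2() walks both lists in parallel and calls the function once per
# position, i.e. f(data_pieces[[i]], key_pieces[[i]])
result <- map2(data_pieces, key_pieces, function(.x, .y) {
  paste0("Group '", .y, "' has ", length(.x), " values")
})

result
```

The key point is that the two lists must line up: element i of the data list has to belong to element i of the key list.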
The following diagram hopefully illustrates how I think of group_map().
In the rest of this post I’ll break down the process step-by-step.
Test Data
A very simple data.frame to use as the test data. Note: There is an empty factor level in the type variable!
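library(dplyr)  # tibble(), group_by() and the group_*() functions
library(purrr)  # pmap(), map2()
library(glue)   # glue()

test_df <- tibble(
  type  = factor(c('a', 'a', 'b', 'b', 'b'), levels = c('a', 'b', 'c')),
  value = c(1, 2, 3, 4, 5)
)
test_df

# A tibble: 5 x 2
  type  value
  <fct> <dbl>
1 a         1
2 a         2
3 b         3
4 b         4
5 b         5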
group_split()
group_split() splits the data into a list of data.frames:
- one data.frame for each combination of grouping variables (as specified by group_by())
- use .drop = FALSE in the call to group_by() if you want to keep empty factor levels
- because keep = FALSE was specified in group_split(), the grouping column is not kept in the split data.frames
data_list <- test_df %>%
  group_by(type, .drop = FALSE) %>%
  group_split(keep = FALSE)
data_list

[[1]]
# A tibble: 2 x 1
  value
  <dbl>
1     1
2     2

[[2]]
# A tibble: 3 x 1
  value
  <dbl>
1     3
2     4
3     5

[[3]]
# A tibble: 0 x 1
# … with 1 variable: value <dbl>

attr(,"ptype")
# A tibble: 0 x 1
# … with 1 variable: value <dbl>
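To check my understanding of .drop, here is a small sketch that counts the pieces produced with and without it. It rebuilds the same test data so it can be run on its own. The variable names `n_default` and `n_keep_empty` are just mine.

```r
library(dplyr)

test_df <- tibble(
  type  = factor(c("a", "a", "b", "b", "b"), levels = c("a", "b", "c")),
  value = c(1, 2, 3, 4, 5)
)

# Default grouping: the empty level "c" produces no group
n_default <- test_df %>%
  group_by(type) %>%
  group_split() %>%
  length()

# .drop = FALSE: the empty level "c" gets its own zero-row data.frame
n_keep_empty <- test_df %>%
  group_by(type, .drop = FALSE) %>%
  group_split() %>%
  length()

c(n_default, n_keep_empty)  # 2 3
```

This matters later on: the list of split data.frames and the list of group keys only line up if both are built from the same group_by() call with the same .drop setting.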
group_keys()
group_keys() returns a data.frame with just the grouping columns - keeping only one row for each combination of grouping variables.
Transpose the result to get a list of values - one list entry for each row.
Note: I used to use a transpose() here, but because it drops factor levels, I’ve changed this to a pmap(list) instead, which preserves factor levels. (thanks to jennybryan)
groups_list <- test_df %>%
  group_by(type, .drop = FALSE) %>%
  group_keys() %>%
  pmap(list)
groups_list

[[1]]
[[1]]$type
[1] a
Levels: a b c

[[2]]
[[2]]$type
[1] b
Levels: a b c

[[3]]
[[3]]$type
[1] c
Levels: a b c
Calling map2 over the split data.frames and the group information for each split data.frame
my_func <- function(.x, .y) {
  glue("Group '{.y$type}' has {nrow(.x)} rows")
}
map2(data_list, groups_list, my_func)

[[1]]
Group 'a' has 2 rows

[[2]]
Group 'b' has 3 rows

[[3]]
Group 'c' has 0 rows
group_map() is a shortcut for all the above!
group_map() does all this for you - at a high level you can think of it as doing the following:
- splits the data.frame by the grouping variables
- creates a list of all combinations of grouping variables
- calls map2()
test_df %>%
  group_by(type, .drop = FALSE) %>%
  group_map(my_func)

[[1]]
Group 'a' has 2 rows

[[2]]
Group 'b' has 3 rows

[[3]]
Group 'c' has 0 rows
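For contrast, here is a rough sketch of the sibling function group_modify(), which (as I understand it) expects the function to return a data.frame for each group and combines the pieces into a single data.frame rather than a list. The column name `n_rows` is my own choice, and the test data is rebuilt so the snippet runs on its own.

```r
library(dplyr)

test_df <- tibble(
  type  = factor(c("a", "a", "b", "b", "b"), levels = c("a", "b", "c")),
  value = c(1, 2, 3, 4, 5)
)

# The function returns a one-row data.frame per group; group_modify()
# row-binds the results and adds the grouping column back in
counts <- test_df %>%
  group_by(type, .drop = FALSE) %>%
  group_modify(~ tibble(n_rows = nrow(.x)))

counts
```

The result should have one row per group, including the empty "c" level, so group_modify() is the one to use when you want to stay in data.frame land and group_map() when a plain list is fine.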