The release of dplyr v0.8.0 brought some new group functionality, described in
the release notes.
Since then, Romain François has worked on some changes to the group operations in pull request #4251, and things have changed a bit.
The group operations that have been modified since v0.8.0, and are ready for release whenever v0.8.1 comes out, are:
- group_split() - split a data.frame by the grouping columns into a list
- group_data() - a data.frame of just the grouping variables and the corresponding row indices for the members of each group in the original data.frame
- group_keys() - identical to group_data() except that it lacks the .rows (row indices) data
- group_map() - map a function over each group and return a list
- group_modify() - map a function over each group and return a single data.frame
This is a short post to try and figure out what they now do!
tl;dr!
You can safely skip the rest of this post, except for this table.
- The addition of the keep argument to group_modify()/group_map()/group_split() lets you be explicit about whether the .x list of subset data.frames includes or excludes the grouping columns.
- The addition of the .drop argument to group_by() lets you control whether or not empty groups are kept in the process.
| function | equivalent to (approx) | returns | default keep |
|---|---|---|---|
| group_split() | base::split() | list of split data.frames | TRUE |
| group_data() | distinct(grouping_cols) | data.frame of group cols plus a .rows list-column | Not applicable |
| group_keys() | group_data() without the .rows | data.frame of just group cols | Not applicable |
| group_map() | map2(.x = group_split(), .y = group_keys(), .fun) | list of whatever you want | FALSE |
| group_modify() | map2_dfr(.x = group_split(), .y = group_keys(), .fun) | single data.frame | FALSE |
Test Data
A very simple data.frame to use as the test data. Note: the type variable has an
empty factor level (c)!
library(dplyr)

test_df <- tibble(
  type = factor(c('a', 'a', 'b', 'b'), levels = c('a', 'b', 'c')),
  value = c(1, 2, 3, 4)
)
test_df
# A tibble: 4 x 2
type value
<fct> <dbl>
1 a 1
2 a 2
3 b 3
4 b 4
group_split()
group_split() breaks up a data.frame into a list of subset data.frames by
the grouping variable. It is the tidyverse equivalent of split().
test_df %>%
  group_by(type) %>%
  group_split()
[[1]]
# A tibble: 2 x 2
type value
<fct> <dbl>
1 a 1
2 a 2
[[2]]
# A tibble: 2 x 2
type value
<fct> <dbl>
1 b 3
2 b 4
attr(,"ptype")
# A tibble: 0 x 2
# … with 2 variables: type <fct>, value <dbl>
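To make the comparison with split() concrete, here is a minimal sketch that runs both on the same test data. The names base_pieces and dplyr_pieces are only illustrative, and the comments assume the default arguments of both functions.

```r
library(dplyr)

test_df <- tibble(
  type  = factor(c("a", "a", "b", "b"), levels = c("a", "b", "c")),
  value = c(1, 2, 3, 4)
)

# base::split() names the list elements after the factor levels and,
# by default, keeps the unused level "c" as a zero-row element
base_pieces <- split(test_df, test_df$type)
names(base_pieces)

# group_split() returns an unnamed list; with the default group_by()
# only the groups that occur in the data are present
dplyr_pieces <- test_df %>%
  group_by(type) %>%
  group_split()
length(dplyr_pieces)

# The rows for group "a" hold the same values either way
identical(base_pieces$a$value, dplyr_pieces[[1]]$value)
```

Because group_split() drops the names, the group labels have to come from group_keys() if you need them alongside the pieces.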
group_split() - keep empty groups
test_df %>%
  group_by(type, .drop = FALSE) %>%
  group_split()
[[1]]
# A tibble: 2 x 2
type value
<fct> <dbl>
1 a 1
2 a 2
[[2]]
# A tibble: 2 x 2
type value
<fct> <dbl>
1 b 3
2 b 4
[[3]]
# A tibble: 0 x 2
# … with 2 variables: type <fct>, value <dbl>
attr(,"ptype")
# A tibble: 0 x 2
# … with 2 variables: type <fct>, value <dbl>
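A quick way to check the effect of .drop without printing every piece is to count the groups. This is a small sketch using n_groups() and group_keys() on the same test data; the variable names are illustrative only.

```r
library(dplyr)

test_df <- tibble(
  type  = factor(c("a", "a", "b", "b"), levels = c("a", "b", "c")),
  value = c(1, 2, 3, 4)
)

# Default grouping: the unused factor level "c" does not form a group
n_default <- n_groups(group_by(test_df, type))

# .drop = FALSE: "c" is kept as an empty group
n_kept <- n_groups(group_by(test_df, type, .drop = FALSE))

c(default = n_default, keep_empty = n_kept)

# The keys now include a row for the empty level as well
keys_kept <- group_keys(group_by(test_df, type, .drop = FALSE))
keys_kept
```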
group_data()
group_data() returns a data.frame with the distinct values of the grouping variables.
It also includes a .rows column, a list-column of the row indices in the original data.frame
that belong to each group.
test_df %>%
  group_by(type) %>%
  group_data()
# A tibble: 2 x 2
type .rows
<fct> <list>
1 a <int [2]>
2 b <int [2]>
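The .rows list-column is easier to understand once you use it. The sketch below pulls out the indices for the second group and uses them to subset the original data.frame; gd and b_rows are illustrative names.

```r
library(dplyr)

test_df <- tibble(
  type  = factor(c("a", "a", "b", "b"), levels = c("a", "b", "c")),
  value = c(1, 2, 3, 4)
)

gd <- test_df %>%
  group_by(type) %>%
  group_data()

# Row positions (in test_df) of the members of the second group, "b"
b_index <- gd$.rows[[2]]
b_index

# Using those positions recovers the rows of that group
b_rows <- test_df[b_index, ]
b_rows
```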
group_keys()
group_keys() is basically group_data() without the .rows column.
test_df %>%
  group_by(type) %>%
  group_keys()
# A tibble: 2 x 1
type
<fct>
1 a
2 b
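As a rough check of that relationship, the following sketch drops .rows from group_data() by hand and compares the result with group_keys(). The names keys and data_without_rows are made up for this example.

```r
library(dplyr)

test_df <- tibble(
  type  = factor(c("a", "a", "b", "b"), levels = c("a", "b", "c")),
  value = c(1, 2, 3, 4)
)

grouped <- group_by(test_df, type)

keys <- group_keys(grouped)

# Remove the .rows list-column from group_data() manually
data_without_rows <- select(group_data(grouped), -.rows)

# Same columns and same key values in both
same_names  <- identical(names(keys), names(data_without_rows))
same_values <- identical(keys$type, data_without_rows$type)
c(same_names, same_values)
```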
group_map()
group_map() applies a function to each subset of a data.frame split by the
grouping variable. It returns a list.
group_map() is the approximate equivalent of purrr::map2(.x = group_split(.data), .y = transpose(group_keys(.data)), .fun).
test_df %>%
  group_by(type) %>%
  group_map(~.x)
[[1]]
# A tibble: 2 x 1
value
<dbl>
1 1
2 2
[[2]]
# A tibble: 2 x 1
value
<dbl>
1 3
2 4
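Since group_map() returns a list, the function does not have to return a data.frame. Here is a small sketch, assuming dplyr v0.8.1 or later (where group_map() returns a list), that builds one string per group from the data in .x and the key in .y. The name labels is illustrative.

```r
library(dplyr)

test_df <- tibble(
  type  = factor(c("a", "a", "b", "b"), levels = c("a", "b", "c")),
  value = c(1, 2, 3, 4)
)

# .x is the subset of rows for the group, .y is a one-row tibble of keys
labels <- test_df %>%
  group_by(type) %>%
  group_map(~ paste0("group ", .y$type, ": mean = ", mean(.x$value)))

labels
```

Each element of labels is a length-one character vector, so the result can be flattened with unlist() if a plain vector is more convenient.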
group_modify()
group_modify() applies a function to each subset of a data.frame split by the
grouping variable.
The applied function must return a data.frame without the grouping columns.
group_modify() is the approximate equivalent of purrr::map2_dfr(.x = group_split(.data), .y = transpose(group_keys(.data)), .fun).
test_df %>%
  group_by(type) %>%
  group_modify(~tibble(new = 1))
# A tibble: 2 x 2
# Groups: type [2]
type new
<fct> <dbl>
1 a 1
2 b 1
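A slightly more realistic sketch of the same idea: the function returns a one-row data.frame of per-group statistics, with no grouping column in it, and group_modify() attaches the keys. The column names n and mean_value are just examples.

```r
library(dplyr)

test_df <- tibble(
  type  = factor(c("a", "a", "b", "b"), levels = c("a", "b", "c")),
  value = c(1, 2, 3, 4)
)

# The function only returns the new columns; the grouping column
# "type" is added back by group_modify()
stats <- test_df %>%
  group_by(type) %>%
  group_modify(~ tibble(n = nrow(.x), mean_value = mean(.x$value)))

stats
```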
group_modify() - trying to access a grouping column in .x when keep = FALSE
In the following code, we should get a warning. This is because:
- the grouping variable is type
- the default keep is set to FALSE, which means that .x will not contain the grouping column
test_df %>%
  group_by(type) %>%
  group_modify(~tibble(
    total = sum(.x$value),
    newtype = paste0(.x$type, "1")
  ))
Warning: Unknown or uninitialised column: 'type'.
Warning: Unknown or uninitialised column: 'type'.
# A tibble: 2 x 3
# Groups: type [2]
type total newtype
<fct> <dbl> <chr>
1 a 3 1
2 b 7 1
group_modify() - accessing a grouping column in .y always works
The grouping variable is always available from .y, regardless of the keep setting.
test_df %>%
  group_by(type) %>%
  group_modify(~tibble(
    total = cumsum(.x$value),
    newtype = paste0(.y$type, "1")
  ))
# A tibble: 4 x 3
# Groups: type [2]
type total newtype
<fct> <dbl> <chr>
1 a 1 a1
2 a 3 a1
3 b 3 b1
4 b 7 b1
group_modify() - accessing a grouping column in .x when keep = TRUE
By setting keep = TRUE, we ensure that .x includes the grouping column.
test_df %>%
  group_by(type) %>%
  group_modify(~tibble(
    total = cumsum(.x$value),
    newtype = paste0(.x$type, "1")
  ), keep = TRUE)
# A tibble: 4 x 3
# Groups: type [2]
type total newtype
<fct> <dbl> <chr>
1 a 1 a1
2 a 3 a1
3 b 3 b1
4 b 7 b1
Summary
- The addition of the keep argument to group_modify()/group_map()/group_split() lets you be explicit about whether the .x list of subset data.frames includes or excludes the grouping columns.
- The addition of the .drop argument to group_by() lets you control whether or not empty groups are kept in the process.
| function | equivalent to (approx) | returns | default keep |
|---|---|---|---|
| group_split() | base::split() | list of split data.frames | TRUE |
| group_data() | distinct(grouping_cols) | data.frame of group cols plus a .rows list-column | Not applicable |
| group_keys() | group_data() without the .rows | data.frame of just group cols | Not applicable |
| group_map() | map2(.x = group_split(), .y = group_keys(), .fun) | list of whatever you want | FALSE |
| group_modify() | map2_dfr(.x = group_split(), .y = group_keys(), .fun) | single data.frame | FALSE |