The release of dplyr v0.8.0 brought some new group functionality, described in the release notes.
Since then, Romain François has reworked the group operations in pull request #4251, and things have changed a bit.
The group operations that have been modified since 0.8.0 and are ready for release whenever v0.8.1 comes out are:
- group_split() - split a data.frame by the grouping columns into a list
- group_data() - a data.frame of just the grouping variables and the corresponding row indices for the members of each group in the original data.frame
- group_keys() - identical to group_data() except lacking the .rows (row indices) data
- group_map() - map a function over each group and return a list
- group_modify() - map a function over each group and return a single data.frame
This is a short post to try and figure out what they now do!
tl;dr!
You can safely not-read the rest of this post except for this table.
- The addition of the keep argument to group_modify/map/split lets you be explicit about whether the .x list of subset data.frames includes/excludes the grouping columns.
- The addition of the .drop argument to group_by() lets you control whether or not empty groups are kept in the process.

function | equivalent to (approx) | returns | default ‘keep’ |
---|---|---|---|
group_split() | base::split() | list of split data.frames | TRUE |
group_data() | distinct(grouping_cols) | data.frame of just group cols | Not applicable |
group_keys() | group_data() without the .rows | data.frame of just group cols | Not applicable |
group_map() | map2(.x = group_split(), .y = group_keys(), .fun) | list of whatever you want | FALSE |
group_modify() | map2_dfr(.x = group_split(), .y = group_keys(), .fun) | single data.frame | FALSE |
Test Data
A very simple data.frame to use as the test data. Note: There is an empty factor level for the type variable!
library(dplyr)

test_df <- tibble(
  type = factor(c('a', 'a', 'b', 'b'), levels = c('a', 'b', 'c')),
  value = c(1, 2, 3, 4)
)
test_df
# A tibble: 4 x 2
type value
<fct> <dbl>
1 a 1
2 a 2
3 b 3
4 b 4
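To see the empty level directly, here is a quick check. It is a minimal sketch that uses only base R's levels() and table() on the test data, comparing the declared levels with the values actually present:

```r
library(dplyr)

test_df <- tibble(
  type  = factor(c("a", "a", "b", "b"), levels = c("a", "b", "c")),
  value = c(1, 2, 3, 4)
)

# The declared levels include "c" ...
declared_levels <- levels(test_df$type)

# ... but no row uses it, so its count is zero
level_counts <- table(test_df$type)

print(declared_levels)
print(level_counts)
```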
group_split()
group_split() breaks up a data.frame into a list of subset data.frames by the grouping variable. It is the tidyverse equivalent of split().
test_df %>%
group_by(type) %>%
group_split()
[[1]]
# A tibble: 2 x 2
type value
<fct> <dbl>
1 a 1
2 a 2
[[2]]
# A tibble: 2 x 2
type value
<fct> <dbl>
1 b 3
2 b 4
attr(,"ptype")
# A tibble: 0 x 2
# … with 2 variables: type <fct>, value <dbl>
group_split() - keep empty groups
Setting .drop = FALSE in group_by() keeps the empty level c as a zero-row group.
test_df %>%
group_by(type, .drop = FALSE) %>%
group_split()
[[1]]
# A tibble: 2 x 2
type value
<fct> <dbl>
1 a 1
2 a 2
[[2]]
# A tibble: 2 x 2
type value
<fct> <dbl>
1 b 3
2 b 4
[[3]]
# A tibble: 0 x 2
# … with 2 variables: type <fct>, value <dbl>
attr(,"ptype")
# A tibble: 0 x 2
# … with 2 variables: type <fct>, value <dbl>
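Since group_split() is described as the tidyverse equivalent of split(), a side-by-side comparison may help. The sketch below rebuilds the test data and shows two differences worth knowing. Base split() names the list elements and keeps the empty level by default, while group_split() returns an unnamed list and follows the .drop setting of group_by(). The variable names (base_pieces, dropped, kept) are only for illustration.

```r
library(dplyr)

test_df <- tibble(
  type  = factor(c("a", "a", "b", "b"), levels = c("a", "b", "c")),
  value = c(1, 2, 3, 4)
)

# base::split(): a named list with one element per factor level, empty or not
base_pieces <- split(test_df, test_df$type)

# group_split() with the default .drop = TRUE: the empty level is skipped
dropped <- test_df %>%
  group_by(type) %>%
  group_split()

# group_split() after group_by(.drop = FALSE): the empty level becomes a 0-row piece
kept <- test_df %>%
  group_by(type, .drop = FALSE) %>%
  group_split()

print(names(base_pieces))
print(length(dropped))
print(length(kept))
```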
group_data()
group_data() returns a data.frame with just the distinct grouping variables.
It includes a .rows column, a list-column of the row indices in the original data.frame that belong to each group.
test_df %>%
group_by(type) %>%
group_data()
# A tibble: 2 x 2
type .rows
<fct> <list>
1 a <int [2]>
2 b <int [2]>
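The .rows column can be used to pull a group's rows back out of the original data.frame. A minimal sketch, assuming the same test data (the names grouping, rows_b and group_b are only for illustration):

```r
library(dplyr)

test_df <- tibble(
  type  = factor(c("a", "a", "b", "b"), levels = c("a", "b", "c")),
  value = c(1, 2, 3, 4)
)

grouping <- test_df %>%
  group_by(type) %>%
  group_data()

# .rows is a list-column: one integer vector of row positions per group
rows_b <- grouping$.rows[[2]]

# Indexing the original data with those positions recovers the rows of group "b"
group_b <- test_df[rows_b, ]

print(rows_b)
print(group_b)
```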
group_keys()
group_keys() is basically group_data() without the .rows column.
test_df %>%
group_by(type) %>%
group_keys()
# A tibble: 2 x 1
type
<fct>
1 a
2 b
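One way to check this relationship is to compare the column names of the two results: the only difference should be .rows. A small sketch on the test data:

```r
library(dplyr)

test_df <- tibble(
  type  = factor(c("a", "a", "b", "b"), levels = c("a", "b", "c")),
  value = c(1, 2, 3, 4)
)

grouped <- group_by(test_df, type)

data_cols <- names(group_data(grouped))  # grouping columns plus .rows
key_cols  <- names(group_keys(grouped))  # grouping columns only

# Columns present in group_data() but missing from group_keys()
only_in_data <- setdiff(data_cols, key_cols)
print(only_in_data)
```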
group_map()
group_map() applies a function to each subset of a data.frame split by the grouping variable. It returns a list.
group_map() is the approximate equivalent of purrr::map2(.x = group_split(.data), .y = transpose(group_keys(.data)), .fun).
test_df %>%
group_by(type) %>%
group_map(~.x)
[[1]]
# A tibble: 2 x 1
value
<dbl>
1 1
2 2
[[2]]
# A tibble: 2 x 1
value
<dbl>
1 3
2 4
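Because group_map() returns a plain list, the function does not have to return a data.frame. The sketch below (an illustrative example, not taken from the release notes) returns one string per group, taking the group label from .y and the values from .x:

```r
library(dplyr)

test_df <- tibble(
  type  = factor(c("a", "a", "b", "b"), levels = c("a", "b", "c")),
  value = c(1, 2, 3, 4)
)

# .x holds the rows of the current group, .y is a one-row tibble with its key
totals <- test_df %>%
  group_by(type) %>%
  group_map(~ paste(.y$type, sum(.x$value)))

print(totals)
```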
group_modify()
group_modify() applies a function to each subset of a data.frame split by the grouping variable.
The applied function must return a data.frame without the grouping columns.
group_modify() is the approximate equivalent of purrr::map2_dfr(.x = group_split(.data), .y = transpose(group_keys(.data)), .fun).
test_df %>%
group_by(type) %>%
group_modify(~tibble(new = 1))
# A tibble: 2 x 2
# Groups: type [2]
type new
<fct> <dbl>
1 a 1
2 b 1
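The function passed to group_modify() can also return a subset of the rows it was given, rather than a single new row. As a sketch on the same test data, keeping only the first row of each group:

```r
library(dplyr)

test_df <- tibble(
  type  = factor(c("a", "a", "b", "b"), levels = c("a", "b", "c")),
  value = c(1, 2, 3, 4)
)

# head(.x, 1) returns a one-row data.frame without the grouping column;
# group_modify() adds the type column back and keeps the result grouped
first_rows <- test_df %>%
  group_by(type) %>%
  group_modify(~ head(.x, 1))

print(first_rows)
```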
group_modify() - trying to access a grouping column in .x when keep = FALSE
In the following code, we should get a warning/error. This is because:
- the grouping variable is type
- the default keep is set to FALSE, which means that .x will not contain the grouping column
test_df %>%
group_by(type) %>%
group_modify(~tibble(
total = sum(.x$value),
newtype = paste0(.x$type, "1"))
)
Warning: Unknown or uninitialised column: 'type'.
Warning: Unknown or uninitialised column: 'type'.
# A tibble: 2 x 3
# Groups: type [2]
type total newtype
<fct> <dbl> <chr>
1 a 3 1
2 b 7 1
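Note that the result is a warning rather than an error, and newtype silently becomes "1" for every group. With the default setting, the .x passed to the function has no type column, so .x$type is NULL, and pasting NULL onto "1" gives just "1". A short sketch to confirm both halves of that explanation:

```r
library(dplyr)

test_df <- tibble(
  type  = factor(c("a", "a", "b", "b"), levels = c("a", "b", "c")),
  value = c(1, 2, 3, 4)
)

# Column names of .x for each group under the default setting:
# the grouping column is not among them
x_cols <- test_df %>%
  group_by(type) %>%
  group_map(~ names(.x))

# A zero-length argument contributes nothing to paste0()
pasted <- paste0(NULL, "1")

print(x_cols)
print(pasted)
```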
group_modify() - accessing a grouping column in .y always works
The grouping variable is always available from .y, regardless of the keep setting.
test_df %>%
group_by(type) %>%
group_modify(~tibble(
total = cumsum(.x$value),
newtype = paste0(.y$type, "1"))
)
# A tibble: 4 x 3
# Groups: type [2]
type total newtype
<fct> <dbl> <chr>
1 a 1 a1
2 a 3 a1
3 b 3 b1
4 b 7 b1
group_modify() - accessing a grouping column in .x when keep = TRUE
By setting keep = TRUE, we ensure that .x includes the grouping column.
test_df %>%
group_by(type) %>%
group_modify(~tibble(
total = cumsum(.x$value),
newtype = paste0(.x$type, "1")
), keep = TRUE)
# A tibble: 4 x 3
# Groups: type [2]
type total newtype
<fct> <dbl> <chr>
1 a 1 a1
2 a 3 a1
3 b 3 b1
4 b 7 b1
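The keep switch works the same way for group_split(), whose default is the opposite (TRUE). The sketch below drops the grouping column from each piece. One assumption to flag: as far as I know, dplyr releases from 1.0.0 onwards spell this argument .keep, so the code looks up which name the installed version uses rather than hard-coding it. The names keep_name, call_args and pieces are only for illustration.

```r
library(dplyr)

test_df <- tibble(
  type  = factor(c("a", "a", "b", "b"), levels = c("a", "b", "c")),
  value = c(1, 2, 3, 4)
)

grouped <- group_by(test_df, type)

# Assumption: the argument is named keep in dplyr 0.8.1 and .keep in later releases
keep_name <- if (".keep" %in% names(formals(group_split))) ".keep" else "keep"

# Build and run the call group_split(grouped, <keep_name> = FALSE)
call_args <- c(list(grouped), setNames(list(FALSE), keep_name))
pieces <- do.call(group_split, call_args)

# Each piece now contains only the non-grouping column
print(lapply(pieces, names))
```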
Summary
- The addition of the keep argument to group_modify/map/split lets you be explicit about whether the .x list of subset data.frames includes/excludes the grouping columns.
- The addition of the .drop argument to group_by() lets you control whether or not empty groups are kept in the process.

function | equivalent to (approx) | returns | default ‘keep’ |
---|---|---|---|
group_split() | base::split() | list of split data.frames | TRUE |
group_data() | distinct(grouping_cols) | data.frame of just group cols | Not applicable |
group_keys() | group_data() without the .rows | data.frame of just group cols | Not applicable |
group_map() | map2(.x = group_split(), .y = group_keys(), .fun) | list of whatever you want | FALSE |
group_modify() | map2_dfr(.x = group_split(), .y = group_keys(), .fun) | single data.frame | FALSE |