group_split/map/modify() in dplyr 0.8.1 (still in development)

The release of dplyr v0.8.0 brought some new group functionality - described in the release notes.

Since then, Romain François has reworked the group operations in pull request #4251, and things have changed a bit.

The group operations that have changed since 0.8.0, and are ready for release whenever v0.8.1 comes out, are:

group_split() - split a data.frame by the grouping columns into a list
group_data() - a data.frame of the distinct grouping variables plus a .rows column holding, for each group, the row indices of its members in the original data.frame
group_keys() - identical to group_data() except that it lacks the .rows (row indices) column
group_map() - map a function over each group and return a list
group_modify() - map a function over each group and return a single data.frame

This is a short post to try and figure out what they now do!

tl;dr!

You can safely skip the rest of this post, as long as you read this table.

The keep argument added to group_modify/map/split lets you be explicit about whether the subset data.frames (the .x passed to your function) include or exclude the grouping columns.
The .drop argument added to group_by() lets you control whether or not empty groups are kept in the process.

function	equivalent to (approx)	returns	default ‘keep’
group_split()	base::split()	list of split data.frames	TRUE
group_data()	distinct(grouping_cols) plus `.rows`	data.frame of group cols and row indices	Not applicable
group_keys()	group_data() without the `.rows`	data.frame of just group cols	Not applicable
group_map()	map2(.x = group_split(), .y = group_keys(), .fun)	list of whatever you want	FALSE
group_modify()	map2_dfr(.x = group_split(), .y = group_keys(), .fun)	single data.frame	FALSE

Test Data

A very simple data.frame to use as the test data. Note: the type variable has an unused factor level ('c'), so it produces an empty group!

library(dplyr)

test_df <- tibble(
  type  = factor(c('a', 'a', 'b', 'b'), levels = c('a', 'b', 'c')),
  value = c(1, 2, 3, 4)
)

test_df

# A tibble: 4 x 2
  type  value
  <fct> <dbl>
1 a         1
2 a         2
3 b         3
4 b         4

`group_split()`

group_split() breaks up a data.frame into a list of subset data.frames by the grouping variable. It is the tidyverse equivalent of split().

test_df %>%
  group_by(type) %>%
  group_split()

[[1]]
# A tibble: 2 x 2
  type  value
  <fct> <dbl>
1 a         1
2 a         2

[[2]]
# A tibble: 2 x 2
  type  value
  <fct> <dbl>
1 b         3
2 b         4

attr(,"ptype")
# A tibble: 0 x 2
# … with 2 variables: type <fct>, value <dbl>
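To make the comparison with split() concrete, here is a minimal sketch that runs both on the test data. It assumes a dplyr version where group_split() is available; the variable names (base_pieces, base_dropped, dplyr_pieces) are only for illustration.

```r
library(dplyr)

# Same test data as in the post: level 'c' has no rows.
test_df <- tibble(
  type  = factor(c('a', 'a', 'b', 'b'), levels = c('a', 'b', 'c')),
  value = c(1, 2, 3, 4)
)

# base::split() returns a named list and keeps the unused level by default.
base_pieces <- split(test_df, test_df$type)
names(base_pieces)

# drop = TRUE discards the unused level.
base_dropped <- split(test_df, test_df$type, drop = TRUE)
names(base_dropped)

# group_split() returns an unnamed list with one element per group.
dplyr_pieces <- test_df %>%
  group_by(type) %>%
  group_split()
names(dplyr_pieces)
length(dplyr_pieces)
```

The main practical differences are the missing names on the group_split() result and the opposite default for empty factor levels.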

`group_split()` - keep empty groups

test_df %>%
  group_by(type, .drop = FALSE) %>%
  group_split()

[[1]]
# A tibble: 2 x 2
  type  value
  <fct> <dbl>
1 a         1
2 a         2

[[2]]
# A tibble: 2 x 2
  type  value
  <fct> <dbl>
1 b         3
2 b         4

[[3]]
# A tibble: 0 x 2
# … with 2 variables: type <fct>, value <dbl>

attr(,"ptype")
# A tibble: 0 x 2
# … with 2 variables: type <fct>, value <dbl>
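A quick way to check the effect of .drop without printing every piece is to count the groups and their sizes. This is a small sketch on the same test data; g_drop, g_keep and sizes are illustrative names.

```r
library(dplyr)

test_df <- tibble(
  type  = factor(c('a', 'a', 'b', 'b'), levels = c('a', 'b', 'c')),
  value = c(1, 2, 3, 4)
)

g_drop <- group_by(test_df, type)                 # default: empty groups dropped
g_keep <- group_by(test_df, type, .drop = FALSE)  # empty groups kept

# Number of groups under each setting.
n_groups(g_drop)
n_groups(g_keep)

# Row count of each piece; the empty 'c' group shows up as a 0-row tibble.
sizes <- vapply(group_split(g_keep), nrow, integer(1))
sizes
```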

`group_data()`

group_data() returns a data.frame with just the distinct grouping variables, plus a .rows list-column holding the row indices in the original data.frame that belong to each group.

test_df %>%
  group_by(type) %>%
  group_data()

# A tibble: 2 x 2
  type  .rows    
  <fct> <list>   
1 a     <int [2]>
2 b     <int [2]>
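The .rows list-column can be used to pull a group's rows back out of the original data. A minimal sketch, with gd and rows_b as illustrative names:

```r
library(dplyr)

test_df <- tibble(
  type  = factor(c('a', 'a', 'b', 'b'), levels = c('a', 'b', 'c')),
  value = c(1, 2, 3, 4)
)

gd <- test_df %>%
  group_by(type) %>%
  group_data()

# Each element of .rows is an integer vector of row positions.
rows_b <- gd$.rows[[2]]
rows_b

# Index the original data.frame with those positions to recover group 'b'.
test_df[rows_b, ]
```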

`group_keys()`

group_keys() is basically group_data() without the .rows column.

test_df %>%
  group_by(type) %>%
  group_keys()

# A tibble: 2 x 1
  type 
  <fct>
1 a    
2 b
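The relationship between group_keys(), group_data() and distinct() can be checked directly. A short sketch on the test data (g, keys and data_cols are illustrative names):

```r
library(dplyr)

test_df <- tibble(
  type  = factor(c('a', 'a', 'b', 'b'), levels = c('a', 'b', 'c')),
  value = c(1, 2, 3, 4)
)

g <- group_by(test_df, type)

keys      <- group_keys(g)
data_cols <- group_data(g)

# The only column group_data() has that group_keys() lacks is .rows.
setdiff(names(data_cols), names(keys))

# For this data, the keys match the distinct values of the grouping column.
as.character(keys$type)
as.character(distinct(test_df, type)$type)
```

Note that distinct() keeps values in order of first appearance, so the two only line up row for row when the data is already sorted by the grouping column, as it is here.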

`group_map()`

group_map() applies a function to each subset of a data.frame split by the grouping variable. It returns a list.

group_map() is the approximate equivalent of purrr::map2(.x = group_split(.data), .y = transpose(group_keys(.data)), .fun).

test_df %>%
  group_by(type) %>%
  group_map(~.x)

[[1]]
# A tibble: 2 x 1
  value
  <dbl>
1     1
2     2

[[2]]
# A tibble: 2 x 1
  value
  <dbl>
1     3
2     4
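Since ~.x just returns each subset unchanged, a function that actually computes something may make the behaviour clearer. The sketch below also emulates group_map() by hand; it uses base lapply() in place of purrr::map2() to avoid an extra dependency, so it illustrates the idea rather than dplyr's actual implementation. The names sums, labels, pieces, keys and manual are illustrative.

```r
library(dplyr)

test_df <- tibble(
  type  = factor(c('a', 'a', 'b', 'b'), levels = c('a', 'b', 'c')),
  value = c(1, 2, 3, 4)
)

g <- group_by(test_df, type)

# .x is the subset for one group, .y is the one-row key for that group.
sums   <- group_map(g, ~ sum(.x$value))
labels <- group_map(g, ~ paste(.y$type, nrow(.x)))

# Hand-rolled approximation: split, drop the grouping columns, apply.
pieces <- group_split(g)
keys   <- group_keys(g)
manual <- lapply(seq_along(pieces), function(i) {
  .x <- pieces[[i]][setdiff(names(pieces[[i]]), group_vars(g))]
  .y <- keys[i, ]
  paste(.y$type, nrow(.x))
})

unlist(sums)
unlist(labels)
unlist(manual)
```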

`group_modify()`

group_modify() applies a function to each subset of a data.frame split by the grouping variable.

The applied function must return a data.frame without the grouping columns.

group_modify() is the approximate equivalent of purrr::map2_dfr(.x = group_split(.data), .y = transpose(group_keys(.data)), .fun).

test_df %>%
  group_by(type) %>%
  group_modify(~tibble(new = 1))

# A tibble: 2 x 2
# Groups:   type [2]
  type    new
  <fct> <dbl>
1 a         1
2 b         1
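The requirement that the function returns a data.frame is the key difference from group_map(). A minimal sketch of both cases follows; the exact error message may differ between dplyr versions, so the code only checks that an error is raised. The names ok and bad are illustrative.

```r
library(dplyr)

test_df <- tibble(
  type  = factor(c('a', 'a', 'b', 'b'), levels = c('a', 'b', 'c')),
  value = c(1, 2, 3, 4)
)

g <- group_by(test_df, type)

# Returning a data.frame works; the grouping column is added back.
ok <- group_modify(g, ~ tibble(total = sum(.x$value)))
ok

# Returning a bare number is rejected; capture the error instead of stopping.
bad <- tryCatch(
  group_modify(g, ~ sum(.x$value)),
  error = function(e) e
)
inherits(bad, "error")
```

If a per-group result is not naturally a data.frame, group_map() is the better fit.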

`group_modify()` - trying to access a grouping column in `.x` when `keep = FALSE`

The following code produces a warning for each group rather than the result we want. This is because:

the grouping variable is type
the default keep is FALSE, which means that .x will not contain the grouping column

test_df %>%
  group_by(type) %>%
  group_modify(~tibble(
    total   = sum(.x$value),
    newtype = paste0(.x$type, "1"))
  )

Warning: Unknown or uninitialised column: 'type'.

Warning: Unknown or uninitialised column: 'type'.

# A tibble: 2 x 3
# Groups:   type [2]
  type  total newtype
  <fct> <dbl> <chr>  
1 a         3 1      
2 b         7 1
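The newtype column ends up as "1" rather than "a1"/"b1" because looking up a missing column in a tibble yields NULL, and paste0() treats a zero-length argument as an empty string. A small sketch of that mechanism, using a hypothetical one-column tibble named no_type:

```r
library(dplyr)

# A tibble that, like .x with keep = FALSE, has no 'type' column.
no_type <- tibble(value = c(1, 2))

# Accessing the missing column warns and returns NULL.
missing_col <- suppressWarnings(no_type$type)
is.null(missing_col)

# paste0() with NULL as its first argument returns just the other piece.
pasted <- paste0(missing_col, "1")
pasted
```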

`group_modify()` - accessing a grouping column in `.y` always works

The grouping variable is always available from .y, regardless of the keep setting.

test_df %>%
  group_by(type) %>%
  group_modify(~tibble(
    total   = cumsum(.x$value),
    newtype = paste0(.y$type, "1"))
  )

# A tibble: 4 x 3
# Groups:   type [2]
  type  total newtype
  <fct> <dbl> <chr>  
1 a         1 a1     
2 a         3 a1     
3 b         3 b1     
4 b         7 b1
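It can help to look at what .y actually is: a one-row data.frame holding the key values for the current group. The sketch below inspects it through group_map(); key_dims and key_names are illustrative names.

```r
library(dplyr)

test_df <- tibble(
  type  = factor(c('a', 'a', 'b', 'b'), levels = c('a', 'b', 'c')),
  value = c(1, 2, 3, 4)
)

g <- group_by(test_df, type)

# Dimensions of .y for each group: one row, one column per grouping variable.
key_dims <- group_map(g, ~ dim(.y))
key_dims

# Column names of .y are the grouping variables.
key_names <- group_map(g, ~ names(.y))
key_names
```

Because .y$type has length one, tibble() recycles it against the longer cumsum() column, which is why every row in a group gets the same newtype value.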

`group_modify()` - accessing a grouping column in `.x` when `keep = TRUE`

By setting keep = TRUE, we ensure that .x includes the grouping column.

test_df %>%
  group_by(type) %>%
  group_modify(~tibble(
    total   = cumsum(.x$value),
    newtype = paste0(.x$type, "1")
  ), keep = TRUE)

# A tibble: 4 x 3
# Groups:   type [2]
  type  total newtype
  <fct> <dbl> <chr>  
1 a         1 a1     
2 a         3 a1     
3 b         3 b1     
4 b         7 b1
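One caveat for readers on a newer dplyr: from dplyr 1.0.0 onwards the argument is spelled .keep, and the keep spelling used in this post is deprecated. The sketch below assumes dplyr 1.0.0 or later and is otherwise the same call as above; kept is an illustrative name.

```r
library(dplyr)

test_df <- tibble(
  type  = factor(c('a', 'a', 'b', 'b'), levels = c('a', 'b', 'c')),
  value = c(1, 2, 3, 4)
)

# Assumes dplyr >= 1.0.0, where the argument is named .keep.
kept <- test_df %>%
  group_by(type) %>%
  group_modify(~ tibble(
    total   = cumsum(.x$value),
    newtype = paste0(.x$type, "1")
  ), .keep = TRUE)

kept
```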

Summary

The keep argument added to group_modify/map/split lets you be explicit about whether the subset data.frames (the .x passed to your function) include or exclude the grouping columns.
The .drop argument added to group_by() lets you control whether or not empty groups are kept in the process.

function	equivalent to (approx)	returns	default ‘keep’
group_split()	base::split()	list of split data.frames	TRUE
group_data()	distinct(grouping_cols) plus `.rows`	data.frame of group cols and row indices	Not applicable
group_keys()	group_data() without the `.rows`	data.frame of just group cols	Not applicable
group_map()	map2(.x = group_split(), .y = group_keys(), .fun)	list of whatever you want	FALSE
group_modify()	map2_dfr(.x = group_split(), .y = group_keys(), .fun)	single data.frame	FALSE
