group_split/map/modify() in dplyr 0.8.1 (still in development)

The release of dplyr v0.8.0 brought some new group functionality - described in the release notes.

Since then, Romain François has reworked the group operations in pull request #4251, and things have changed a bit.

The group operations that have been modified since 0.8.0, and are ready for release whenever v0.8.1 comes out, are:

  • group_split() - split a data.frame by the grouping columns into a list
  • group_data() - a data.frame of just the grouping variables, plus the row indices in the original data.frame of the members of each group
  • group_keys() - identical to group_data() but without the .rows (row indices) column
  • group_map() - map a function over each group and return a list
  • group_modify() - map a function over each group and return a single data.frame

This is a short post to try and figure out what they now do!

tl;dr!

You can safely skip the rest of this post, as long as you read the two points and the table below.

  • The addition of the keep argument to group_modify(), group_map() and group_split() lets you be explicit about whether the subset data.frames passed as .x include or exclude the grouping columns.
  • The addition of the .drop argument to group_by() lets you control whether or not empty groups are kept in the process.

function       | equivalent to (approx)                                    | returns                        | default ‘keep’
---------------|-----------------------------------------------------------|--------------------------------|----------------
group_split()  | base::split()                                             | list of split data.frames      | TRUE
group_data()   | distinct(grouping_cols)                                   | data.frame of just group cols  | Not applicable
group_keys()   | group_data() without the .rows                            | data.frame of just group cols  | Not applicable
group_map()    | map2(.x = group_split(), .y = group_keys(), .fun)         | list of whatever you want      | FALSE
group_modify() | map2_dfr(.x = group_split(), .y = group_keys(), .fun)     | single data.frame              | FALSE

Test Data

A very simple data.frame to use as the test data. Note: There is an empty factor level for the type variable!

library(dplyr)  # provides tibble(), %>%, group_by() and the group_*() functions

test_df <- tibble(
  type  = factor(c('a', 'a', 'b', 'b'), levels = c('a', 'b', 'c')),
  value = c(1, 2, 3, 4)
)

test_df
# A tibble: 4 x 2
  type  value
  <fct> <dbl>
1 a         1
2 a         2
3 b         3
4 b         4

group_split()

group_split() breaks up a data.frame into a list of subset data.frames by the grouping variable. It is the tidyverse equivalent of split().

test_df %>%
  group_by(type) %>%
  group_split()
[[1]]
# A tibble: 2 x 2
  type  value
  <fct> <dbl>
1 a         1
2 a         2

[[2]]
# A tibble: 2 x 2
  type  value
  <fct> <dbl>
1 b         3
2 b         4

attr(,"ptype")
# A tibble: 0 x 2
# … with 2 variables: type <fct>, value <dbl>
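
To make the comparison with split() concrete, here is a small sketch that runs both on the test data. It only relies on base::split() and the default behaviour of group_split(); the variable names (base_pieces, base_dropped, dplyr_pieces) are just for illustration.

```r
library(dplyr)

test_df <- tibble(
  type  = factor(c("a", "a", "b", "b"), levels = c("a", "b", "c")),
  value = c(1, 2, 3, 4)
)

# base::split() names each element after its factor level and, by default,
# keeps the unused level "c" as a zero-row element
base_pieces <- split(test_df, test_df$type)
names(base_pieces)

# drop = TRUE discards the unused level, leaving only "a" and "b"
base_dropped <- split(test_df, test_df$type, drop = TRUE)
names(base_dropped)

# group_split() on a default group_by() returns one element per non-empty group
dplyr_pieces <- test_df %>%
  group_by(type) %>%
  group_split()
length(dplyr_pieces)
```

So the two are only approximately equivalent: split() keeps empty factor levels unless told otherwise, while the grouped version leaves them out unless .drop = FALSE is set in group_by().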

group_split() - keep empty groups

test_df %>%
  group_by(type, .drop = FALSE) %>%
  group_split()
[[1]]
# A tibble: 2 x 2
  type  value
  <fct> <dbl>
1 a         1
2 a         2

[[2]]
# A tibble: 2 x 2
  type  value
  <fct> <dbl>
1 b         3
2 b         4

[[3]]
# A tibble: 0 x 2
# … with 2 variables: type <fct>, value <dbl>

attr(,"ptype")
# A tibble: 0 x 2
# … with 2 variables: type <fct>, value <dbl>
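
The effect of .drop is not limited to group_split(). As a rough illustration (the names counts_default and counts_kept are made up for this sketch), counting rows per group shows the empty level appearing with a count of zero once .drop = FALSE is set:

```r
library(dplyr)

test_df <- tibble(
  type  = factor(c("a", "a", "b", "b"), levels = c("a", "b", "c")),
  value = c(1, 2, 3, 4)
)

# Default: the unused level "c" does not form a group
counts_default <- test_df %>%
  group_by(type) %>%
  summarise(n = n())

# .drop = FALSE: "c" becomes an empty group with zero rows
counts_kept <- test_df %>%
  group_by(type, .drop = FALSE) %>%
  summarise(n = n())

counts_default
counts_kept
```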

group_data()

group_data() returns a data.frame with one row per distinct combination of the grouping variables. It also includes a .rows list-column holding the indices of the rows in the original data.frame that belong to each group.

test_df %>%
  group_by(type) %>%
  group_data()
# A tibble: 2 x 2
  type  .rows    
  <fct> <list>   
1 a     <int [2]>
2 b     <int [2]>
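
The printed output hides what is inside .rows, so here is a small sketch of pulling the indices out and using them to recover a group by hand. The names gd, rows_b and group_b are only for this example.

```r
library(dplyr)

test_df <- tibble(
  type  = factor(c("a", "a", "b", "b"), levels = c("a", "b", "c")),
  value = c(1, 2, 3, 4)
)

gd <- test_df %>%
  group_by(type) %>%
  group_data()

# .rows is a list-column: one integer vector of row positions per group
rows_b <- gd$.rows[[2]]
rows_b

# Indexing the original data with those positions gives back group "b"
group_b <- test_df[rows_b, ]
group_b
```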

group_keys()

group_keys() is basically group_data() without the .rows column.

test_df %>%
  group_by(type) %>%
  group_keys()
# A tibble: 2 x 1
  type 
  <fct>
1 a    
2 b    
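
A quick way to convince yourself of this relationship is to drop .rows from group_data() by hand and compare. This is only a sanity-check sketch; keys and from_data are illustrative names.

```r
library(dplyr)

test_df <- tibble(
  type  = factor(c("a", "a", "b", "b"), levels = c("a", "b", "c")),
  value = c(1, 2, 3, 4)
)

grouped <- group_by(test_df, type)

keys <- group_keys(grouped)

# Remove the .rows column from group_data() using plain base R indexing
gd        <- group_data(grouped)
from_data <- gd[setdiff(names(gd), ".rows")]

names(keys)
names(from_data)
```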

group_map()

group_map() applies a function to each subset of a data.frame split by the grouping variable. It returns a list.

group_map() is the approximate equivalent of purrr::map2(.x = group_split(.data), .y = transpose(group_keys(.data)), .fun).

test_df %>%
  group_by(type) %>%
  group_map(~.x)
[[1]]
# A tibble: 2 x 1
  value
  <dbl>
1     1
2     2

[[2]]
# A tibble: 2 x 1
  value
  <dbl>
1     3
2     4
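
To see the "approximate equivalent" in action, the sketch below builds the map2() version by hand and compares it with group_map(). It assumes purrr is installed, and it passes each key as a one-row data.frame rather than using transpose(), which is a choice made for this example. The function only touches .x$value, so it does not matter that group_split() keeps the grouping column by default while group_map() does not.

```r
library(dplyr)
library(purrr)

test_df <- tibble(
  type  = factor(c("a", "a", "b", "b"), levels = c("a", "b", "c")),
  value = c(1, 2, 3, 4)
)

grouped <- group_by(test_df, type)

# One one-row data.frame of keys per group, to play the role of .y
keys     <- group_keys(grouped)
key_rows <- map(seq_len(nrow(keys)), ~ keys[.x, , drop = FALSE])

# Hand-rolled version: walk over the split data and the keys in parallel
manual <- map2(group_split(grouped), key_rows, ~ sum(.x$value))

# The same computation with group_map()
direct <- group_map(grouped, ~ sum(.x$value))

unlist(manual)
unlist(direct)
```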

group_modify()

group_modify() applies a function to each subset of a data.frame split by the grouping variable.

The applied function must return a data.frame without the grouping columns.

group_modify() is the approximate equivalent of purrr::map2_dfr(.x = group_split(.data), .y = transpose(group_keys(.data)), .fun).

test_df %>%
  group_by(type) %>%
  group_modify(~tibble(new = 1))
# A tibble: 2 x 2
# Groups:   type [2]
  type    new
  <fct> <dbl>
1 a         1
2 b         1
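
A slightly more useful sketch than a constant column: compute a per-group summary from .x and let group_modify() add the grouping column back. The column names mean_value and n_rows are invented for this example.

```r
library(dplyr)

test_df <- tibble(
  type  = factor(c("a", "a", "b", "b"), levels = c("a", "b", "c")),
  value = c(1, 2, 3, 4)
)

# The function returns a one-row data.frame per group, without the grouping column
per_group <- test_df %>%
  group_by(type) %>%
  group_modify(~ tibble(
    mean_value = mean(.x$value),
    n_rows     = nrow(.x)
  ))

per_group
```

The result is still grouped by type, with the per-group rows stacked into a single data.frame.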

group_modify() - trying to access a grouping column in .x when keep = FALSE

In the following code we get a warning (not an error), and newtype comes out as just "1". This is because:

  • the grouping variable is type
  • the default is keep = FALSE, which means that .x does not contain the grouping column, so .x$type is NULL

test_df %>%
  group_by(type) %>%
  group_modify(~tibble(
    total   = sum(.x$value),
    newtype = paste0(.x$type, "1"))
  )
Warning: Unknown or uninitialised column: 'type'.

Warning: Unknown or uninitialised column: 'type'.
# A tibble: 2 x 3
# Groups:   type [2]
  type  total newtype
  <fct> <dbl> <chr>  
1 a         3 1      
2 b         7 1      
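
The "1" in newtype is plain R behaviour rather than anything specific to group_modify(). A minimal sketch of the two steps involved (missing_col and pasted are illustrative names):

```r
library(dplyr)

# Asking a tibble for a column it does not have returns NULL (with a warning)
missing_col <- suppressWarnings(tibble(value = 1)$type)
is.null(missing_col)

# paste0() ignores the zero-length NULL, so only the "1" is left
pasted <- paste0(missing_col, "1")
pasted
```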

group_modify() - accessing a grouping column in .y always works

The grouping variable is always available from .y, regardless of the keep setting.

test_df %>%
  group_by(type) %>%
  group_modify(~tibble(
    total   = cumsum(.x$value),
    newtype = paste0(.y$type, "1"))
  )
# A tibble: 4 x 3
# Groups:   type [2]
  type  total newtype
  <fct> <dbl> <chr>  
1 a         1 a1     
2 a         3 a1     
3 b         3 b1     
4 b         7 b1     

group_modify() - accessing a grouping column in .x when keep = TRUE

By setting keep = TRUE, we ensure that .x includes the grouping column.

test_df %>%
  group_by(type) %>%
  group_modify(~tibble(
    total   = cumsum(.x$value),
    newtype = paste0(.x$type, "1")
  ), keep = TRUE)
# A tibble: 4 x 3
# Groups:   type [2]
  type  total newtype
  <fct> <dbl> <chr>  
1 a         1 a1     
2 a         3 a1     
3 b         3 b1     
4 b         7 b1     

Summary

  • The addition of the keep argument to group_modify(), group_map() and group_split() lets you be explicit about whether the subset data.frames passed as .x include or exclude the grouping columns.
  • The addition of the .drop argument to group_by() lets you control whether or not empty groups are kept in the process.

function       | equivalent to (approx)                                    | returns                        | default ‘keep’
---------------|-----------------------------------------------------------|--------------------------------|----------------
group_split()  | base::split()                                             | list of split data.frames      | TRUE
group_data()   | distinct(grouping_cols)                                   | data.frame of just group cols  | Not applicable
group_keys()   | group_data() without the .rows                            | data.frame of just group cols  | Not applicable
group_map()    | map2(.x = group_split(), .y = group_keys(), .fun)         | list of whatever you want      | FALSE
group_modify() | map2_dfr(.x = group_split(), .y = group_keys(), .fun)     | single data.frame              | FALSE