mikefc

The release of dplyr v0.8.0 brought some new group functionality - described in the release notes.

Since then, Romain François has worked on some changes to the group operations in pull request #4251, and things have changed a bit.

The group operations that have been modified since 0.8.0, and are ready for release whenever v0.8.1 comes out, are:

  • group_split() - split a data.frame by the grouping columns into a list
  • group_data() - a data.frame of the grouping variables plus a .rows column holding the row indices of each group's members in the original data.frame
  • group_keys() - identical to group_data() but without the .rows (row indices) column
  • group_map() - map a function over each group and return a list
  • group_modify() - map a function over each group and return a single data.frame
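
To make the difference between group_data() and group_keys() concrete, here is a minimal sketch. The toy data.frame and its column names (g, v) are made up for illustration and are not part of the original post.

```r
library(dplyr)

# Hypothetical toy data, used only for this illustration
toy <- tibble(
  g = c("x", "x", "y"),
  v = c(10, 20, 30)
)

grouped <- group_by(toy, g)

# group_data(): one row per group, plus a `.rows` list-column of row indices
gd <- group_data(grouped)
names(gd)
gd$.rows[[1]]   # row indices of the members of the first group ("x")

# group_keys(): the same grouping columns, without `.rows`
gk <- group_keys(grouped)
names(gk)
```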

This is a short post to convince myself I understand what group_map() does.

group_map() diagram

At its heart, group_map() is doing the equivalent of a purrr::map2().

The two lists that it is mapping over are:

  1. the list of data.frames split by the grouping variables
  2. the list of all combinations of the grouping variables
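
As a reminder of what map2() itself does, here is a small sketch with two plain lists standing in for the split data and the group keys. The names pieces and labels are hypothetical placeholders, not anything dplyr produces.

```r
library(purrr)

# Hypothetical stand-ins for the two lists
pieces <- list(c(1, 2), c(3, 4, 5))   # plays the role of the split data
labels <- list("a", "b")              # plays the role of the group keys

# map2() calls the function once per position,
# pairing pieces[[i]] with labels[[i]]
res <- map2(pieces, labels, function(.x, .y) paste0(.y, ": ", length(.x)))
res
```

The result is a list with one entry per position, which is the shape group_map() returns as well.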

The following diagram hopefully illustrates how I think of group_map()

In the rest of this post I’ll break down the process step-by-step.

Test Data

A very simple data.frame to use as the test data. Note: There is an empty factor level in the type variable!

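library(dplyr)  # tibble(), group_by(), group_split(), group_keys(), group_map()
library(purrr)  # pmap(), map2()
library(glue)   # glue()

test_df <- tibble(
  type  = factor(c('a', 'a', 'b', 'b', 'b'), levels = c('a', 'b', 'c')),
  value = c(1, 2, 3, 4, 5)
)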

test_df
# A tibble: 5 x 2
  type  value
  <fct> <dbl>
1 a         1
2 a         2
3 b         3
4 b         4
5 b         5

group_split()

group_split() splits the data into a list of data.frames

  • one data.frame for each combination of grouping variables (as specified by group_by())
  • use .drop = FALSE in the call to group_by() if you want to keep empty factor levels.
  • because keep = FALSE is passed to group_split() in the code below, the grouping column is not kept in the split data.frames

data_list <- test_df %>% 
  group_by(type, .drop = FALSE) %>%
  group_split(keep = FALSE)

data_list
[[1]]
# A tibble: 2 x 1
  value
  <dbl>
1     1
2     2

[[2]]
# A tibble: 3 x 1
  value
  <dbl>
1     3
2     4
3     5

[[3]]
# A tibble: 0 x 1
# … with 1 variable: value <dbl>
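
To see what .drop changes, the sketch below counts the groups with and without it. It rebuilds a small data.frame of its own (df is a hypothetical name) so it can be run in isolation.

```r
library(dplyr)

df <- tibble(
  type  = factor(c("a", "a", "b"), levels = c("a", "b", "c")),
  value = c(1, 2, 3)
)

# Default (.drop = TRUE): the unused level "c" produces no group
n_default <- df %>%
  group_by(type) %>%
  group_split() %>%
  length()

# .drop = FALSE: the empty level "c" gets its own zero-row data.frame
n_keep_empty <- df %>%
  group_by(type, .drop = FALSE) %>%
  group_split() %>%
  length()

c(n_default, n_keep_empty)
```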

group_keys()

group_keys() returns a data.frame with just the grouping columns - keeping only one row for each combination of grouping variables.

Transpose the result to get a list of values - one list entry for each row.

Note: I used to use purrr::transpose() here, but it drops factor levels, so I now use pmap(list), which preserves them. (thanks to jennybryan)

groups_list <- test_df %>%
  group_by(type, .drop = FALSE) %>%
  group_keys() %>%
  pmap(list)

groups_list
[[1]]
[[1]]$type
[1] a
Levels: a b c


[[2]]
[[2]]$type
[1] b
Levels: a b c


[[3]]
[[3]]$type
[1] c
Levels: a b c
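
A quick way to convince yourself that pmap(list) keeps the factor intact is to inspect one of the entries directly. This is only a sketch; the keys data.frame here is built by hand rather than taken from group_keys().

```r
library(dplyr)
library(purrr)

# Hand-built stand-in for the output of group_keys()
keys <- tibble(type = factor(c("a", "b", "c"), levels = c("a", "b", "c")))

# pmap(list) builds one list per row of `keys`
rows <- pmap(keys, list)

# Each `type` entry should still be a factor carrying all three levels
is.factor(rows[[3]]$type)
levels(rows[[3]]$type)
```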

Calling map2() over the split data.frames and the group information for each one

my_func <- function(.x, .y) {
  glue("Group '{.y$type}' has {nrow(.x)} rows")
}

map2(data_list, groups_list, my_func)
[[1]]
Group 'a' has 2 rows

[[2]]
Group 'b' has 3 rows

[[3]]
Group 'c' has 0 rows

group_map() is a shortcut for all the above!

group_map() does all this for you - at a high level you can think of it as doing the following:

  • splits the data.frame by the grouping variables
  • creates a list of all combinations of grouping variables
  • calls map2()

test_df %>%
  group_by(type, .drop = FALSE) %>%
  group_map(my_func)
[[1]]
Group 'a' has 2 rows

[[2]]
Group 'b' has 3 rows

[[3]]
Group 'c' has 0 rows
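
As a final sanity check, the manual route and the shortcut can be compared directly. This is a sketch under a few assumptions: describe() is a hypothetical helper that returns a plain character string (via sprintf() rather than glue()) so the results are easy to compare, and the manual split keeps the grouping column, which does not affect the row counts.

```r
library(dplyr)
library(purrr)

df <- tibble(
  type  = factor(c("a", "a", "b", "b", "b"), levels = c("a", "b", "c")),
  value = c(1, 2, 3, 4, 5)
)
grouped <- group_by(df, type, .drop = FALSE)

# Hypothetical helper: returns plain character so results compare easily
describe <- function(.x, .y) {
  sprintf("Group '%s' has %d rows", as.character(.y$type), nrow(.x))
}

# Manual route: split, list the keys, then map2()
pieces <- group_split(grouped)
keys   <- pmap(group_keys(grouped), list)
manual <- map2(pieces, keys, describe)

# Shortcut: group_map() does the splitting and pairing itself
shortcut <- group_map(grouped, describe)

# Compare the two results element by element
identical(unlist(manual), unlist(shortcut))
```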