Split-Apply-Combine: My search for a replacement for 'group_by + do'

Introduction

I currently process a lot of data one entity at a time, but my input is a single data.frame representing multiple entities.

I have specialist functions that do a lot of work on the data.frame for a single entity, so I want to split the original data.frame into multiple data.frames containing just one entity each and then process them one at a time.

This is a classic case of split-apply-combine, as outlined in Hadley Wickham’s JStatSoft paper (pdf) and Jenny Bryan’s Stat545 notes.

Currently I use dplyr’s group_by() followed by do() to achieve this.

But as of 2016, dplyr::do() is “basically deprecated” according to Hadley Wickham on Twitter:

@ijlyttle btw dplyr::do() is now basically deprecated in favour of the purrr approach
— Hadley Wickham (@hadleywickham) April 11, 2016

On a more recent community.rstudio.com thread, Hadley expanded:

do() is definitely going away in the long term, but I’m not yet sure we have comprehensive alternative solutions to all problems that do() solves.

(Also “going away” means that we won’t make improvements to it and we won’t mention it in documentation and tutorials, but the code will continue to exist for a number of years)

Given that the original tweet was from 2 years ago, and I’m still using group_by/do, it’s time I start searching for a usable “purrr approach” that suits my needs.

If there are tidyverse options I haven’t yet discovered, please let me know on twitter!

Split-Apply-Combine - Prehistoric times - split, lapply, do.call(rbind, …)
Split-Apply-Combine - Stone Age with plyr - plyr::ddply
Split-Apply-Combine - Early tidyverse era - group_by, do
Split-Apply-Combine - Early-mid tidyverse era group_by & by_slice
Split-Apply-Combine - Current era tidyverse: group_by, nest, mutate(map())
Split-Apply-Combine - tidyverse/base hybrid: split, map_dfr
Summary Table

The test data - `big_df`

My data is always in a single data.frame, with information for multiple entities contained within it. There are always one or more indexing variables to identify, group or split the data.

The test data used here (big_df) is just a small subset of the mtcars data set.

Table 1: `big_df` data.frame. This is just a cut-down version of **mtcars**, using **cyl** as an explicit ID column
ID	mpg	disp
6	21.0	160
6	18.1	225
8	10.4	460
8	18.7	360
8	14.3	360
8	15.0	301
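
The post does not show how big_df was created. As a minimal sketch, the six rows of Table 1 can simply be typed in by hand. The column types (plain numeric vectors) are an assumption here, since the real subset presumably came from mtcars itself.

```r
# Rebuild the test data by hand; the values are copied from Table 1.
# Assumption: all three columns are plain numeric vectors.
big_df <- data.frame(
  ID   = c(6, 6, 8, 8, 8, 8),
  mpg  = c(21.0, 18.1, 10.4, 18.7, 14.3, 15.0),
  disp = c(160, 225, 460, 360, 360, 301)
)

# Number of rows belonging to each entity
rows_per_id <- table(big_df$ID)
rows_per_id
```

Typing the values in keeps the example independent of how the original subset was selected.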

The complex function to run on the data.frame for each entity

This is a (dummy) function to run on the data.frame for each entity.

This function is usually quite complex and consists of multiple processing steps to produce a result.

I am also interested in whether or not this inner function has access to the ID of the entity i.e. the grouping variable. The dplyr::do() approach does have access to the grouping variable, but other methods may not.

complex_func <- function(df) {
  df$N           <- nrow(df)
  df$func_has_ID <- 'ID' %in% colnames(df)
  df
}
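
To see what the function returns before any grouping machinery is involved, it can be called directly on the rows of a single entity. This is only a sketch: it assumes big_df matches Table 1 and repeats both definitions so that it runs on its own.

```r
# Test data as in Table 1 (typed in by hand, an assumption of this sketch)
big_df <- data.frame(
  ID   = c(6, 6, 8, 8, 8, 8),
  mpg  = c(21.0, 18.1, 10.4, 18.7, 14.3, 15.0),
  disp = c(160, 225, 460, 360, 360, 301)
)

# The dummy per-entity function from the post
complex_func <- function(df) {
  df$N           <- nrow(df)
  df$func_has_ID <- 'ID' %in% colnames(df)
  df
}

# Manually pick out one entity (ID == 6) and process it
one_entity <- big_df[big_df$ID == 6, ]
result_one <- complex_func(one_entity)
result_one
```

Because the subset is taken by hand, the ID column is still present and func_has_ID comes back as TRUE for both rows.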

Split-Apply-Combine - Prehistoric times - `split`, `lapply`, `do.call(rbind, ...)`

In the dark ages before dplyr and pipes, the code looked like this.

split_df       <- split(big_df, big_df$ID)
result_list_df <- lapply(split_df, complex_func)
result_df      <- do.call(rbind, result_list_df)

Table 2: Prehistoric (pre-dplyr) with base R
ID	mpg	disp	N	func_has_ID
6	21.0	160	2	TRUE
6	18.1	225	2	TRUE
8	10.4	460	4	TRUE
8	18.7	360	4	TRUE
8	14.3	360	4	TRUE
8	15.0	301	4	TRUE

Notes

the data.frame passed to complex_func() contains the ID variable
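
One practical detail of the base R route, shown here as a small sketch rather than as part of the original workflow: do.call(rbind, ...) on a named list builds the row names of the result from the list names. If they are not wanted, they can be reset afterwards. The big_df below is typed in from Table 1, which is an assumption.

```r
# Test data as in Table 1 (typed in by hand, an assumption of this sketch)
big_df <- data.frame(
  ID   = c(6, 6, 8, 8, 8, 8),
  mpg  = c(21.0, 18.1, 10.4, 18.7, 14.3, 15.0),
  disp = c(160, 225, 460, 360, 360, 301)
)

complex_func <- function(df) {
  df$N           <- nrow(df)
  df$func_has_ID <- 'ID' %in% colnames(df)
  df
}

# Split, apply and combine with base R only
split_df  <- split(big_df, big_df$ID)
result_df <- do.call(rbind, lapply(split_df, complex_func))

# Row names now carry the names of the list elements ("6" and "8")
rownames(result_df)

# Reset to the default sequential row names
rownames(result_df) <- NULL
```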

Split-Apply-Combine - Stone Age with plyr - `plyr::ddply`

One plyr function call to do the split, apply and combine. plyr hasn’t been updated since 2016.

result_df <- plyr::ddply(big_df, "ID", complex_func)

Table 3: plyr!!
ID	mpg	disp	N	func_has_ID
6	21.0	160	2	TRUE
6	18.1	225	2	TRUE
8	10.4	460	4	TRUE
8	18.7	360	4	TRUE
8	14.3	360	4	TRUE
8	15.0	301	4	TRUE

Split-Apply-Combine - Early tidyverse era - `group_by`, `do`

In the early days of the tidyverse, the group_by/do approach was the way to go, and is the way I still write most of the code for split-apply-combine situations.

result_df <- big_df %>%
  group_by(ID) %>%
  do(complex_func(.)) %>%
  ungroup()

Table 4: Standard dplyr approach: group_by() then do()
ID	mpg	disp	N	func_has_ID
6	21.0	160	2	TRUE
6	18.1	225	2	TRUE
8	10.4	460	4	TRUE
8	18.7	360	4	TRUE
8	14.3	360	4	TRUE
8	15.0	301	4	TRUE

Notes

the data.frame passed to complex_func() contains the ID variable
An explicit ungroup() is required to remove the grouping from the result
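
The need for ungroup() can be checked directly. The sketch below assumes dplyr is installed and that big_df matches Table 1. It uses dplyr::group_vars() to inspect the grouping before and after the ungroup() step.

```r
library(dplyr)

# Test data as in Table 1 (typed in by hand, an assumption of this sketch)
big_df <- data.frame(
  ID   = c(6, 6, 8, 8, 8, 8),
  mpg  = c(21.0, 18.1, 10.4, 18.7, 14.3, 15.0),
  disp = c(160, 225, 460, 360, 360, 301)
)

complex_func <- function(df) {
  df$N           <- nrow(df)
  df$func_has_ID <- 'ID' %in% colnames(df)
  df
}

# Same pipeline as in the post, but stopping before ungroup()
grouped_result <- big_df %>%
  group_by(ID) %>%
  do(complex_func(.))

# The result of do() is still grouped by ID
group_vars(grouped_result)

# After ungroup() no grouping variables remain
group_vars(ungroup(grouped_result))
```

Leaving the grouping in place is easy to miss, because the printed columns look the same either way. Only later grouped operations behave differently.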

Split-Apply-Combine - Early-mid tidyverse era `group_by` & `by_slice`

For a brief moment in time, purrr had a by_slice() function which offered the same features as dplyr::do().

This function was then relegated to purrrlyr as it wasn’t quite purrr and it wasn’t quite dplyr.

According to the purrrlyr NEWS file, functions in this package are unlikely to be updated, so using them would probably be a mistake. This example is included for posterity.

result_df <- big_df %>%
  group_by(ID) %>%
  purrrlyr::by_slice(~complex_func(.x), .collate = 'rows')

Table 5: Results of using the soon-to-be-dead(?) purrrlyr **by_slice**
ID	mpg	disp	N	func_has_ID
6	21.0	160	2	FALSE
6	18.1	225	2	FALSE
8	10.4	460	4	FALSE
8	18.7	360	4	FALSE
8	14.3	360	4	FALSE
8	15.0	301	4	FALSE

Notes

the data.frame passed to complex_func() does not contain the ID variable
resulting data.frame does not have any grouping variables, and therefore no explicit ungroup() is required
The purrrlyr NEWS.md file does however offer the advice that instead of by_slice, the preferred method is a combination of tidyr::nest() and dplyr::mutate() using an inner purrr::map

Split-Apply-Combine - Current era tidyverse: `group_by`, `nest`, `mutate(map())`

The current suggested route in the tidyverse is to nest the data, and then operate on the list column by mutating it via purrr::map.

result_df <- big_df %>%
  group_by(ID) %>%
  nest() %>%
  mutate(data = purrr::map(data, complex_func)) %>%
  unnest(cols = c(data))

Table 6: Current accepted practice: group_by, nest, mutate(map())
ID	mpg	disp	N	func_has_ID
6	21.0	160	2	FALSE
6	18.1	225	2	FALSE
8	10.4	460	4	FALSE
8	18.7	360	4	FALSE
8	14.3	360	4	FALSE
8	15.0	301	4	FALSE

Notes

the data.frame passed to complex_func() does not contain the ID variable
an explicit unnest is required at the end to get the data back in its original form
having to both mutate and map seems an extra step over all other methods.
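
Why func_has_ID is FALSE here becomes clearer when looking at the nested structure itself. This is a small sketch, assuming dplyr and tidyr are installed and that big_df matches Table 1. It only inspects the column names and does not run complex_func.

```r
library(dplyr)
library(tidyr)

# Test data as in Table 1 (typed in by hand, an assumption of this sketch)
big_df <- data.frame(
  ID   = c(6, 6, 8, 8, 8, 8),
  mpg  = c(21.0, 18.1, 10.4, 18.7, 14.3, 15.0),
  disp = c(160, 225, 460, 360, 360, 301)
)

nested_df <- big_df %>%
  group_by(ID) %>%
  nest()

# Outer data frame: the grouping variable plus one list-column called data
outer_cols <- names(nested_df)
outer_cols

# Inner data frame for the first entity: this is all the mapped function sees
inner_cols <- names(nested_df$data[[1]])
inner_cols
```

The grouping variable lives only in the outer data frame, so anything mapped over the data list-column never sees it.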

Split-Apply-Combine - tidyverse/base hybrid: `split`, `map_dfr`

This hybrid approach uses split() to create a list of data.frames, and then uses map_dfr to map a function over each data.frame and then combine (by rows) into a single data.frame.

result_df <- big_df %>%
  split(.$ID) %>%
  purrr::map_dfr(complex_func)

Table 7: Hybrid base R/tidyverse approach
ID	mpg	disp	N	func_has_ID
6	21.0	160	2	TRUE
6	18.1	225	2	TRUE
8	10.4	460	4	TRUE
8	18.7	360	4	TRUE
8	14.3	360	4	TRUE
8	15.0	301	4	TRUE

Notes

This seems pretty compact - except for the very un-tidyverse split(.$ID)
the data.frame passed to complex_func() contains the ID variable
No explicit ungrouping required.

Split-Apply-Combine - Alternative universe `data.table`

Update: MattSummersgill and michael_chirico suggested using data.table.

This doesn’t really fit within the scope of my search (I’m definitely in the tidyverse ecosystem), but it’s included for the sake of comparison.

library(data.table)
setDT(big_df)
result_df <- big_df[, complex_func(.SD), by = .(this_ID = ID), .SDcols=colnames(big_df)]

Table 8: Alternate universe approach with `data.table`
this_ID	ID	mpg	disp	N	func_has_ID
6	6	21.0	160	2	TRUE
6	6	18.1	225	2	TRUE
8	8	10.4	460	4	TRUE
8	8	18.7	360	4	TRUE
8	8	14.3	360	4	TRUE
8	8	15.0	301	4	TRUE

Notes

the data.frame passed to complex_func() contains the ID variable
No explicit ungrouping required.
Without being immersed in day-to-day use of data.table this solution is a little opaque to me.
There are issues around how data.table handles the ID column. Calling it one way gave 2 ID columns in the result, so I’ve explicitly set a new grouping variable name to avoid this.

Summary Table

Below is a summary table on how the split-apply-combine is achieved for various implementations.

A blank entry means that the action of split, apply or combine is handled by the previous entry.

Method	Split	Apply	Combine	Group var available in applied function
Prehistoric	split	lapply	do.call(rbind)	Yes
Stone Age	plyr::ddply			Yes
Early Tidyverse	dplyr::group_by	dplyr::do	dplyr::ungroup	Yes
Early-Mid Tidyverse	dplyr::group_by	purrrlyr::by_slice		No
Current Era Tidyverse	dplyr::group_by + tidyr::nest	dplyr::mutate + purrr::map	tidyr::unnest	No
Hybrid	split	purrr::map_dfr		Yes
Alternate Universe	data.table			Yes

Conclusion

My aim is to find a replacement for group_by/do now that dplyr::do() is “basically deprecated”.

I looked at a number of tidyverse (dplyr/tidyr/purrr) replacements for do(). If you know of a technique I missed, please let me know on Twitter.

My least favourite technique is the current era tidyverse approach with nest/mutate/map/unnest:
- I find it too verbose, with a high cognitive overhead
- you have to nest, operate on a new data list-column, and then unnest
- the nested data.frames in that list-column don’t contain the grouping variable
- applying the function to the data.frame subsets requires a mutate AND a map

My favourite replacement is the tidyverse/base hybrid with split/map_dfr:
- short and sweet
- no need for purrr::map within a mutate
- no need for the dplyr::do() syntax with the . e.g. complex_func(.)

Outstanding issue: Why isn’t there a purrr version of the base function split? According to Hadley on GitHub:

A function that acts rowwise on a data frame doesn’t seem like it should live in purrr.

Next step: Make my own tidyverse split_by?
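
One possible shape for such a helper, purely as a sketch: split_by() below is a hypothetical name, not a function from purrr or dplyr. It only wraps base split() so that the grouping column can be given as a string inside a pipe. The example assumes purrr is installed (along with dplyr, which map_dfr() needs for row-binding) and that big_df matches Table 1.

```r
library(purrr)  # map_dfr() and the re-exported %>% pipe

# Test data as in Table 1 (typed in by hand, an assumption of this sketch)
big_df <- data.frame(
  ID   = c(6, 6, 8, 8, 8, 8),
  mpg  = c(21.0, 18.1, 10.4, 18.7, 14.3, 15.0),
  disp = c(160, 225, 460, 360, 360, 301)
)

complex_func <- function(df) {
  df$N           <- nrow(df)
  df$func_has_ID <- 'ID' %in% colnames(df)
  df
}

# Hypothetical helper (not part of any package): split a data.frame into a
# list of data.frames using the values of one column, named as a string.
split_by <- function(df, col) {
  stopifnot(is.data.frame(df), col %in% colnames(df))
  split(df, df[[col]])
}

result_df <- big_df %>%
  split_by("ID") %>%
  map_dfr(complex_func)
```

Since this is just split() underneath, each piece keeps its ID column. That is the property that made the hybrid approach attractive in the first place.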
