Split-Apply-Combine: My search for a replacement for 'group_by + do'

Introduction

I currently process a lot of data one entity at a time, but my input is a single data.frame representing multiple entities.

I have specialist functions that do a lot of work on the data.frame for a single entity, so I want to split the original data.frame into multiple data.frames containing just one entity each and then process them one at a time.

This is a classic case of split-apply-combine, as outlined in Hadley Wickham’s JStatSoft paper (pdf) and Jenny Bryan’s Stat545 notes.

Currently I use dplyr's group_by then do to achieve this.

But as of 2016, dplyr::do() is “basically deprecated” according to Hadley Wickham on Twitter:

do() is definitely going away in the long term, but I’m not yet sure we have comprehensive alternative solutions to all problems that do() solves.

(Also “going away” means that we won’t make improvements to it and we won’t mention it in documentation and tutorials, but the code will continue to exist for a number of years)

Given that the original tweet was from 2 years ago, and I’m still using group_by/do, it’s time I start searching for a usable “purrr approach” that suits my needs.

If there are tidyverse options I haven’t yet discovered, please let me know on twitter!

The test data - big_df

My data is always in a single data.frame, with information for multiple entities contained within it. There are always one or more indexing variables to identify, group or split the data.

The test data used here (big_df) is just a small subset of the mtcars data set.

Table 1: big_df data.frame. This is just a cut-down version of mtcars, using cyl as an explicit ID column
ID mpg disp
6 21.0 160
6 18.1 225
8 10.4 460
8 18.7 360
8 14.3 360
8 15.0 301
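
For anyone wanting to follow along, here is a minimal sketch of how big_df could be rebuilt. It assumes nothing beyond the values printed in Table 1; the original was subset from mtcars, but the exact subsetting code isn't shown, so the values are simply typed in.

```r
# Assumed reconstruction of big_df from the values printed in Table 1
# (the original was derived from mtcars, with cyl renamed to ID).
big_df <- data.frame(
  ID   = c(6, 6, 8, 8, 8, 8),
  mpg  = c(21.0, 18.1, 10.4, 18.7, 14.3, 15.0),
  disp = c(160, 225, 460, 360, 360, 301)
)

# Number of rows per entity: 2 rows for ID 6, 4 rows for ID 8
table(big_df$ID)
```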

The complex function to run on the data.frame for each entity

This is a (dummy) function to run on the data.frame for each entity.

This function is usually quite complex and consists of multiple processing steps to produce a result.

I am also interested in whether or not this inner function has access to the ID of the entity i.e. the grouping variable. The dplyr::do() approach does have access to the grouping variable, but other methods may not.

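complex_func <- function(df) {
  # Record how many rows this entity has
  df$N <- nrow(df)
  # Record whether the grouping variable is visible inside the function
  df$func_has_ID <- 'ID' %in% colnames(df)
  df
}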
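
As a quick sanity check of what func_has_ID reports, the function can be called directly on a single entity, once with and once without the ID column. This is only a sketch: one_entity is a hypothetical hand-built input using the two ID == 6 rows from Table 1.

```r
complex_func <- function(df) {
  df$N <- nrow(df)
  df$func_has_ID <- "ID" %in% colnames(df)
  df
}

# Hypothetical single-entity input: the two ID == 6 rows from Table 1
one_entity <- data.frame(ID = c(6, 6), mpg = c(21.0, 18.1), disp = c(160, 225))

# Same rows, passed with and without the grouping column
with_id    <- complex_func(one_entity)
without_id <- complex_func(one_entity[, c("mpg", "disp")])

with_id$func_has_ID     # TRUE TRUE
without_id$func_has_ID  # FALSE FALSE
```

The second call mimics what happens in approaches that strip the grouping variable before calling the function.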

Split-Apply-Combine - Prehistoric times - split, lapply, do.call(rbind, ...)

In the dark ages before dplyr and pipes, the code looked like this.

split_df <- split(big_df, big_df$ID)
result_list_df <- lapply(split_df, complex_func)
result_df <- do.call(rbind, result_list_df)

Table 2: Prehistoric (pre-dplyr) with base R
ID mpg disp N func_has_ID
6 21.0 160 2 TRUE
6 18.1 225 2 TRUE
8 10.4 460 4 TRUE
8 18.7 360 4 TRUE
8 14.3 360 4 TRUE
8 15.0 301 4 TRUE

Notes

• the data.frame passed to complex_func() contains the ID variable

Split-Apply-Combine - Stone Age with plyr - plyr::ddply

One plyr function call to do the split, apply and combine. Hasn’t been updated since 2016.

result_df <- plyr::ddply(big_df, "ID", complex_func)

Table 3: plyr!!
ID mpg disp N func_has_ID
6 21.0 160 2 TRUE
6 18.1 225 2 TRUE
8 10.4 460 4 TRUE
8 18.7 360 4 TRUE
8 14.3 360 4 TRUE
8 15.0 301 4 TRUE

Split-Apply-Combine - Early tidyverse era - group_by, do

In the early days of the tidyverse, the group_by/do approach was the way to go, and it is still the way I write most of my code for split-apply-combine situations.

result_df <- big_df %>%
  group_by(ID) %>%
  do(complex_func(.)) %>%
  ungroup()

Table 4: Standard dplyr approach: group_by() then do()
ID mpg disp N func_has_ID
6 21.0 160 2 TRUE
6 18.1 225 2 TRUE
8 10.4 460 4 TRUE
8 18.7 360 4 TRUE
8 14.3 360 4 TRUE
8 15.0 301 4 TRUE

Notes

• the data.frame passed to complex_func() contains the ID variable
• Explicit ungroup() required to remove the grouping from the result

Split-Apply-Combine - Early-mid tidyverse era - group_by & by_slice

For a brief moment in time, purrr had a by_slice() function which offered the same features as dplyr::do(). This function was then relegated to purrrlyr as it wasn’t quite purrr and it wasn’t quite dplyr.

According to the purrrlyr NEWS file, functions in this package are unlikely to be updated, so using them would probably be a mistake. This example is included for posterity.

result_df <- big_df %>%
  group_by(ID) %>%
  purrrlyr::by_slice(~complex_func(.x), .collate = 'rows')

Table 5: Results of using the soon-to-be-dead(?) purrrlyr by_slice
ID mpg disp N func_has_ID
6 21.0 160 2 FALSE
6 18.1 225 2 FALSE
8 10.4 460 4 FALSE
8 18.7 360 4 FALSE
8 14.3 360 4 FALSE
8 15.0 301 4 FALSE

Notes

• the data.frame passed to complex_func() does not contain the ID variable
• resulting data.frame does not have any grouping variables, and therefore no explicit ungroup() is required
• The purrrlyr NEWS.md file does however offer the advice that instead of by_slice, the preferred method is a combination of tidyr::nest() and dplyr::mutate() using an inner purrr::map

Split-Apply-Combine - Current era tidyverse: group_by, nest, mutate(map())

The current suggested route in the tidyverse is to nest the data, and then operate on the list column by mutating it via purrr::map.

result_df <- big_df %>%
  group_by(ID) %>%
  nest() %>%
  mutate(data = purrr::map(data, complex_func)) %>%
  unnest()

## Warning: cols is now required.
## Please use cols = c(data)

Table 6: Current accepted practice: group_by, nest, mutate(map())
ID mpg disp N func_has_ID
6 21.0 160 2 FALSE
6 18.1 225 2 FALSE
8 10.4 460 4 FALSE
8 18.7 360 4 FALSE
8 14.3 360 4 FALSE
8 15.0 301 4 FALSE

Notes

• the data.frame passed to complex_func() does not contain the ID variable
• an explicit unnest is required at the end to get the data back in its original form
• having to both mutate and map seems an extra step compared with all the other methods.

Split-Apply-Combine - tidyverse/base hybrid: split, map_dfr

This hybrid approach uses split() to create a list of data.frames, and then uses map_dfr to map a function over each data.frame and combine the results (by rows) into a single data.frame.

result_df <- big_df %>%
  split(.$ID) %>%
  purrr::map_dfr(complex_func)

Table 7: Hybrid base R/tidyverse approach
ID mpg disp N func_has_ID
6 21.0 160 2 TRUE
6 18.1 225 2 TRUE
8 10.4 460 4 TRUE
8 18.7 360 4 TRUE
8 14.3 360 4 TRUE
8 15.0 301 4 TRUE

Notes

• This seems pretty compact - except for the very un-tidyverse split(.$ID)
• the data.frame passed to complex_func() contains the ID variable
• No explicit ungrouping required.

Split-Apply-Combine - Alternative universe data.table

Update: MattSummersgill and michael_chirico suggested using data.table.

This doesn’t really fit within the scope of my search (I’m definitely in the tidyverse ecosystem), but it’s included for the sake of comparison.

library(data.table)
setDT(big_df)
result_df <- big_df[, complex_func(.SD), by = .(this_ID = ID), .SDcols=colnames(big_df)]

Table 8: Alternate universe approach with data.table
this_ID ID mpg disp N func_has_ID
6 6 21.0 160 2 TRUE
6 6 18.1 225 2 TRUE
8 8 10.4 460 4 TRUE
8 8 18.7 360 4 TRUE
8 8 14.3 360 4 TRUE
8 8 15.0 301 4 TRUE

Notes

• the data.frame passed to complex_func() contains the ID variable
• No explicit ungrouping required.
• Without being immersed in day-to-day use of data.table this solution is a little opaque to me.
• There are issues around how data.table handles the ID column. Calling it one way gave 2 ID columns in the result, so I’ve explicitly set a new grouping variable name to avoid this.

Summary Table

Below is a summary table on how the split-apply-combine is achieved for various implementations.

A blank entry means that the action of split, apply or combine is handled by the previous entry.

Method | Split | Apply | Combine | Group var available in applied function
Prehistoric | split | lapply | do.call(rbind) | Yes
Stone Age | plyr::ddply | | | Yes
Early Tidyverse | dplyr::group_by | dplyr::do | dplyr::ungroup | Yes
Early-Mid Tidyverse | dplyr::group_by | purrrlyr::by_slice | | No
Current Era Tidyverse | dplyr::group_by + tidyr::nest | dplyr::mutate + purrr::map | tidyr::unnest | No
Hybrid | split | purrr::map_dfr | | Yes
Alternate Universe | data.table | | | Yes

Conclusion

My aim is to find a replacement for group_by/do now that dplyr::do() is “basically deprecated”.

I looked at a number of tidyverse/tidyr/purrr replacements for do() - if you know of a technique I missed, please let me know on twitter.

• My least favourite technique is to use current era tidyverse with nest/mutate/map/unnest
    • I find it too verbose with a high cognitive overhead
    • have to nest and unnest then operate on a new data list-column
    • the nested data.frame has a new list-column of data.frames which doesn’t contain the grouping variable
    • applying the function to the data.frame subsets requires a mutate AND a map
• Favourite replacement
    • tidyverse/base hybrid with split/map_dfr
    • short and sweet.
    • No need for purrr::map within a mutate
    • No need for dplyr::do() syntax with the . e.g. complex_func(.)

Outstanding issue: Why isn’t there a purrr version of the base function split? According to Hadley on github:

A function that acts rowwise on a data frame doesn’t seem like it should live in purrr.

Next step: Make my own tidyverse split_by?
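
As a starting point, here is a rough sketch of what such a helper might look like. To be clear, split_by() is a hypothetical name and not part of purrr, dplyr or any other package; this version is just a thin wrapper around base split() that takes the column name as a string, and big_df is re-typed from Table 1 so the snippet runs on its own.

```r
# Hypothetical helper: NOT part of purrr, dplyr or any other package.
# Splits a data.frame into a named list of data.frames by one column.
split_by <- function(df, col) {
  stopifnot(is.data.frame(df), is.character(col), length(col) == 1,
            col %in% names(df))
  split(df, df[[col]])
}

# Test data re-typed from Table 1, plus the dummy function from earlier
big_df <- data.frame(
  ID   = c(6, 6, 8, 8, 8, 8),
  mpg  = c(21.0, 18.1, 10.4, 18.7, 14.3, 15.0),
  disp = c(160, 225, 460, 360, 360, 301)
)
complex_func <- function(df) {
  df$N <- nrow(df)
  df$func_has_ID <- "ID" %in% colnames(df)
  df
}

# Split, apply, combine (base R only, so no packages are needed to run this)
pieces    <- split_by(big_df, "ID")
result_df <- do.call(rbind, lapply(pieces, complex_func))
rownames(result_df) <- NULL
```

Because split_by() returns an ordinary named list, it should drop into the hybrid approach in place of split(.$ID), with purrr::map_dfr(complex_func) doing the apply and combine. A version that accepts bare column names would need tidy evaluation, which I haven't attempted here.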