tidyverse split(): The journey to the pull request
Last week I searched for a replacement for group_by + do,
and this ended with split + map_dfr being my favourite alternative. Conceptually it was the most compact representation of the idea (just 2 commands) and avoided
the extra work that seemed necessary to nest a data.frame and operate on the nested data.
I then looked more closely at the split() function, in particular
its runtime characteristics and
its idiosyncrasies, and noted that in the base R split() function:
- runtime is quadratic in number of splitting variables - something nobody ever wants
 - runtime is quadratic in number of groups within each variable - something nobody ever wants
 - the splitting variable gets recycled if it’s not as long as the data.frame being split - something nobody ever wants
 - values corresponding to NA values of the split variable are completely dropped from the data - something nobody ever wants
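
The last two points are easy to reproduce in base R. The following is a minimal sketch on a made-up four-row data frame (the names df, x and g are illustrative only and do not come from the original analysis):

```r
# Toy data: one row has a missing value in the splitting variable.
df <- data.frame(x = 1:4, g = c("a", "b", NA, "a"))

# Rows whose splitting value is NA are silently dropped.
by_g <- split(df, df$g)
rows_kept <- sum(vapply(by_g, nrow, integer(1)))
rows_kept < nrow(df)   # TRUE: the NA row is gone

# A splitting vector shorter than the data is silently recycled.
recycled <- split(df, c("p", "q"))
recycled$p$x           # rows 1 and 3 end up in group "p"
```

Neither call raises a warning here, which is what makes both behaviours easy to miss.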
 
After seeing these problems, I sketched out a tidyverse version of split(), which I called cleave_by(), and wrote a post about how it solved some of split()'s issues.
hadleywickham suggested that I submit the code to the tidyr package. So I tidied the code, renamed the function to chop(), and opened a pull request.
The rest of this post briefly shows how chop() + map_dfr() is a workable replacement for group_by() + do().
cleave_by() tidied up => chop()
gshotwell suggested that cleave_by() ought to
respect any groupings on the data.frame and
hadleywickham suggested that the behaviour should
probably be similar to tidyr::nest().
With that in mind, I rewrote cleave_by() to be more nest()-like, and renamed it to chop() (as no-one really liked the verb cleave
and I wasn’t attached to it either).
library(rlang)
chop <- function(data, ...) {
  chop_vars <- unname(tidyselect::vars_select(names(data), ...))
  # Only use group vars if no chop vars specified
  if (is_empty(chop_vars)) {
    chop_vars <- dplyr::group_vars(data)
  }
  data <- dplyr::ungroup(data)   # Same as nest() - chopped data frames are ungrouped.
  data <- dplyr::as_tibble(data) # Ensure we consistently return a list of tibbles
  if (is_empty(chop_vars) || nrow(data) == 0) {
    return(list(data))
  }
  idx <- dplyr::group_indices(data, !!! syms(chop_vars))
  unname(split(data, idx))
}
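
One consequence of building the split on dplyr's group indices rather than on the raw variable is that rows with a missing grouping value get a group of their own instead of disappearing. Here is a small sketch, assuming dplyr is installed and using a made-up tibble (df, x and g are illustrative names). It calls group_indices() on an already-grouped tibble, which is a slight variation on the call inside chop():

```r
library(dplyr)

df <- tibble(x = 1:4, g = c("a", "b", NA, "a"))

# One integer id per row; NA values form their own group.
idx <- group_indices(group_by(df, g))

pieces <- unname(split(df, idx))
length(pieces)                          # number of groups, including the NA group
sum(vapply(pieces, nrow, integer(1)))   # every row is retained
```

Because idx always has exactly one entry per row, the recycling behaviour of base split() cannot be triggered either.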
Using chop() + map_dfr() to replace group_by() + do()
In the following (very simplified!) application of split-apply-combine, I show how group_by() + do() and
chop() + map_dfr() can be used
to apply complex_func() to mtcars subsetted into groups by the value of cyl.
Using chop() + map_dfr() turns out to be a tiny bit simpler than group_by() + do(), as there is no need for the final ungrouping (which can be disastrous if you ever forget to do it!).
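
The definition of complex_func() is not shown in this post. A hypothetical stand-in that would produce a new_value column like the one in the output below might look as follows. It is only a sketch, and it assumes each input is a single-group data frame with a cyl column:

```r
# Hypothetical stand-in for complex_func(); the real function is not shown.
# Assumes `df` holds exactly one group, so cyl is constant within it.
complex_func <- function(df) {
  df$new_value <- paste("cyl plus one is", df$cyl[1] + 1)
  df
}

# Quick check on the 4-cylinder subset of mtcars
four_cyl <- mtcars[mtcars$cyl == 4, c("mpg", "cyl", "disp")]
head(complex_func(four_cyl), 3)
```

Any function that takes one data frame and returns one data frame fits the same slot in both pipelines.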
library(dplyr)  # provides %>%, select(), group_by(), do() and ungroup()
mtcars %>%
  select(mpg, cyl, disp) %>%
  group_by(cyl) %>%
  do(complex_func(.)) %>%
  ungroup()
## # A tibble: 32 x 4
##      mpg   cyl  disp new_value        
##    <dbl> <dbl> <dbl> <chr>            
##  1  22.8     4 108   cyl plus one is 5
##  2  24.4     4 147.  cyl plus one is 5
##  3  22.8     4 141.  cyl plus one is 5
##  4  32.4     4  78.7 cyl plus one is 5
##  5  30.4     4  75.7 cyl plus one is 5
##  6  33.9     4  71.1 cyl plus one is 5
##  7  21.5     4 120.  cyl plus one is 5
##  8  27.3     4  79   cyl plus one is 5
##  9  26       4 120.  cyl plus one is 5
## 10  30.4     4  95.1 cyl plus one is 5
## # … with 22 more rows
library(purrr)  # provides map_dfr()
mtcars %>%
  select(mpg, cyl, disp) %>%
  chop(cyl) %>%
  map_dfr(complex_func)
## # A tibble: 32 x 4
##      mpg   cyl  disp new_value        
##    <dbl> <dbl> <dbl> <chr>            
##  1  22.8     4 108   cyl plus one is 5
##  2  24.4     4 147.  cyl plus one is 5
##  3  22.8     4 141.  cyl plus one is 5
##  4  32.4     4  78.7 cyl plus one is 5
##  5  30.4     4  75.7 cyl plus one is 5
##  6  33.9     4  71.1 cyl plus one is 5
##  7  21.5     4 120.  cyl plus one is 5
##  8  27.3     4  79   cyl plus one is 5
##  9  26       4 120.  cyl plus one is 5
## 10  30.4     4  95.1 cyl plus one is 5
## # … with 22 more rows
Conclusion
- A case for a tidyverse split() has been made.
 - The code has been written.
 - A pull request has been opened.
 - Now I wait…