tidyverse split(): The journey to the pull request
Last week I searched for a replacement for group_by + do, and this ended with split + map_dfr being my favourite alternative. Conceptually it was the most compact representation of the idea (just 2 commands) and avoided the extra work that seemed necessary to nest a data.frame and operate on the nested data.
I then looked more closely at the split() function, in particular its runtime characteristics and its idiosyncrasies, and noted that in the base R split() function:
- runtime is quadratic in number of splitting variables - something nobody ever wants
- runtime is quadratic in number of groups within each variable - something nobody ever wants
- the splitting variable gets recycled if it’s not as long as the data.frame being split - something nobody ever wants
- values corresponding to NA values of the split variable are completely dropped from the data - something nobody ever wants
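The last two points are easy to reproduce in base R. The following is a minimal sketch on a made-up four-row data.frame (the names df, by_g and recycled are just for illustration):

```r
# A tiny data.frame whose split variable contains an NA
df <- data.frame(
  x = 1:4,
  g = c("a", "b", NA, "a"),
  stringsAsFactors = FALSE
)

# Rows where g is NA are silently dropped: only 3 of the 4 rows survive
by_g <- split(df, df$g)
sum(vapply(by_g, nrow, integer(1)))

# A splitting variable shorter than the data is silently recycled:
# the 2-element vector is stretched over all 4 rows
recycled <- split(df, c("u", "v"))
vapply(recycled, nrow, integer(1))
```

Neither call gives a warning here, which is what makes both behaviours easy to miss.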
After seeing these problems, I sketched out a tidyverse version of split(), which I called cleave_by, and wrote a post about how it solved some of split()'s issues.
hadleywickham suggested that I submit the code to the tidyr package. So I tidied the code, renamed the function to chop(), and opened a pull request.
The rest of this post briefly shows how chop() + map_dfr() is a workable replacement for group_by() + do().
cleave_by() tidied up => chop()
gshotwell suggested that cleave_by() ought to respect any groupings on the data.frame, and hadleywickham suggested that the behaviour should probably be similar to tidyr::nest().
With that in mind, I rewrote cleave_by() to be more nest()-like, and renamed it to chop() (as no-one really liked the verb cleave and I wasn’t attached to it either).
library(rlang)

chop <- function(data, ...) {
  chop_vars <- unname(tidyselect::vars_select(names(data), ...))

  # Only use group vars if no chop vars specified
  if (is_empty(chop_vars)) {
    chop_vars <- dplyr::group_vars(data)
  }

  data <- dplyr::ungroup(data)   # Same as nest() - chopped data frames are ungrouped.
  data <- dplyr::as_tibble(data) # Ensure we consistently return a list of tibbles

  if (is_empty(chop_vars) || nrow(data) == 0) {
    return(list(data))
  }

  idx <- dplyr::group_indices(data, !!! syms(chop_vars))
  unname(split(data, idx))
}
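The key step in chop() is to split on an integer group index rather than on the raw variable. The following base R sketch is not the chop() implementation; it only illustrates that idea for a single variable, using match() as a stand-in for the group index and assuming that NA keys should form a group of their own. The data and the names df, idx and parts are made up for illustration.

```r
# Toy data with NA values in the splitting variable
df <- data.frame(
  x = 1:5,
  g = c("a", "b", NA, "a", NA),
  stringsAsFactors = FALSE
)

# Build an integer group index; match() pairs NA with NA,
# so the NA rows get an index of their own instead of being lost
idx <- match(df$g, unique(df$g))

# Splitting on the index keeps every row and returns an unnamed list
parts <- unname(split(df, idx))
length(parts)
sum(vapply(parts, nrow, integer(1)))
```

Because the index never contains NA and always has one entry per row, split() has nothing to drop and nothing to recycle.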
Using chop() + map_dfr() to replace group_by() + do()
In the following (very simplified!) application of split-apply-combine, I show how group_by() + do() and chop() + map_dfr() can be used to apply complex_func() to mtcars subsetted into groups by the value of cyl.
Using chop() + map_dfr() turns out to be a tiny bit simpler than group_by() + do() as there is no need for the final ungrouping (which can be disastrous if you ever forget to do it!).
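The definition of complex_func() is not shown in this post. A hypothetical stand-in that would produce a new_value column like the one in the output below might look like this (a sketch only, assuming each group it receives has a single cyl value):

```r
# Hypothetical stand-in for complex_func(): takes one group's data frame
# and adds a new_value column derived from that group's cyl value
complex_func <- function(df) {
  df$new_value <- paste("cyl plus one is", df$cyl[1] + 1)
  df
}

# Try it on the 4-cylinder cars only
four_cyl <- mtcars[mtcars$cyl == 4, c("mpg", "cyl", "disp")]
result <- complex_func(four_cyl)
head(result, 3)
```

Any function that takes a data frame and returns a data frame will work in the same position.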
mtcars %>%
  select(mpg, cyl, disp) %>%
  group_by(cyl) %>%
  do(complex_func(.)) %>%
  ungroup()
## # A tibble: 32 x 4
## mpg cyl disp new_value
## <dbl> <dbl> <dbl> <chr>
## 1 22.8 4 108 cyl plus one is 5
## 2 24.4 4 147. cyl plus one is 5
## 3 22.8 4 141. cyl plus one is 5
## 4 32.4 4 78.7 cyl plus one is 5
## 5 30.4 4 75.7 cyl plus one is 5
## 6 33.9 4 71.1 cyl plus one is 5
## 7 21.5 4 120. cyl plus one is 5
## 8 27.3 4 79 cyl plus one is 5
## 9 26 4 120. cyl plus one is 5
## 10 30.4 4 95.1 cyl plus one is 5
## # … with 22 more rows
mtcars %>%
  select(mpg, cyl, disp) %>%
  chop(cyl) %>%
  map_dfr(complex_func)
## # A tibble: 32 x 4
## mpg cyl disp new_value
## <dbl> <dbl> <dbl> <chr>
## 1 22.8 4 108 cyl plus one is 5
## 2 24.4 4 147. cyl plus one is 5
## 3 22.8 4 141. cyl plus one is 5
## 4 32.4 4 78.7 cyl plus one is 5
## 5 30.4 4 75.7 cyl plus one is 5
## 6 33.9 4 71.1 cyl plus one is 5
## 7 21.5 4 120. cyl plus one is 5
## 8 27.3 4 79 cyl plus one is 5
## 9 26 4 120. cyl plus one is 5
## 10 30.4 4 95.1 cyl plus one is 5
## # … with 22 more rows
Conclusion
- A case for a tidyverse split() has been made.
- The code has been written.
- A pull request has been opened.
- Now I wait…