Split-Apply-Combine
I tend to do quite a lot of coding where data is split into groups, operations performed independently on each group, and then the data re-assembled into a single entity.
This is the classic split-apply-combine pattern as outlined in Hadley Wickham’s JStatSoft paper (pdf) and discussed in Jenny Bryan’s Stat545 notes.
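As a minimal sketch of the three steps in base R (the `scores` data frame and its `team`, `points` and `share` columns are made up for illustration and are not from the references above):

```r
# Hypothetical toy data: two teams, two rows each.
scores <- data.frame(
  team   = c("a", "a", "b", "b"),
  points = c(3, 4, 5, 6)
)

# Split: one data frame per team.
pieces <- split(scores, scores$team)

# Apply: work on each piece independently.
applied <- lapply(pieces, function(piece) {
  piece$share <- piece$points / sum(piece$points)
  piece
})

# Combine: stack the pieces back into a single data frame.
combined <- do.call(rbind, applied)
rownames(combined) <- NULL
print(combined)
```

Each piece only ever sees its own rows, which is why `share` sums to 1 within each team.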
group_by + do is “basically deprecated”
The original way I performed split-apply-combine in the tidyverse was with group_by and do:
library(dplyr)

df <- dplyr::tribble(
  ~type, ~value,
  NA,    1,
  NA,    2,
  1,     3,
  1,     4,
  2,     5,
  2,     6
)

# Note: the actual applied functions are much much much more complex than this
trivial_func <- function(x) {
  x %>% mutate(output = value + 1)
}

# Split by type, apply trivial_func to each group, combine the results
df %>%
  group_by(type) %>%
  do(trivial_func(.))
# A tibble: 6 x 3
# Groups:   type [3]
   type value output
  <dbl> <dbl>  <dbl>
1     1     3      4
2     1     4      5
3     2     5      6
4     2     6      7
5    NA     1      2
6    NA     2      3
However, dplyr::do() is “basically deprecated” (according to Hadley Wickham on Twitter).
@ijlyttle btw dplyr::do() is now basically deprecated in favour of the purrr approach
— Hadley Wickham (@hadleywickham) April 11, 2016
My search for a replacement to dplyr+do in early 2018 wasn’t really successful, and in the end I proposed an addition to tidyr called chop (see the defunct pull request, and the blog post).
dplyr v0.8 to the rescue!
In late 2018 Romain Francois implemented a much better idea with a much better name in the dplyr 0.8 development branch.
The v0.8 release candidate details two sets of functions I am particularly keen to try:
- group_split() - Split data frame by groups
- group_map() and group_walk() - purrr-style functions that can be used to iterate on grouped tibbles
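group_walk() is called for its side effects rather than its return value. A rough sketch, assuming dplyr 0.8 or later is installed; the `seen` vector and the argument names `rows` and `key` are just illustrative choices:

```r
library(dplyr)

df <- tibble(
  type  = c(NA, NA, 1, 1, 2, 2),
  value = 1:6
)

# Side effect: record one summary line per group.
seen <- character()

returned <- df %>%
  group_by(type) %>%
  group_walk(function(rows, key) {
    # `rows` holds the group's data, `key` is a one-row tibble of group values
    seen <<- c(seen, sprintf("type = %s: %d rows", key$type, nrow(rows)))
  })

print(seen)
```

The value that comes back from group_walk() is the grouped input, returned invisibly, so it can sit in the middle of a pipeline.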
group_split()
As detailed in 2 prior posts (here and here), the base split() function has a lot of issues which make it problematic to use with the tidyverse - especially the fact that it drops NA groups completely!
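The NA problem is easy to reproduce with base R alone. A small sketch, using a plain data.frame with the same shape as the example data in this post:

```r
# Same shape as the example data: two rows have an NA group.
df_base <- data.frame(
  type  = c(NA, NA, 1, 1, 2, 2),
  value = 1:6
)

pieces <- split(df_base, df_base$type)

# Only the non-NA groups survive the split.
print(names(pieces))

# Count the rows that made it through.
rows_kept <- sum(vapply(pieces, nrow, integer(1)))
print(rows_kept)
```

The two rows where type is NA are silently dropped, so anything recombined from `pieces` is shorter than the input.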
The new group_split() is a super-charged, tidyverse-aware version of split().
Combined with purrr::map_dfr(), this looks like it’s going to be part of my preferred split-apply-combine technique.
df %>%
  group_split(type) %>%
  purrr::map_dfr(trivial_func)
# A tibble: 6 x 3
   type value output
  <dbl> <dbl>  <dbl>
1     1     3      4
2     1     4      5
3     2     5      6
4     2     6      7
5    NA     1      2
6    NA     2      3
group_map()
group_map() is a way of applying a function to each group.
The applied function should take at least 2 arguments:
- The data.frame for a single group, but without the grouping columns.
- A single-row data frame with the group columns for just this group.
trivial_func2 <- function(df, group_info) {
  df %>% mutate(
    inner_cols = ncol(df),
    group_rows = nrow(group_info),
    is_grouped = is_grouped_df(df)
  )
}

df %>%
  group_by(type) %>%
  group_map(trivial_func2)
[[1]]
# A tibble: 2 x 4
  value inner_cols group_rows is_grouped
  <dbl>      <int>      <int> <lgl>
1     3          1          1 FALSE
2     4          1          1 FALSE

[[2]]
# A tibble: 2 x 4
  value inner_cols group_rows is_grouped
  <dbl>      <int>      <int> <lgl>
1     5          1          1 FALSE
2     6          1          1 FALSE

[[3]]
# A tibble: 2 x 4
  value inner_cols group_rows is_grouped
  <dbl>      <int>      <int> <lgl>
1     1          1          1 FALSE
2     2          1          1 FALSE
Things to note in the output:
- The data.frames passed into trivial_func2() are
  - ungrouped.
  - missing the grouping column (type).
- The group_info is a single-row data.frame with just the group information.
This style of splitting the grouping columns from the data within each group is similar to my mental model of how tidyr::nest()/unnest() handles data. It’s not quite how I think about my data, so I think group_split() will be my jam.
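For comparison, here is roughly what the nest()/unnest() version of the same idea looks like. This is only a sketch, assuming dplyr, tidyr and purrr are installed; the inline formula stands in for a real per-group function:

```r
library(dplyr)

df <- tibble(
  type  = c(NA, NA, 1, 1, 2, 2),
  value = 1:6
)

# Split: one row per group, with that group's rows tucked into a list-column.
nested <- df %>%
  group_by(type) %>%
  tidyr::nest() %>%
  ungroup()

# Apply to each nested data frame, then combine by unnesting.
result <- nested %>%
  mutate(data = purrr::map(data, ~ mutate(.x, output = value + 1))) %>%
  tidyr::unnest(data)

print(result)
```

As with group_map(), the function applied to each piece never sees the grouping column; it only gets put back when the list-column is unnested.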
TL;DR
- group_by() + do() is “basically deprecated”.
- group_split() + purrr::map_dfr() is an excellent replacement.