mikefc

Split-Apply-Combine

I tend to do quite a lot of coding where data is split into groups, operations performed independently on each group, and then the data re-assembled into a single entity.

This is the classic split-apply-combine pattern, as outlined in Hadley Wickham’s JStatSoft paper (pdf) and discussed in Jenny Bryan’s Stat545 notes.

group_by + do is “basically deprecated”

The original way I performed split-apply-combine in the tidyverse was with group_by and do:

library(dplyr)  # provides tribble(), %>%, mutate(), group_by() and do()

df <- dplyr::tribble(
  ~type, ~value,
  NA   , 1,
  NA   , 2,
  1    , 3,
  1    , 4,
  2    , 5, 
  2    , 6
)

# Note: the actual applied functions are much much much more complex than this
trivial_func <- function(x) {
  x %>% mutate(output = value + 1)
}


df %>%
  group_by(type) %>%
  do(trivial_func(.))
# A tibble: 6 x 3
# Groups:   type [3]
   type value output
  <dbl> <dbl>  <dbl>
1     1     3      4
2     1     4      5
3     2     5      6
4     2     6      7
5    NA     1      2
6    NA     2      3

However, dplyr::do() is “basically deprecated” (according to Hadley Wickham on Twitter).

My search for a replacement for group_by + do in early 2018 wasn’t really successful, and in the end I proposed an addition to tidyr called chop (see the defunct pull request, and the blog post).

dplyr v0.8 to the rescue!

In late 2018 Romain Francois implemented a much better idea with a much better name in the dplyr 0.8 development branch.

The v0.8 release candidate details two functions I am particularly keen to try:

  • group_split() - Split data frame by groups
  • group_map() and group_walk() - purrr-style functions that can be used to iterate on grouped tibbles
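group_walk() is the side-effect sibling of group_map(): it calls a function once per group and hands the input back unchanged. A minimal sketch, assuming the same two-argument (data, key) convention described for group_map() further down. The names rows, key and seen_groups are just mine for illustration:

```r
library(dplyr)

df <- tribble(
  ~type, ~value,
  NA   , 1,
  NA   , 2,
  1    , 3,
  1    , 4,
  2    , 5,
  2    , 6
)

# Record which groups were visited, purely as a side effect
seen_groups <- c()

result <- df %>%
  group_by(type) %>%
  group_walk(function(rows, key) {
    # rows: the data for one group
    # key:  a one-row tibble identifying the group
    seen_groups <<- c(seen_groups, key$type)
    message("type = ", key$type, ": ", nrow(rows), " rows")
  })

# result is the original grouped tibble, so the pipeline can carry on from here
```

This is handy when the per-group work is writing a file or drawing a plot rather than computing new columns.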

group_split()

As detailed in 2 prior posts (here and here), the base split() function has a lot of issues which make it problematic to use with the tidyverse - especially the fact that it drops NA groups completely!
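The NA problem is easy to reproduce. A small sketch using only base R on the same shape of data as above (base_df and the addNA() workaround are my own illustration, not something taken from the earlier posts):

```r
base_df <- data.frame(
  type  = c(NA, NA, 1, 1, 2, 2),
  value = c(1, 2, 3, 4, 5, 6)
)

# split() converts the grouping vector to a factor, and NA is not a factor
# level by default, so the two NA rows silently disappear
pieces <- split(base_df, base_df$type)
length(pieces)                # 2 groups, not 3
sum(sapply(pieces, nrow))     # 4 rows, not 6

# One workaround: make NA an explicit factor level before splitting
pieces_na <- split(base_df, addNA(base_df$type))
length(pieces_na)             # 3 groups
sum(sapply(pieces_na, nrow))  # all 6 rows
```

The workaround does the job, but it is exactly the kind of thing that is easy to forget.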

The new group_split() is a super-charged, tidyverse-aware version of split().

Combined with purrr::map_dfr(), this looks like it’s going to be part of my preferred split-apply-combine technique.

df %>%
  group_split(type) %>%
  purrr::map_dfr(trivial_func)
# A tibble: 6 x 3
   type value output
  <dbl> <dbl>  <dbl>
1     1     3      4
2     1     4      5
3     2     5      6
4     2     6      7
5    NA     1      2
6    NA     2      3
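To see what is going on between the two steps, here is a sketch that stops after group_split() and then does the apply and the combine by hand with lapply() and dplyr::bind_rows(). This is just my own unrolled version of the pipeline above, not a different recommended technique:

```r
library(dplyr)

df <- tribble(
  ~type, ~value,
  NA   , 1,
  NA   , 2,
  1    , 3,
  1    , 4,
  2    , 5,
  2    , 6
)

trivial_func <- function(x) {
  x %>% mutate(output = value + 1)
}

# Split: a plain list of ungrouped tibbles, one per group, NA group included
pieces <- df %>% group_split(type)
length(pieces)  # 3

# Apply: run the function on each piece independently
results <- lapply(pieces, trivial_func)

# Combine: stack the per-group results back into a single tibble
combined <- bind_rows(results)
```

Each element of pieces still carries its type column, which is why trivial_func() needs no knowledge of the grouping at all.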

group_map()

group_map() is a way of applying a function to each group.

The applied function should take at least 2 arguments:

  1. The data.frame for a single group, but without the grouping columns.
  2. A single-row data.frame with the group columns for just this group.

trivial_func2 <- function(df, group_info) {
  df %>% mutate(
    inner_cols = ncol(df),
    group_rows = nrow(group_info),
    is_grouped = is_grouped_df(df)
  )
}

df %>%
  group_by(type) %>%
  group_map(trivial_func2)
# A tibble: 6 x 5
# Groups:   type [3]
   type value inner_cols group_rows is_grouped
* <dbl> <dbl>      <int>      <int> <lgl>     
1     1     3          1          1 FALSE     
2     1     4          1          1 FALSE     
3     2     5          1          1 FALSE     
4     2     6          1          1 FALSE     
5    NA     1          1          1 FALSE     
6    NA     2          1          1 FALSE     

Things to note in the output:

  • The data.frames passed into trivial_func2() are
    • ungrouped
    • missing the grouping column (type)
  • The group_info is a single-row data.frame with just the group information.
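One thing to watch if you run this against a later dplyr: as far as I can tell from the release notes, from 0.8.1 onwards group_map() returns a plain list of per-group results, and the behaviour shown above (results stitched back into a grouped tibble) lives in group_modify(). A sketch under that assumption, reusing trivial_func2() unchanged:

```r
library(dplyr)

df <- tribble(
  ~type, ~value,
  NA   , 1,
  NA   , 2,
  1    , 3,
  1    , 4,
  2    , 5,
  2    , 6
)

trivial_func2 <- function(df, group_info) {
  df %>% mutate(
    inner_cols = ncol(df),
    group_rows = nrow(group_info),
    is_grouped = is_grouped_df(df)
  )
}

# Assumes dplyr >= 0.8.1, where group_modify() is available.
# Returns one grouped tibble with the type column re-attached.
as_tibble_result <- df %>%
  group_by(type) %>%
  group_modify(trivial_func2)

# In those versions group_map() returns a list with one element per group,
# and the elements do not include the type column.
as_list_result <- df %>%
  group_by(type) %>%
  group_map(trivial_func2)
```

The function being applied is identical in both cases. Only the shape of what comes back differs.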

This style of splitting the grouping-columns from the data-within-each-group is similar to my mental model of how tidyr::nest()/unnest() handles data. It’s not quite how I think about my data, so I think group_split() will be my jam.

TL;DR

  • group_by() + do() is “basically deprecated”
  • group_split() + purrr::map_dfr() is an excellent replacement.