Base R split has issues - Part 3 more idiosyncrasies

Base R split()

My prior post on tidyverse split-apply-combine ended with me favouring split + map_dfr as a replacement for group_by + do. In the earlier posts in this series I found that:

  • Base R split() runtime is quadratic in number of splitting variables - (see post)
  • Base R split() runtime is quadratic in number of groups within each variable - (see post)
  • split() recycles splitting variables and silently drops all NA levels (see post)

In the process of working on my pull request for chop() I found I had even more issues with split(). These are pretty minor, but still (imo) weird.

Order of split() output seems wrong

split() creates groups by cycling through the values of the first variable fastest, then the second variable, and so on.

names(split(mtcars, list(mtcars$cyl, mtcars$am)))
## [1] "4.0" "6.0" "8.0" "4.1" "6.1" "8.1"

However, I would have expected the first variable to be considered somehow the “most important” (most significant bit?) and for it to cycle the slowest/least, i.e.

## [1] "4.0" "4.1" "6.0" "6.1" "8.0" "8.1"
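
If that ordering is what you want, one possible workaround is to build the grouping factor yourself. This is a minimal sketch, not something from the earlier posts: it relies on base R's interaction() and its lex.order argument, and the names grp and pieces are just illustrative.

```r
# Build the grouping factor explicitly so that the first variable varies slowest.
# interaction() is base R; lex.order = TRUE asks for lexicographic level order.
grp <- interaction(mtcars$cyl, mtcars$am, lex.order = TRUE)
pieces <- split(mtcars, grp)
names(pieces)
## [1] "4.0" "4.1" "6.0" "6.1" "8.0" "8.1"
```

The groups themselves are unchanged; only the order of the list elements differs from the default.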

I also expected nest() to order the groups logically, but it seems to produce the nested data.frame in the order in which the group combinations first appear in the original data.

mtcars %>%
  group_by(cyl, am) %>% 
  nest()
## # A tibble: 6 x 3
## # Groups:   cyl, am [6]
##     cyl    am           data
##   <dbl> <dbl> <list<df[,9]>>
## 1     6     1        [3 × 9]
## 2     4     1        [8 × 9]
## 3     6     0        [4 × 9]
## 4     8     0       [12 × 9]
## 5     4     0        [3 × 9]
## 6     8     1        [2 × 9]

Is this unordered output a conscious decision for nest()? Or was it a lazy decision to just not bother reorganising the output? Does anyone require the groups to be in the originally presented order?
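
If sorted groups are needed, a simple option is to sort the nested result by its key columns afterwards. The sketch below assumes dplyr and tidyr are installed, and nested_sorted is just an illustrative name.

```r
library(dplyr)
library(tidyr)

# Nest as before, then sort the one-row-per-group result by the grouping keys.
nested_sorted <- mtcars %>%
  group_by(cyl, am) %>%
  nest() %>%
  ungroup() %>%
  arrange(cyl, am)

# The key columns should now run 4/0, 4/1, 6/0, 6/1, 8/0, 8/1.
nested_sorted[, c("cyl", "am")]
```

Sorting after nesting avoids relying on whatever order nest() happens to use internally.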

split() can do NA levels (with a bit of work)

If a splitting factor has any NA values, split() will usually just silently drop the rows with those values.

test_df <- tibble(x=factor(c(1, 2, NA)), y=letters[1:3], z=101:103)
test_df
## # A tibble: 3 x 3
##   x     y         z
##   <fct> <chr> <int>
## 1 1     a       101
## 2 2     b       102
## 3 <NA>  c       103
split(test_df, test_df$x)
## $`1`
## # A tibble: 1 x 3
##   x     y         z
##   <fct> <chr> <int>
## 1 1     a       101
## 
## $`2`
## # A tibble: 1 x 3
##   x     y         z
##   <fct> <chr> <int>
## 1 2     b       102
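
Because the drop is silent, it can be worth checking row counts before and after the split. Here is a minimal sketch using a plain base R data.frame in place of the tibble above (df, pieces and rows_out are illustrative names):

```r
# Plain data.frame version of the example above (base R only).
df <- data.frame(x = factor(c(1, 2, NA)), y = letters[1:3], z = 101:103)
pieces <- split(df, df$x)

# Compare the number of rows going in with the number coming out.
rows_out <- sum(vapply(pieces, nrow, integer(1)))
nrow(df) - rows_out
## [1] 1
```

A non-zero difference means some rows had an NA in the splitting variable and were discarded.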

However, if you go to the effort of keeping NA as a factor level, then split() will keep that level in the output!

test_df <- tibble(x=factor(c(1, 2, NA), exclude = NULL), y=letters[1:3], z=101:103)
split(test_df, test_df$x)
## $`1`
## # A tibble: 1 x 3
##   x     y         z
##   <fct> <chr> <int>
## 1 1     a       101
## 
## $`2`
## # A tibble: 1 x 3
##   x     y         z
##   <fct> <chr> <int>
## 1 2     b       102
## 
## $<NA>
## # A tibble: 1 x 3
##   x     y         z
##   <fct> <chr> <int>
## 1 <NA>  c       103
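
If the factor already exists without an NA level, base R's addNA() should give the same effect without rebuilding the column. A small sketch with a plain data.frame (df and pieces are illustrative names):

```r
df <- data.frame(x = factor(c(1, 2, NA)), y = letters[1:3], z = 101:103)

# addNA() returns the factor with NA appended as an explicit level,
# so split() keeps the NA rows as their own group.
pieces <- split(df, addNA(df$x))
length(pieces)
## [1] 3
```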

I was surprised to find out that rather than generating a character string “NA” for the name of the list element, split() has actually given me a real NA:

is.na(names(split(test_df, test_df$x)))
## [1] FALSE FALSE  TRUE
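
A real NA name is awkward to refer to directly, so one way to get at that element is a logical index on the names. The sketch below uses a plain data.frame, and the "(missing)" label is an arbitrary choice for illustration.

```r
df <- data.frame(x = factor(c(1, 2, NA), exclude = NULL),
                 y = letters[1:3], z = 101:103)
pieces <- split(df, df$x)

# Pick out the NA-named element via a logical index on the names.
na_piece <- pieces[is.na(names(pieces))][[1]]
na_piece$z
## [1] 103

# Optionally replace the NA name with an ordinary string label.
names(pieces)[is.na(names(pieces))] <- "(missing)"
names(pieces)
```

After the rename, the element can be accessed like any other, e.g. pieces[["(missing)"]].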

However, if you split on 2 variables, only 1 of which has a factor NA level, then it will not create any NA names. So in the case of a single variable with an NA level, you get an actual NA as a list name, but for multiple variables, the NA is converted to the character string "NA".

names(split(test_df, list(test_df$x, test_df$y)))
## [1] "1.a"  "2.a"  "NA.a" "1.b"  "2.b"  "NA.b" "1.c"  "2.c"  "NA.c"

Conclusion

Today I learned you can have an actual NA value as the name of a list or data.frame column.