Base R split has issues - Part 3 more idiosyncrasies

Base R `split()`

My prior post on tidyverse split-apply-combine ended with me favouring split + map_dfr as a replacement for group_by + do. In prior posts I saw:

Base R split() runtime is quadratic in the number of splitting variables - (see post)
Base R split() runtime is quadratic in the number of groups within each variable - (see post)
split() recycles splitting variables and silently drops all NA levels - (see post)

In the process of working on my pull request for chop() I found I had even more issues with split(). These are pretty minor, but still (imo) weird.

Order of `split()` output seems wrong

split() orders its groups so that the first variable cycles fastest: it runs through every value of the first variable before moving to the next value of the second variable, and so on.

names(split(mtcars, list(mtcars$cyl, mtcars$am)))

## [1] "4.0" "6.0" "8.0" "4.1" "6.1" "8.1"

However, I would have expected the first variable to be considered somehow the “most important” (most significant bit?) and for it to cycle the slowest/least, i.e.

## [1] "4.0" "4.1" "6.0" "6.1" "8.0" "8.1"
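One way to get that ordering is sketched below. It assumes you are happy to build the grouping factor by hand: interaction() takes a lex.order argument, and its result can be passed to split() as the splitting factor.

```r
# Build the grouping factor explicitly so the first variable varies slowest.
# lex.order = TRUE asks interaction() for lexically ordered levels.
grp <- interaction(mtcars$cyl, mtcars$am, lex.order = TRUE)
res <- split(mtcars, grp)
names(res)

## [1] "4.0" "4.1" "6.0" "6.1" "8.0" "8.1"
```

The groups themselves are unchanged; only the order of the list elements differs from the default.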

I also expected nest() to order the groups logically, but it seems it produces the nested data.frame in the order in which the groups first appear in the original data.

mtcars %>%
  group_by(cyl, am) %>% 
  nest()

## # A tibble: 6 x 3
## # Groups:   cyl, am [6]
##     cyl    am           data
##   <dbl> <dbl> <list<df[,9]>>
## 1     6     1        [3 × 9]
## 2     4     1        [8 × 9]
## 3     6     0        [4 × 9]
## 4     8     0       [12 × 9]
## 5     4     0        [3 × 9]
## 6     8     1        [2 × 9]

Is this unordered output a conscious decision for nest()? Or was it a lazy decision to just not bother reorganising the output? Does anyone require the groups to be in the originally presented order?
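If sorted groups are what you want, a workaround is to sort the nested result on the grouping keys yourself. This is only a sketch, and it assumes dplyr and tidyr are installed; the name nested is just for illustration.

```r
library(dplyr)
library(tidyr)

nested <- mtcars %>%
  group_by(cyl, am) %>%
  nest() %>%
  ungroup() %>%        # drop the grouping so arrange() sorts the whole table
  arrange(cyl, am)     # first key varies slowest, as I originally expected

nested$cyl
nested$am
```

Sorting mtcars with arrange(cyl, am) before the group_by() should give the same order, since nest() follows first appearance.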

`split()` can do NA levels (with a bit of work)

If a splitting factor has any NA values, split() will usually just silently drop the rows that contain them.

test_df <- data_frame(x=factor(c(1, 2, NA)), y=letters[1:3], z=101:103)

## Warning: `data_frame()` is deprecated, use `tibble()`.
## This warning is displayed once per session.

## # A tibble: 3 x 3
##   x     y         z
##   <fct> <chr> <int>
## 1 1     a       101
## 2 2     b       102
## 3 <NA>  c       103

split(test_df, test_df$x)

## $`1`
## # A tibble: 1 x 3
##   x     y         z
##   <fct> <chr> <int>
## 1 1     a       101
## 
## $`2`
## # A tibble: 1 x 3
##   x     y         z
##   <fct> <chr> <int>
## 1 2     b       102
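A quick way to notice the dropped rows is to compare row counts before and after the split. The sketch below uses a plain data.frame instead of a tibble so that it runs in base R; the names df and pieces are only for illustration.

```r
# Same data as above, but as a base data.frame
df <- data.frame(x = factor(c(1, 2, NA)), y = letters[1:3], z = 101:103)

pieces <- split(df, df$x)

# Rows that survive the split versus rows in the original data
rows_after <- sum(vapply(pieces, nrow, integer(1)))
rows_after   # 2
nrow(df)     # 3
```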

However, if you go to the effort of keeping NA as a factor level (here via exclude = c(), i.e. NULL), then split() will keep all levels in the output, NA included!

test_df <- data_frame(x=factor(c(1, 2, NA), exclude = c()), y=letters[1:3], z=101:103)
split(test_df, test_df$x)

## $`1`
## # A tibble: 1 x 3
##   x     y         z
##   <fct> <chr> <int>
## 1 1     a       101
## 
## $`2`
## # A tibble: 1 x 3
##   x     y         z
##   <fct> <chr> <int>
## 1 2     b       102
## 
## $<NA>
## # A tibble: 1 x 3
##   x     y         z
##   <fct> <chr> <int>
## 1 <NA>  c       103
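If the factor already exists and was built with the default exclude, base R's addNA() can add the NA level at split time. A minimal sketch, again using a plain data.frame rather than a tibble:

```r
# Factor built the default way, so NA is not a level yet
df <- data.frame(x = factor(c(1, 2, NA)), y = letters[1:3], z = 101:103)

# addNA() returns the factor with NA added as an extra level
pieces <- split(df, addNA(df$x))
length(pieces)   # 3
```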

I was surprised to find out that rather than generating a character string “NA” for the name of the list element, split() has actually given me a real NA.

is.na(names(split(test_df, test_df$x)))

## [1] FALSE FALSE  TRUE
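Because the name is a real NA, I would rather find that element by position than by name. The sketch below locates it and then gives it an ordinary label; the label "missing" is purely my own choice, and a plain data.frame stands in for the tibble.

```r
df <- data.frame(x = factor(c(1, 2, NA), exclude = NULL),
                 y = letters[1:3], z = 101:103)
pieces <- split(df, df$x)

# Position of the element whose name is NA
na_pos <- which(is.na(names(pieces)))
pieces[[na_pos]]

# Replace the NA name with a normal string ("missing" is a hypothetical label)
names(pieces)[na_pos] <- "missing"
names(pieces)
```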

However, if you split on 2 variables, only 1 of which has an NA factor level, then it will not create any NA names. So with a single variable that has an NA level you get an actual NA as a list name, but with multiple variables the NA is converted to the character string “NA”.

names(split(test_df, list(test_df$x, test_df$y)))

## [1] "1.a"  "2.a"  "NA.a" "1.b"  "2.b"  "NA.b" "1.c"  "2.c"  "NA.c"
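The difference between the two cases can be checked directly with anyNA(). This is a small sketch on a plain data.frame; single and multi are just illustrative names.

```r
df <- data.frame(x = factor(c(1, 2, NA), exclude = NULL),
                 y = letters[1:3], z = 101:103)

single <- names(split(df, df$x))               # one splitting variable
multi  <- names(split(df, list(df$x, df$y)))   # two splitting variables

anyNA(single)        # TRUE: the third name is a real NA
anyNA(multi)         # FALSE: every name is an ordinary string
"NA.c" %in% multi    # TRUE: the NA level was pasted in as text
```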

Conclusion

Today I learned you can have an actual NA value as the name of a list element or data.frame column.
