Base R split()
My prior post on tidyverse split-apply-combine ended with me favouring split + map_dfr as a replacement for group_by + do. In prior posts I saw:
- Base R split() runtime is quadratic in the number of splitting variables (see post)
- Base R split() runtime is quadratic in the number of groups within each variable (see post)
- split() recycles splitting variables and silently drops all NA levels (see post)
In the process of working on my pull request for chop() I found I had even more issues with split(). These are pretty minor, but still (imo) weird.
Order of split() output seems wrong
split() creates groups by cycling through the values of the first variable fastest, then the second variable, and so on.
names(split(mtcars, list(mtcars$cyl, mtcars$am)))
## [1] "4.0" "6.0" "8.0" "4.1" "6.1" "8.1"
However, I would have expected the first variable to be considered somehow the “most important” (most significant bit?) and for it to cycle the slowest/least, i.e.
## [1] "4.0" "4.1" "6.0" "6.1" "8.0" "8.1"
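One way to get that ordering, as a minimal sketch: build the grouping factor yourself with interaction() and ask for lexical ordering via its lex.order argument, then hand that factor to split(). The names f and groups are just illustrative.

```r
# Build the grouping factor explicitly so that the first variable (cyl)
# varies slowest and the last variable (am) varies fastest.
f <- interaction(mtcars$cyl, mtcars$am, lex.order = TRUE)

# Split on the single pre-built factor instead of a list of variables.
groups <- split(mtcars, f)

names(groups)
## [1] "4.0" "4.1" "6.0" "6.1" "8.0" "8.1"
```

The groups themselves are unchanged; only the order of the list elements differs from the default.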
I also expected nest() to order the groups logically, but it seems to produce the nested data.frame with the groups in the order in which each combination first appears in the original data.
mtcars %>%
group_by(cyl, am) %>%
nest()
## # A tibble: 6 x 3
## # Groups: cyl, am [6]
## cyl am data
## <dbl> <dbl> <list<df[,9]>>
## 1 6 1 [3 × 9]
## 2 4 1 [8 × 9]
## 3 6 0 [4 × 9]
## 4 8 0 [12 × 9]
## 5 4 0 [3 × 9]
## 6 8 1 [2 × 9]
Is this unordered output a conscious decision for nest()? Or was it a lazy decision to just not bother reorganising the output? Does anyone require the groups to be in the originally presented order?
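If sorted groups are what you want, one workaround is to sort the nested result afterwards. This is only a sketch, assuming dplyr and tidyr are installed; nested_sorted is an illustrative name.

```r
library(dplyr)
library(tidyr)

nested_sorted <- mtcars %>%
  group_by(cyl, am) %>%
  nest() %>%
  ungroup() %>%        # drop the grouping so arrange() sorts the whole table
  arrange(cyl, am)     # first variable varies slowest

nested_sorted
```

Each row still carries its own nested data frame in the data column; only the row order changes.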
split() can do NA levels (with a bit of work)
If a splitting factor has any NA values, split() will usually just silently drop the corresponding rows.
test_df <- tibble(x=factor(c(1, 2, NA)), y=letters[1:3], z=101:103)
test_df
## # A tibble: 3 x 3
## x y z
## <fct> <chr> <int>
## 1 1 a 101
## 2 2 b 102
## 3 <NA> c 103
split(test_df, test_df$x)
## $`1`
## # A tibble: 1 x 3
## x y z
## <fct> <chr> <int>
## 1 1 a 101
##
## $`2`
## # A tibble: 1 x 3
## x y z
## <fct> <chr> <int>
## 1 2 b 102
However, if you go to the effort to not exclude NA values from the factor levels, then split() will keep all levels in the output!
test_df <- tibble(x=factor(c(1, 2, NA), exclude = c()), y=letters[1:3], z=101:103)
split(test_df, test_df$x)
## $`1`
## # A tibble: 1 x 3
## x y z
## <fct> <chr> <int>
## 1 1 a 101
##
## $`2`
## # A tibble: 1 x 3
## x y z
## <fct> <chr> <int>
## 1 2 b 102
##
## $<NA>
## # A tibble: 1 x 3
## x y z
## <fct> <chr> <int>
## 1 <NA> c 103
I was surprised to find out that rather than generating a character string “NA” for the name of the list element, split() has actually given me a real NA:
is.na(names(split(test_df, test_df$x)))
## [1] FALSE FALSE TRUE
However, if you split on 2 variables, only 1 of which has a factor NA level, then it will not create any NA labels. So in the case of a single variable with an NA level, you get an actual NA as a list label, but for multiple variables, the NA is converted to character.
names(split(test_df, list(test_df$x, test_df$y)))
## [1] "1.a" "2.a" "NA.a" "1.b" "2.b" "NA.b" "1.c" "2.c" "NA.c"
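Because the single-variable case gives a real NA name, you can't pull that group out with a quoted "NA". A small base R sketch of one way around it, using addNA() as an alternative to the exclude argument for making NA an explicit level (x, res and na_group are illustrative names):

```r
# addNA() makes NA an explicit factor level.
x <- addNA(factor(c(1, 2, NA)))
res <- split(101:103, x)

# The third name is a real NA, not the string "NA".
is.na(names(res))
## [1] FALSE FALSE  TRUE

# Look the NA group up by position rather than by name.
na_group <- res[[which(is.na(names(res)))]]
na_group
## [1] 103
```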
Conclusion
Today I learned you can have an actual NA value as the name of a list or data.frame column.