Base R split()
My prior post on tidyverse split-apply-combine ended with me favouring split + map_dfr as a replacement for group_by + do. In this post I look at some idiosyncrasies of the split function from base R.
The main 2 gotchas with base R split():
- the splitting variable gets recycled if it’s not as long as the data.frame being split
- NA levels are dropped from the data
Idiosyncrasy 1: the splitting variable gets recycled if it’s not long enough
When you split() a data.frame, the splitting factor needs to be the same length as the data.frame, i.e. one value per row.
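If the splitting factor is shorter than the data, then split() will assume (foolishly!) that you want to keep recycling through the factor to cover as many rows as necessary.
Almost nobody ever wants this behaviour!!
R will warn about a length mismatch ("data length is not a multiple of split variable"), but only when the number of rows is not an exact multiple of the factor's length, and it goes ahead with the recycling either way. The example below has 6 rows and a factor of length 2, so it runs without any warning at all.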
In the following example, note how the rows are assigned alternately to group 1 and group 2, as split() keeps cycling through the bad_factor variable until it has covered every row of the data.frame.
test_df <- data.frame(a = letters[1:6], good_factor = c(1, 1, 1, 1, 2, 2))
bad_factor <- c(1, 2)
split(test_df, bad_factor) # oops I've used bad_factor by mistake!
## $`1`
##   a good_factor
## 1 a           1
## 3 c           1
## 5 e           2
##
## $`2`
##   a good_factor
## 2 b           1
## 4 d           1
## 6 f           2
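One way to guard against this is to check the lengths yourself before splitting. The following is a minimal sketch, not a base R feature: split_strict() is a hypothetical helper name, and it assumes the splitting variable is a single vector or factor (not a list of factors). It stops with an error unless there is exactly one splitting value per row.

```r
# Hypothetical helper (not part of base R): refuse to split unless the
# splitting vector has exactly one entry per row of the data.frame.
# Assumes f is a single vector or factor, not a list of factors.
split_strict <- function(df, f, ...) {
  stopifnot(is.data.frame(df), length(f) == nrow(df))
  split(df, f, ...)
}

test_df <- data.frame(a = letters[1:6], good_factor = c(1, 1, 1, 1, 2, 2))
bad_factor <- c(1, 2)

# The full-length factor splits as intended
good_split <- split_strict(test_df, test_df$good_factor)

# The short factor now raises an error instead of being recycled
bad_result <- tryCatch(
  split_strict(test_df, bad_factor),
  error = function(e) "length mismatch caught"
)
```

With this wrapper the accidental use of bad_factor fails loudly, whereas the plain split() call above returns a plausible-looking but wrong result.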
Idiosyncrasy 2: NA levels are dropped from the data
It’s very unlikely I want to throw away data unless I make a specific request to do so, e.g. using keep or filter.
However, split() assumes that you never want to keep an NA level and just drops it during the splitting process.
In the following example, ideally I want 3 groups representing the 3 levels within good_factor, i.e. 1, 2, and NA. However, split() just throws away all the rows where good_factor is NA.
test_df <- data.frame(a = letters[1:6], good_factor = c(1, 1, 2, 2, NA, NA))
split(test_df, test_df$good_factor)
## $`1`
##   a good_factor
## 1 a           1
## 2 b           1
##
## $`2`
##   a good_factor
## 3 c           2
## 4 d           2
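If the NA rows should survive, one possible workaround is to make NA an explicit factor level before splitting. This is a sketch using base R's addNA(), and the variable name with_na is only for illustration. Other approaches, such as factor(x, exclude = NULL), should behave similarly.

```r
test_df <- data.frame(a = letters[1:6], good_factor = c(1, 1, 2, 2, NA, NA))

# addNA() converts its argument to a factor and adds NA as an explicit level,
# so split() treats the NA rows as a group of their own
with_na <- split(test_df, addNA(test_df$good_factor))

length(with_na)  # number of groups, now including the NA group
with_na[[3]]     # the NA group; its list name is NA, so index it by position
```

The name of the extra list element is itself NA, so it is easier to reach by position than by name.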