Base R split has issues - Part 2 idiosyncrasies

Base R `split()`

My previous post on tidyverse split-apply-combine ended with me favouring split + map_dfr as a replacement for group_by + do. In this post I look at some idiosyncrasies of the split function from base R.

The two main gotchas with base R split() are:

the splitting variable gets recycled if it’s not as long as the data.frame being split
NA levels are dropped from the data

Idiosyncrasy 1: the splitting variable gets recycled if it’s not long enough

When you split() a data.frame, the splitting factor needs one element per row of the data.frame.

If the splitting factor is shorter than the data, then split() will assume (foolishly!) that you want to keep recycling the factor until every row has been assigned a group. Almost nobody ever wants this behaviour!

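R warns about a length mismatch only when the number of rows is not a multiple of the factor's length, and even then it goes ahead and splits anyway. When the rows divide evenly, as in the example that follows (6 rows, factor of length 2), there is no warning at all.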

In the following example, note how the rows are assigned alternately to group 1 and group 2, as split() keeps cycling through the bad_factor variable until it has worked through the whole data.frame.

test_df <- data.frame(a = letters[1:6], good_factor=c(1, 1, 1, 1, 2, 2))

bad_factor <- c(1, 2)

split(test_df, bad_factor)  # oops I've used bad_factor by mistake!

## $`1`
##   a good_factor
## 1 a           1
## 3 c           1
## 5 e           2
## 
## $`2`
##   a good_factor
## 2 b           1
## 4 d           1
## 6 f           2
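
One way to guard against this is to check the lengths yourself before splitting. The following is only a minimal sketch: strict_split is a hypothetical helper name of my own, not a base R or tidyverse function, and it assumes the splitting variable is a single vector rather than a list of factors.

```r
# Hypothetical helper (not part of base R): refuse to split when the
# splitting variable and the data.frame disagree on length.
# Assumes f is a single vector or factor, not a list of factors.
strict_split <- function(df, f, ...) {
  stopifnot(is.data.frame(df))
  if (length(f) != nrow(df)) {
    stop("splitting variable has length ", length(f),
         " but the data.frame has ", nrow(df), " rows")
  }
  split(df, f, ...)
}

test_df <- data.frame(a = letters[1:6], good_factor = c(1, 1, 1, 1, 2, 2))
bad_factor <- c(1, 2)

# Splits as intended: 4 rows in group 1, 2 rows in group 2
groups <- strict_split(test_df, test_df$good_factor)

# Fails loudly instead of recycling; the message is captured here
err <- tryCatch(strict_split(test_df, bad_factor),
                error = function(e) conditionMessage(e))
```

With this wrapper, using bad_factor by mistake stops with an error rather than silently producing the alternating groups shown above.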

Idiosyncrasy 2: NA levels are dropped from the data

It’s very unlikely that I want to throw away data unless I specifically ask to, e.g. by using keep or filter.

However, split() assumes that you never want to keep an NA level and just drops it during the splitting process.

In the following example, ideally I want 3 groups representing the 3 levels within good_factor, i.e. 1, 2, and NA. However, split just throws away all the data where good_factor is NA.

test_df <- data.frame(a = letters[1:6], good_factor=c(1, 1, 2, 2, NA, NA))
split(test_df, test_df$good_factor)

## $`1`
##   a good_factor
## 1 a           1
## 2 b           1
## 
## $`2`
##   a good_factor
## 3 c           2
## 4 d           2
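
If the NA rows need to survive the split, one possible workaround is to make NA an explicit factor level before splitting. This sketch uses base R's addNA() for that; it is just one option, and the variable name groups_with_na is mine.

```r
test_df <- data.frame(a = letters[1:6], good_factor = c(1, 1, 2, 2, NA, NA))

# addNA() converts the vector to a factor with NA as an explicit level,
# so split() has a group to put those rows in.
groups_with_na <- split(test_df, addNA(test_df$good_factor))

length(groups_with_na)      # three groups rather than two
nrow(groups_with_na[[3]])   # the two NA rows are retained
```

The third element of the list has NA as its name, so it is easier to reach by position (as above) than by name.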
