Run-Length Encoding and The Problem of NAs

Run-Length Encoding

Run-length encoding computes the lengths and values of runs of equal values in a vector

E.g. given a vector c(1, 1, 1, 8, 9, 9), run-length encoding will summarise the data to say that there are “three 1s, one 8 and two 9s”.

In R, this can be done with base::rle()

rle(c(1, 1, 1, 8, 9, 9))
Run Length Encoding
  lengths: int [1:3] 3 1 2
  values : num [1:3] 1 8 9

A problem with base::rle() - NA values

base::rle() checks for equality of one value and the next using ==, and if they are different then it signals the end of a run of values, and the start of the next.

The problem with == is that NA values are never considered to equal each other, which leads to multiple NA values in a row not being considered as a run. E.g.

rle(c(1, 1, 1, NA, NA, NA, 2, 2, 2, 2))
Run Length Encoding
  lengths: int [1:5] 3 1 1 1 4
  values : num [1:5] 1 NA NA NA 2

The logic used to explain this is usually along the lines of “NA values are unknown and so would never be equal to each other, so therefore it makes sense they are not considered as a run of identical values”.

That’s true: NA is never equal to anything:

NA == NA
[1] NA
NA == 1
[1] NA
NA == Inf
[1] NA

Rebuttals:

# While not equal to each other, NAs are identical()
identical(NA, NA)
[1] TRUE
# NAs are considered the same in unique()
unique(c(1, 1, 1, 2, 2, NA, NA, NA))
[1]  1  2 NA

This NA handling issue has been raised on the R-devel list many years ago 2011 and today 2020.

Also, Gabe Becker notes that the run-length encoding function in the Bioconductor - S4Vectors package does group the NAs.

I want NAs considered as equal to each other when doing run-length encoding

I acknowledge that NAs are never equal to each other or anything else, but when I do run-length encoding I want them to be grouped together.

I’d be interested to find a case when someone is doing run-length encoding and didn’t want the NAs grouped together. Hit me up on twitter if you can think of one!

rle2() - A drop-in replacement for rle() that handles NAs sanely

rle2() is a function for run-length encoding that treats NAs as identical, and thus groups them together.

This is a copy of the body of base::rle() with just 1 extra line and 2 changed lines to handle NA values sanely.

#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#' A drop-in replacement for \code{base::rle()} that treats all NAs as identical
#'
#' @param x an atomic vector
#'
#' @return An object of class \code{rle}
#'
#' @export
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
rle2 <- function (x)  {
  stopifnot("'x' must be a vector of an atomic type" = is.atomic(x))

  n <- length(x)
  if (n == 0L) {
    return(structure(list(
      lengths = integer(), values = x)
    ), class = 'rle')
  }

  #~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  # Where does next value not equal current value?
  # i.e. y will be TRUE at the last index before a change
  #~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  y <- (x[-1L] != x[-n])

  #~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  # Since NAs are never equal to anything, NAs in 'x' will lead to NAs in 'y'.
  # These current NAs in 'y' tell use nothing - Set all NAs in y to FALSE
  #~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  y[is.na(y)] <- FALSE

  #~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  # When a value in x is NA and its successor is not (or vice versa) then that
  # should also count as a value change and location in 'y' should be set to TRUE
  #~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  y <- y | xor(is.na(x[-1L]), is.na(x[-n]))

  #~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  # Any TRUE locations in 'y' are the end of a run, where the next value
  # changes to something different
  #~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  i <- c(which(y), n)

  structure(list(
    lengths = diff(c(0L, i)),
    values  = x[i]
  ), class = 'rle')
}
x <- c(1, 1, 1, NA, NA, NA, NA, 10, 10)
rle(x)
Run Length Encoding
  lengths: int [1:6] 3 1 1 1 1 2
  values : num [1:6] 1 NA NA NA NA 10
rle2(x)
Run Length Encoding
  lengths: int [1:3] 3 4 2
  values : num [1:3] 1 NA 10