# Run-Length Encoding

*Run-length encoding* computes the lengths and values of runs of equal
values in a vector

E.g. given a vector `c(1, 1, 1, 8, 9, 9)`

, run-length
encoding will summarise the data to say that there are “three 1s, one 8 and two
9s”.

In R, this can be done with `base::rle()`

`rle(c(1, 1, 1, 8, 9, 9))`

```
Run Length Encoding
lengths: int [1:3] 3 1 2
values : num [1:3] 1 8 9
```

# A problem with `base::rle()`

- NA values

`base::rle()`

checks for equality of one value and the next using `==`

, and if
they are different then it signals the end of a run of values, and the start
of the next.

The problem with `==`

is that `NA`

values are never considered to equal each
other, which leads to multiple `NA`

values in a row *not* being considered as a run. E.g.

`rle(c(1, 1, 1, NA, NA, NA, 2, 2, 2, 2))`

```
Run Length Encoding
lengths: int [1:5] 3 1 1 1 4
values : num [1:5] 1 NA NA NA 2
```

The logic used to explain this is usually along the lines of “NA values are unknown and so would never be equal to each other, so therefore it makes sense they are not considered as a run of identical values”.

That’s true: NA is never equal to anything:

`NA == NA`

`[1] NA`

`NA == 1`

`[1] NA`

`NA == Inf`

`[1] NA`

Rebuttals:

```
# While not equal to each other, NAs are identical()
identical(NA, NA)
```

`[1] TRUE`

```
# NAs are considered the same in unique()
unique(c(1, 1, 1, 2, 2, NA, NA, NA))
```

`[1] 1 2 NA`

This NA handling issue has been raised on the R-devel list many years ago 2011 and today 2020.

Also, Gabe Becker notes that the run-length encoding function in the Bioconductor - S4Vectors package does group the NAs.

# I want NAs considered as equal to each other when doing run-length encoding

I acknowledge that NAs are never equal to each other or anything else, but when I do run-length encoding I want them to be grouped together.

I’d be interested to find a case when someone is doing run-length encoding
and *didn’t* want the NAs grouped together. Hit me up on twitter if you can think of one!

`rle2()`

- A drop-in replacement for `rle()`

that handles NAs sanely

`rle2()`

is a function for run-length encoding that treats
NAs as identical, and thus groups them together.

This is a copy of the body of `base::rle()`

with just **1 extra line** and
**2** changed lines to handle NA values sanely.

```
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#' A drop-in replacement for \code{base::rle()} that treats all NAs as identical
#'
#' @param x an atomic vector
#'
#' @return An object of class \code{rle}
#'
#' @export
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
rle2 <- function (x) {
stopifnot("'x' must be a vector of an atomic type" = is.atomic(x))
n <- length(x)
if (n == 0L) {
return(structure(list(
lengths = integer(), values = x)
), class = 'rle')
}
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Where does next value not equal current value?
# i.e. y will be TRUE at the last index before a change
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
y <- (x[-1L] != x[-n])
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Since NAs are never equal to anything, NAs in 'x' will lead to NAs in 'y'.
# These current NAs in 'y' tell use nothing - Set all NAs in y to FALSE
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
y[is.na(y)] <- FALSE
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# When a value in x is NA and its successor is not (or vice versa) then that
# should also count as a value change and location in 'y' should be set to TRUE
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
y <- y | xor(is.na(x[-1L]), is.na(x[-n]))
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Any TRUE locations in 'y' are the end of a run, where the next value
# changes to something different
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
i <- c(which(y), n)
structure(list(
lengths = diff(c(0L, i)),
values = x[i]
), class = 'rle')
}
```

```
x <- c(1, 1, 1, NA, NA, NA, NA, 10, 10)
rle(x)
```

```
Run Length Encoding
lengths: int [1:6] 3 1 1 1 1 2
values : num [1:6] 1 NA NA NA NA 10
```

`rle2(x)`

```
Run Length Encoding
lengths: int [1:3] 3 4 2
values : num [1:3] 1 NA 10
```