Base64 encoding/decoding in plain R

Introduction

As part of working on {svgparser} I am adding support for the <image> tag in SVGs.

<image> tags can reference an image as a URL, or they can also contain an inline copy of the image encoded in an “ASCII-safe” format.

Base64 encoding is one commonly supported method for converting raw bytes into ASCII character strings.

The following code is a mini-exploration into how Base64 encoding/decoding could be done quickly(?) in base R.

TL;DR - base64 encoding/decoding can be done easily in base R, but not quickly.
I’ll be better off using one of the packages with C code for this task.

Not really a surprising result, but this was a fun diversion.

Base64 Encoding/Decoding

A high level description of Base64 encoding:

  1. View the sequence of raw bytes as a sequence of raw bits
  2. Group the bits six at a time
  3. Interpret each 6-bit group as an integer (0-63)
  4. Use this integer as an index into a set of 64 safe ASCII characters
    • 26 uppercase letters
    • 26 lowercase letters
    • 10 digits
    • ‘+’ and ‘/’ make up the last two characters
  5. Combine all these characters into a single string

Decoding is then just the reverse of this process.
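
As a small worked example of the steps above, here is a sketch that encodes the three bytes of "Man" by hand. It is only an illustration (the names `bytes`, `groups`, `values` and `alphabet` are mine, and it skips padding because the input is exactly three bytes long).

```r
# The three bytes of "Man": 4d 61 6e
bytes <- charToRaw("Man")

# Step 1: view the bytes as one long bit sequence, most significant bit first.
# intToBits() returns 32 bits per integer, least significant bit first, so
# keep the low 8 bits of each byte and reverse them.
bits <- unlist(lapply(
  as.integer(bytes),
  function(b) rev(as.integer(intToBits(b))[1:8])
))

# Step 2: group the bits six at a time (one group per column)
groups <- matrix(bits, nrow = 6)

# Step 3: interpret each 6-bit group as an integer in 0-63
values <- colSums(groups * c(32, 16, 8, 4, 2, 1))

# Step 4: index into the 64-character set (R vectors are 1-based)
alphabet <- c(LETTERS, letters, 0:9, "+", "/")

# Step 5: combine the characters into a single string
result <- paste(alphabet[values + 1], collapse = "")
result
#> [1] "TWFu"
```

The 24 input bits split into the 6-bit values 19, 22, 5 and 46, which map to 'T', 'W', 'F' and 'u'.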

Note that there is some messy stuff to do with padding that I’m not going to describe here, but the code below mostly(?) deals with this.
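
For reference only, the arithmetic behind the padding can be sketched as follows, assuming standard Base64 where the output is always a multiple of four characters. The variable names are illustrative and not part of the code further down.

```r
# Input lengths in bytes
n_bytes <- 0:6

# Number of '=' characters appended: 0 when the input is a whole number of
# 3-byte groups, otherwise 1 or 2
n_pad <- (3L - n_bytes %% 3L) %% 3L

# Total number of output characters, padding included
n_chars <- 4L * ((n_bytes + 2L) %/% 3L)

data.frame(n_bytes, n_pad, n_chars)
```

For example, a 10-byte input leaves one byte over after three full groups, so it needs two '=' characters and produces 16 characters in total.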

Base64 encoding/decoding support in existing R packages

The {openssl} package and the {base64enc} package both support fast encoding/decoding of Base64 values.

# input data should be raw bytes
data <- as.raw(sample(1:10))
data
#>  [1] 09 04 07 01 02 05 03 0a 06 08
# Encode to a single Base64 encoded string
b64 <- openssl::base64_encode(data)
b64
#> [1] "CQQHAQIFAwoGCA=="
# Decode the string back into raw data
openssl::base64_decode(b64)
#>  [1] 09 04 07 01 02 05 03 0a 06 08
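
The same round trip with {base64enc} might look like the sketch below. It assumes the package is installed (hence the `requireNamespace()` guard) and reuses the byte values printed above rather than a fresh random sample.

```r
# Same bytes as the {openssl} example above
data <- as.raw(c(9, 4, 7, 1, 2, 5, 3, 10, 6, 8))

if (requireNamespace("base64enc", quietly = TRUE)) {
  # Encode to a single Base64 encoded string
  b64 <- base64enc::base64encode(data)
  print(b64)

  # Decode the string back into raw data and check the round trip
  roundtrip <- base64enc::base64decode(b64)
  stopifnot(identical(roundtrip, data))
}
```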

Vanilla R code for Base64 Encoding/Decoding

#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Create lookup table to convert characters to their 6-bit-integer values
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
char_to_int <- function(vec) {
  unname(vapply(as.character(vec), utf8ToInt, integer(1)))
}


lookup_names <- c(LETTERS, letters, 0:9, '+', '/', '=')
lookup_values <- c(
  char_to_int(LETTERS) - char_to_int('A'),
  char_to_int(letters) - char_to_int('a') + 26L,
  char_to_int(0:9)     - char_to_int('0') + 52L,
  62L,
  63L,
  0L
)

lookup <- setNames(lookup_values, lookup_names)
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#' Decode a base64 string to a vector of raw bytes
#'
#' @param b64 Single character string containing base64 encoded values
#'
#' @return raw vector
#'
#' @examples
#' b64 <- 'iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAQAAAC1HAwCAAAAC0lEQVR42mNk+A8AAQUBAScY42YAAAAASUVORK5CYII='
#' decode_base64(b64)
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
decode_base64 <- function(b64) {
  #~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  # Get an integer 6-bit value for each of the characters in the string
  #~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  chars    <- strsplit(b64, '')[[1]]
  six_bits <- lookup[chars]

  #~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  # Explode these integers into their individual bit values (32 bits per int)
  #~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  bits <- intToBits(six_bits)

  #~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  # Convert to 32 row matrix
  # Truncate to 6-row matrix (ignoring bits 7-32).
  # Then reshape to 8-row matrix.
  # Note that 'intToBits()' output is little-endian, so switch it here to
  # big endian for easier logic
  #~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  mat <- matrix(as.integer(bits), nrow = 32)[6:1,]
  N <- length(mat)
  stopifnot(N %% 8 == 0)
  dim(mat) <- c(8, N/8)

  #~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  # Convert bits to bytes by multiplying out rows by 2^N and summing
  # along columns (i.e. each column is a bit-pattern for an 8-bit number)
  #~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  raw_vec <- as.raw(colSums(mat * c(128L, 64L, 32L, 16L, 8L, 4L, 2L, 1L)))

  #~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  # Trim padded characters
  #~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  if (endsWith(b64, "==")) {
    length(raw_vec) <- length(raw_vec) - 2L
  } else if (endsWith(b64, "=")) {
    length(raw_vec) <- length(raw_vec) - 1L
  }

  raw_vec
}
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#' Encode a raw vector to a base64 encoded character string
#'
#' @param raw_vec raw vector
#'
#' @return single character string containing base64 encoded values
#'
#' @examples
#' encode_base64(as.raw(1:20))
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
encode_base64 <- function(raw_vec) {
  stopifnot(is.raw(raw_vec))

  # Work out if we need to pad the input to a multiple of 3 bytes
  # (24 bits, i.e. a whole number of 6-bit groups)
  npad <- 3L - (length(raw_vec) %% 3L)
  if (npad %in% 1:2) {
    length(raw_vec) <- length(raw_vec) + npad
  }

  # Create an 8 row matrix.  Each column is the bit-vector for an 8-bit number
  int <- as.integer(raw_vec)
  res <- as.integer(bitwAnd(rep(int, each = 8),  c(128L, 64L, 32L, 16L, 8L, 4L, 2L, 1L)) > 0)
  mat <- matrix(res, nrow = 8)

  # Reshape to a 6-row matrix (i.e. 6-bit numbers)
  N <- length(mat)
  stopifnot(N %% 6 == 0)
  dim(mat) <- c(6, N/6)

  # Calculate the 6-bit numbers
  mat <- mat * c(32L, 16L, 8L, 4L, 2L, 1L)
  values <- colSums(mat)

  # Find the letter which is associated with each 6-bit number
  # and paste together into a string
  chars <- lookup_names[values + 1L]
  b64 <- paste(chars, collapse = "")

  # Replace the characters that came from padding bytes with '=' signs
  if (npad == 1) {
    b64 <- sub(".$", "=", b64)
  } else if (npad == 2) {
    b64 <- sub("..$", "==", b64)
  }

  b64
}
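
The reshaping trick used in decode_base64() can be seen in isolation in the sketch below. It starts from four hand-picked 6-bit values (the ones for the string "TWFu") instead of going through the lookup table, and there is no padding to trim.

```r
# 6-bit values for the characters 'T', 'W', 'F', 'u'
six_bits <- c(19L, 22L, 5L, 46L)

# 32 bits per integer, least significant bit first
bits <- intToBits(six_bits)

# One column per integer. Keep rows 6:1, i.e. the low six bits,
# reordered so the most significant bit comes first.
mat <- matrix(as.integer(bits), nrow = 32)[6:1, ]

# The 24 bits are stored column by column, so changing the dimensions
# regroups them into three 8-bit columns without moving any data
dim(mat) <- c(8, length(mat) / 8)

# Weight each row by its power of two and sum down the columns
bytes <- as.raw(colSums(mat * c(128, 64, 32, 16, 8, 4, 2, 1)))
rawToChar(bytes)
#> [1] "Man"
```

This works because R stores matrices in column-major order: the 6-row and 8-row matrices are two views of the same bit stream.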

Check results agree with {openssl}

for (i in seq(100)) {
  data    <- as.raw(sample(i))
  b64_me  <- encode_base64(data)
  b64_ref <- openssl::base64_encode(data)
  
  # Does my base64 string agree with `openssl`?
  stopifnot(identical(b64_me, b64_ref))
  
  # Does the decoded value match the original data?
  decoded <- decode_base64(b64_ref)
  stopifnot(identical(data, decoded))
}

print("All good!")
#> [1] "All good!"

Rough encoding speed comparison

Simple timing shows that, by median time, the R code is roughly 2x to 80x slower than the C code, and the gap widens as the input grows. Base R code is only competitive at really small sizes.

No real surprises.

res <- bench::press(
  N = c(10, 100, 10000),
  {
    data <- as.raw(sample(N) %% 256)
    bench::mark(
      openssl::base64_encode(data),
      encode_base64(data)
    )
  }
)

Table 1: Speed timing

expression                     N      min       median     itr/sec      mem_alloc
openssl::base64_encode(data)   10     22.31µs   28.02µs    27978.7464   0B
encode_base64(data)            10     48.68µs   59.7µs     14708.8449   2.53KB
openssl::base64_encode(data)   100    22.6µs    29.17µs    28753.0286   192B
encode_base64(data)            100    91.39µs   122.42µs   6996.1620    25.05KB
openssl::base64_encode(data)   10000  57.08µs   74.45µs    11249.6782   13.08KB
encode_base64(data)            10000  5.17ms    6.22ms     160.5602     2.36MB
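
To put a number on the slowdown, the median timings from Table 1 can be divided directly. This is just arithmetic on the values already reported above (converted to microseconds); the data frame and its column names are my own.

```r
# Median times from Table 1, in microseconds
timing <- data.frame(
  N       = c(10, 100, 10000),
  openssl = c(28.02, 29.17, 74.45),
  base_r  = c(59.7, 122.42, 6220)
)

# How many times slower the base R version is at each input size
timing$slowdown <- round(timing$base_r / timing$openssl, 1)
timing
#>       N openssl  base_r slowdown
#> 1    10   28.02   59.70      2.1
#> 2   100   29.17  122.42      4.2
#> 3 10000   74.45 6220.00     83.5
```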