Introduction
As part of working on {svgparser}, I am adding support for the <image> tag in SVGs.
<image> tags can reference an image as a URL, or they can contain an inline copy of the image encoded in an “ASCII-safe” format.
Base64 encoding is one commonly supported method for converting raw bytes into ASCII character strings.
The following code is a mini-exploration into how Base64 encoding/decoding could be done quickly(?) in base R.
TL;DR - base64 encoding/decoding can be done easily in base R, but not quickly.
I’ll be better off using one of the packages with C code for this task.
Not really a surprising result, but this was a fun diversion.
Base64 Encoding/Decoding
A high-level description of Base64 encoding:
- View the sequence of raw bytes as a sequence of raw bits.
- Group the bits six at a time.
- Interpret each 6-bit group as an integer (0-63).
- Use this integer as an index into a set of 64 safe ASCII characters:
  - 26 uppercase letters
  - 26 lowercase letters
  - 10 digits
  - ‘+’ and ‘/’ make up the last two characters
- Combine all these characters into a single string.
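To make these steps concrete, here is a minimal sketch in base R that encodes the three bytes of the string "Man". The variable names are just for illustration, and the alphabet assumes standard Base64 (ending in '+' and '/'). Three bytes is exactly 24 bits, so no padding is needed in this case.

```r
# Illustrative walk-through only: encode the 3 bytes of "Man" step by step
bytes <- charToRaw("Man")

# Step 1: view each byte as 8 bits, most significant bit first (8 x 3 matrix)
bit_mat <- vapply(
  as.integer(bytes),
  function(b) rev(as.integer(intToBits(b))[1:8]),
  integer(8)
)
bits <- as.vector(bit_mat)          # 24 bits in stream order

# Step 2: group the bits six at a time (6 x 4 matrix, one column per group)
groups <- matrix(bits, nrow = 6)

# Step 3: interpret each 6-bit group as an integer in 0-63
idx <- as.integer(colSums(groups * 2^(5:0)))
idx
#> [1] 19 22  5 46

# Step 4: use each integer as a (0-based) index into the 64-character alphabet
alphabet <- c(LETTERS, letters, 0:9, "+", "/")

# Step 5: combine the characters into a single string
encoded <- paste(alphabet[idx + 1L], collapse = "")
encoded
#> [1] "TWFu"
```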
Decoding is then just the reverse of this process.
Note that there is some messy stuff to do with padding that I’m not going to describe here, but the code below mostly(?) deals with this.
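The padding arithmetic itself is small enough to sketch. Assuming standard Base64 padding, every 3 input bytes become 4 output characters, and a final group that is 1 or 2 bytes short is marked with '=' characters. The variable names here are only for illustration.

```r
# Number of '=' characters and total output length for inputs of 1 to 6 bytes
n_bytes <- 1:6
n_pad   <- (3L - n_bytes %% 3L) %% 3L              # 0, 1 or 2 padding characters
n_chars <- 4L * as.integer(ceiling(n_bytes / 3))   # output is always a multiple of 4
data.frame(n_bytes, n_pad, n_chars)
#>   n_bytes n_pad n_chars
#> 1       1     2       4
#> 2       2     1       4
#> 3       3     0       4
#> 4       4     2       8
#> 5       5     1       8
#> 6       6     0       8
```

By the same rule, a 10-byte input needs 2 padding characters and produces a 16-character string.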
Base64 encoding/decoding support in existing R packages
The {openssl} package and the {base64enc} package both support fast encoding/decoding of Base64 values. The example below uses {openssl}:
# input data should be raw bytes
data <- as.raw(sample(1:10))
data
#> [1] 09 04 07 01 02 05 03 0a 06 08
# Encode to a single Base64 encoded string
b64 <- openssl::base64_encode(data)
b64
#> [1] "CQQHAQIFAwoGCA=="
# Decode the string back into raw data
openssl::base64_decode(b64)
#> [1] 09 04 07 01 02 05 03 0a 06 08
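For completeness, a similar round trip should be possible with {base64enc}. This is a small sketch that assumes the package is installed (it is skipped otherwise) and uses its `base64encode()` and `base64decode()` functions on a fixed raw vector rather than a random sample.

```r
# Fixed input so the example is reproducible
data <- as.raw(c(9, 4, 7, 1, 2, 5, 3, 10, 6, 8))

# Only run the round trip if {base64enc} is available (assumption: it may not be)
round_trip_ok <- NA
if (requireNamespace("base64enc", quietly = TRUE)) {
  b64     <- base64enc::base64encode(data)   # raw vector -> single string
  decoded <- base64enc::base64decode(b64)    # string -> raw vector
  round_trip_ok <- identical(decoded, data)
}
round_trip_ok
```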
Vanilla R code for Base64 Encoding/Decoding
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Create lookup table to convert characters to their 6-bit-integer values
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
char_to_int <- function(vec) {
  unname(vapply(as.character(vec), utf8ToInt, integer(1)))
}
lookup_names <- c(LETTERS, letters, 0:9, '+', '/', '=')
lookup_values <- c(
  char_to_int(LETTERS) - char_to_int('A'),
  char_to_int(letters) - char_to_int('a') + 26L,
  char_to_int(0:9) - char_to_int('0') + 52L,
  62L,
  63L,
  0L
)
lookup <- setNames(lookup_values, lookup_names)
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#' Decode a base64 string to a vector of raw bytes
#'
#' @param b64 Single character string containing base64 encoded values
#'
#' @return raw vector
#'
#' @examples
#' b64 <- 'iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAQAAAC1HAwCAAAAC0lEQVR42mNk+A8AAQUBAScY42YAAAAASUVORK5CYII='
#' decode_base64(b64)
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
decode_base64 <- function(b64) {
  #~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  # Get an integer 6-bit value for each of the characters in the string
  #~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  chars <- strsplit(b64, '')[[1]]
  six_bits <- lookup[chars]
  #~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  # Explode these integers into their individual bit values (32 bits per int)
  #~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  bits <- intToBits(six_bits)
  #~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  # Convert to 32 row matrix
  # Truncate to 6-row matrix (ignoring bits 7-32).
  # Then reshape to 8-row matrix.
  # Note that 'intToBits()' output is little-endian, so switch it here to
  # big endian for easier logic
  #~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  mat <- matrix(as.integer(bits), nrow = 32)[6:1,]
  N <- length(mat)
  stopifnot(N %% 8 == 0)
  dim(mat) <- c(8, N/8)
  #~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  # Convert bits to bytes by multiplying out rows by 2^N and summing
  # along columns (i.e. each column is a bit-pattern for an 8-bit number)
  #~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  raw_vec <- as.raw(colSums(mat * c(128L, 64L, 32L, 16L, 8L, 4L, 2L, 1L)))
  #~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  # Trim the bytes that came from '=' padding characters
  #~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  if (endsWith(b64, "==")) {
    length(raw_vec) <- length(raw_vec) - 2L
  } else if (endsWith(b64, "=")) {
    length(raw_vec) <- length(raw_vec) - 1L
  }
  raw_vec
}
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#' Encode a raw vector to a base64 encoded character string
#'
#' @param raw_vec raw vector
#'
#' @return single character string containing base64 encoded values
#'
#' @examples
#' encode_base64(as.raw(1:20))
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
encode_base64 <- function(raw_vec) {
  stopifnot(is.raw(raw_vec))
  # work out if we need to pad the input to a 3-byte (24-bit) boundary,
  # so that the total number of bits is divisible by 6
  npad <- 3L - (length(raw_vec) %% 3L)
  if (npad %in% 1:2) {
    length(raw_vec) <- length(raw_vec) + npad
  }
  # Create an 8 row matrix. Each column is the bit-vector for an 8-bit number
  int <- as.integer(raw_vec)
  res <- as.integer(bitwAnd(rep(int, each = 8), c(128L, 64L, 32L, 16L, 8L, 4L, 2L, 1L)) > 0)
  mat <- matrix(res, nrow = 8)
  # Reshape to a 6-row matrix (i.e. 6-bit numbers)
  N <- length(mat)
  stopifnot(N %% 6 == 0)
  dim(mat) <- c(6, N/6)
  # Calculate the 6-bit numbers
  mat <- mat * c(32L, 16L, 8L, 4L, 2L, 1L)
  values <- colSums(mat)
  # Find the letter which is associated with each 6-bit number
  # and paste together into a string
  chars <- lookup_names[values + 1L]
  b64 <- paste(chars, collapse = "")
  # Replace the characters produced by the padding bytes with '=' signs
  if (npad == 1) {
    b64 <- gsub(".$", "=", b64)
  } else if (npad == 2) {
    b64 <- gsub("..$", "==", b64)
  }
  b64
}
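The decoder relies on two small tricks that are easier to see in isolation: `intToBits()` returns the least-significant bit first, and a 6-row bit matrix can be re-read as an 8-row bit matrix simply by changing its dimensions. Here is a minimal sketch of both, using the four 6-bit values that correspond to the string "TWFu" (the variable names are just for illustration).

```r
# intToBits() gives 32 bits per integer, least-significant bit first
bits <- as.integer(intToBits(5L))
little_endian <- bits[1:6]   # 1 0 1 0 0 0
big_endian    <- bits[6:1]   # 0 0 0 1 0 1  (i.e. binary 000101 = 5)

# Four 6-bit values (the alphabet indices of "T", "W", "F", "u")
six_bit_values <- c(19L, 22L, 5L, 46L)

# 32 x 4 bit matrix, keep the low 6 bits of each value in big-endian order
mat <- matrix(as.integer(intToBits(six_bit_values)), nrow = 32)[6:1, ]

# 24 bits in total: re-read the same bit stream as three 8-bit columns
dim(mat) <- c(8, 3)
bytes <- colSums(mat * c(128L, 64L, 32L, 16L, 8L, 4L, 2L, 1L))
rawToChar(as.raw(bytes))
#> [1] "Man"
```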
Check results agree with {openssl}
for (i in seq(100)) {
  data <- as.raw(sample(i))
  b64_me <- encode_base64(data)
  b64_ref <- openssl::base64_encode(data)
  # Does my base64 string agree with `openssl`?
  stopifnot(identical(b64_me, b64_ref))
  # Does the decoded value match the original data?
  decoded <- decode_base64(b64_ref)
  stopifnot(identical(data, decoded))
}
print("All good!")
#> [1] "All good!"
Rough encoding speed comparison
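Simple timing shows that, by median time, the R code is roughly 2x slower than the C code for 10 bytes, about 4x slower for 100 bytes, and over 80x slower for 10,000 bytes. It also allocates far more memory (2.36MB vs 13.08KB at 10,000 bytes). Base R code is only competitive at really small sizes.
No real surprises.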
res <- bench::press(
  N = c(10, 100, 10000),
  {
    data <- as.raw(sample(N) %% 256)
    bench::mark(
      openssl::base64_encode(data),
      encode_base64(data)
    )
  }
)
expression | N | min | median | itr/sec | mem_alloc |
---|---|---|---|---|---|
openssl::base64_encode(data) | 10 | 22.31µs | 28.02µs | 27978.7464 | 0B |
encode_base64(data) | 10 | 48.68µs | 59.7µs | 14708.8449 | 2.53KB |
openssl::base64_encode(data) | 100 | 22.6µs | 29.17µs | 28753.0286 | 192B |
encode_base64(data) | 100 | 91.39µs | 122.42µs | 6996.1620 | 25.05KB |
openssl::base64_encode(data) | 10000 | 57.08µs | 74.45µs | 11249.6782 | 13.08KB |
encode_base64(data) | 10000 | 5.17ms | 6.22ms | 160.5602 | 2.36MB |
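Much of the memory in the base R encoder goes into expanding every byte into 8 separate integers. One possible alternative, sketched here and not benchmarked, is to skip the bit matrix and use integer arithmetic on 3-byte groups instead. The function name `encode_base64_arith()` is hypothetical, and this is a different technique from the encoder above, not a tuned version of it.

```r
# Base64 alphabet: 0-based index 0..63 maps onto these characters
b64_alphabet <- c(LETTERS, letters, 0:9, "+", "/")

# Hypothetical alternative encoder using integer arithmetic (no bit matrix)
encode_base64_arith <- function(raw_vec) {
  stopifnot(is.raw(raw_vec))
  n <- length(raw_vec)
  if (n == 0L) return("")

  # Pad with zero bytes up to a multiple of 3 bytes
  npad  <- (3L - n %% 3L) %% 3L
  bytes <- c(as.integer(raw_vec), integer(npad))

  # Each column holds one 3-byte group; combine it into a 24-bit integer
  m    <- matrix(bytes, nrow = 3)
  word <- m[1, ] * 65536L + m[2, ] * 256L + m[3, ]

  # Split each 24-bit integer into four 6-bit values (one column per group)
  sextets <- rbind(
    word %/% 262144L,
    (word %/% 4096L) %% 64L,
    (word %/% 64L) %% 64L,
    word %% 64L
  )

  # Column-major indexing keeps the characters in stream order
  chars <- b64_alphabet[sextets + 1L]

  # Characters produced purely by padding bytes become '='
  if (npad > 0L) {
    n_chars <- length(chars)
    chars[(n_chars - npad + 1L):n_chars] <- "="
  }
  paste(chars, collapse = "")
}

encode_base64_arith(as.raw(c(9, 4, 7, 1, 2, 5, 3, 10, 6, 8)))
#> [1] "CQQHAQIFAwoGCA=="
```

Even if this turned out to allocate less, I would still expect the C-backed packages to win comfortably.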