{serializer}
serializer
is a package that demonstrates how to use R’s internal
serialization interface from C. The code is the minimum amount
required to do this, and I’ve inserted plenty of comments for guidance.
This package was developed to help me figure out the serialization process in R. It is probably only of interest if you want to look at and/or steal the C code. It’s under the MIT license, so please feel free to reuse it in your own projects.
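To give a flavour of the approach (this is a minimal sketch, not the package’s actual source), the core pattern is to fill in an R_outpstream_st with two callbacks that receive bytes, and then hand the object to R_Serialize(). The write_buffer type, the marshall_() entry point, and the choice of binary format with serialization version 3 below are illustrative assumptions, not necessarily the package’s exact choices.

// Minimal sketch of serializing an R object to an in-memory buffer using
// R's internal serialization interface. Illustrative only - see the package
// source for the real implementation.
#include <R.h>
#include <Rinternals.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
  unsigned char *bytes;   // accumulated output
  R_xlen_t length;        // bytes written so far
  R_xlen_t capacity;      // allocated size of 'bytes'
} write_buffer;

static void ensure_space(write_buffer *buf, R_xlen_t extra) {
  while (buf->length + extra > buf->capacity) {
    buf->capacity *= 2;
    buf->bytes = (unsigned char *)realloc(buf->bytes, buf->capacity);
    if (buf->bytes == NULL) Rf_error("allocation failed");
  }
}

// Callback for single bytes (used mainly by the ASCII formats)
static void write_byte(R_outpstream_t stream, int c) {
  write_buffer *buf = (write_buffer *)stream->data;
  ensure_space(buf, 1);
  buf->bytes[buf->length++] = (unsigned char)c;
}

// Callback for blocks of bytes: R hands us a pointer and a length
static void write_bytes(R_outpstream_t stream, void *src, int length) {
  write_buffer *buf = (write_buffer *)stream->data;
  ensure_space(buf, length);
  memcpy(buf->bytes + buf->length, src, length);
  buf->length += length;
}

// .Call() entry point (illustrative name): serialize 'robj' to a raw vector
SEXP marshall_(SEXP robj) {
  write_buffer buf = { (unsigned char *)malloc(1024), 0, 1024 };
  if (buf.bytes == NULL) Rf_error("allocation failed");

  struct R_outpstream_st stream;
  R_InitOutPStream(
    &stream, (R_pstream_data_t)&buf,
    R_pstream_binary_format, 3,   // assumed format and serialization version
    write_byte, write_bytes,
    NULL, R_NilValue              // no persistence hook
  );

  R_Serialize(robj, &stream);     // walks 'robj', calling the callbacks

  SEXP res = PROTECT(Rf_allocVector(RAWSXP, buf.length));
  memcpy(RAW(res), buf.bytes, buf.length);
  free(buf.bytes);
  UNPROTECT(1);
  return res;
}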
You can install from GitHub with:
# install.packages('remotes')
remotes::install_github('coolbutuseless/serializer')
Simple Example
library(serializer)
robj <- sample(10)
v1 <- serializer::marshall(robj)
v1
[1] 42 0a 03 00 00 00 02 00 04 00 00 05 03 00 05 00 00 00 55 54 46 2d 38 0d 00
[26] 00 00 0a 00 00 00 09 00 00 00 04 00 00 00 07 00 00 00 01 00 00 00 02 00 00
[51] 00 05 00 00 00 03 00 00 00 0a 00 00 00 06 00 00 00 08 00 00 00
serializer::unmarshall(v1)
[1] 9 4 7 1 2 5 3 10 6 8
What’s the upper bound on serialization speed?
calc_marshalled_size()
calculates the size of a serialized object, but does not actually
create the serialized object.
Because it does no memory allocation and copies no bytes, the speed
of calc_marshalled_size()
gives an approximation of the maximum
throughput of the serialization process when using R’s internal serialization
mechanism.
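A size-only pass can be built on the same interface as the sketch above: the stream’s callbacks store nothing and simply add to a counter. The sketch below (again illustrative, with an assumed calc_size_() entry point rather than the package’s actual code) shows the idea.

// Sketch of a byte-counting output stream: R_Serialize() walks the object
// exactly as it would for real serialization, but the callbacks only
// accumulate a running total - no allocation, no copying.
#include <R.h>
#include <Rinternals.h>

static void count_byte(R_outpstream_t stream, int c) {
  *((double *)stream->data) += 1;
}

static void count_bytes(R_outpstream_t stream, void *src, int length) {
  *((double *)stream->data) += length;
}

// .Call() entry point (illustrative name): serialized size without serializing
SEXP calc_size_(SEXP robj) {
  double total = 0;

  struct R_outpstream_st stream;
  R_InitOutPStream(
    &stream, (R_pstream_data_t)&total,
    R_pstream_binary_format, 3,   // assumed format and version, as above
    count_byte, count_bytes,
    NULL, R_NilValue              // no persistence hook
  );

  R_Serialize(robj, &stream);     // walk the object, counting bytes only

  return Rf_ScalarReal(total);
}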
The speeds below seem ridiculous because, at its core, serialization is just passing pointers and lengths to an output stream, with very little actual memory allocation or copying.
N <- 1e7
obj1 <- data.frame(
  x = sample(N),
  y = runif(N)
)
obj2 <- do.call(rbind, replicate(1000, iris, simplify = FALSE))
obj3 <- sample(N)
obj4 <- sample(10)
n1 <- lobstr::obj_size(obj1)
n2 <- lobstr::obj_size(obj2)
n3 <- lobstr::obj_size(obj3)
n4 <- lobstr::obj_size(obj4)
res <- bench::mark(
  calc_marshalled_size(obj1),
  calc_marshalled_size(obj2),
  calc_marshalled_size(obj3),
  calc_marshalled_size(obj4),
  check = FALSE
)
library(dplyr)  # for mutate(), select() and the pipe

res %>%
  mutate(MB = round(c(n1, n2, n3, n4)/1024/1024)) %>%
  mutate(`GB/s` = round(MB/1024 / as.numeric(median), 1)) %>%
  mutate(`itr/sec` = round(`itr/sec`)) %>%
  select(expression, median, `itr/sec`, MB, `GB/s`) %>%
  knitr::kable(caption = "Maximum possible throughput of serialization")
Table: Maximum possible throughput of serialization

| expression                 | median  | itr/sec | MB  | GB/s   |
|----------------------------|---------|---------|-----|--------|
| calc_marshalled_size(obj1) | 11.42µs | 84786   | 114 | 9746.8 |
| calc_marshalled_size(obj2) | 6.23µs  | 163324  | 5   | 783.3  |
| calc_marshalled_size(obj3) | 7.36µs  | 137883  | 38  | 5043.4 |
| calc_marshalled_size(obj4) | 2.52µs  | 298593  | 0   | 0.0    |
Summary
R’s process of chopping up an object for serialization is fast.
The limiting factor in serialization speed is whatever the actual output stream does with the bytes it is handed.