zstdlite provides access to the very fast (and highly configurable) zstd
library for in-memory compression.
The scope of this package is limited: it aims to provide functions for direct compression of vectors
which contain raw, integer, real, complex or logical values. It does this by operating on the
data payload within the vectors, and gains significant speed by not serializing the R object
itself. If you want to
compress arbitrary R objects, you must first manually convert them into a raw
vector representation.
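As a minimal sketch of that manual conversion, one option is base R's `serialize()`, which returns a raw vector when `connection = NULL`. The object `obj` and the variable names are only illustrative, and the example assumes `zstd_compress()` and `zstd_decompress()` round-trip raw vectors as described in this README.

```r
library(zstdlite)

# Any R object; this list is just an illustration
obj <- list(a = 1:3, b = "text")

# Base R: convert the object to a raw vector, then compress the bytes
raw_in <- serialize(obj, connection = NULL)
packed <- zstd_compress(raw_in)

# Reverse the two steps to recover the object
restored <- unserialize(zstd_decompress(packed))
```

The serialization step costs time and memory, which is the overhead the direct vector path avoids.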
The zstd code currently provided with this package is v1.4.5.
zstdlite will compress the data payload within a numeric-ish vector, and not the
R object itself.
- As it is the data payload of the vector that is being compressed, this does not include any notion of the container for that data, i.e. dimensions or other attributes are not compressed with the data.
- Values must be of type: raw, integer, real, complex or logical.
- Decompressed values are always returned as a vector i.e. all dimensional information is lost during compression.
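Because attributes are dropped, anything you need (such as `dim`) has to be saved and reattached by hand. The following is a small sketch of that workaround, assuming the round-trip behaviour described in the list above; `saved_dim` is just an illustrative variable name.

```r
library(zstdlite)

m <- matrix(1:6, nrow = 2)

# Keep the attributes you care about before compressing
saved_dim <- dim(m)

# The round trip returns a plain integer vector
out <- zstd_decompress(zstd_compress(m))

# Reattach the dimensions manually
dim(out) <- saved_dim
```

The same approach applies to `names` or any other attribute you want to survive the round trip.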
You can install from GitHub with:
    # install.packages('remotes')
    remotes::install_github('coolbutuseless/zstdlite')
Compressing 1 million integers
zstdlite supports the direct compression of raw, integer, real, complex and logical vectors.
These vectors do not need to be serialized to a raw representation first; instead, the data payload within these vectors is compressed directly.
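    library(zstdlite)
    library(lz4lite)

    # 1 million integers drawn from 1:5 with unequal probabilities
    N <- 1e6
    input_ints <- sample(1:5, N, prob = (1:5)^2, replace = TRUE)

    # zstd: default level, and the highest level (zstd levels top out at 22)
    compressed_lo <- zstd_compress(input_ints)
    compressed_hi <- zstd_compress(input_ints, level = 22)

    # lz4: fast mode, and high-compression mode
    compressed_lo_lz4 <- lz4_compress(input_ints, acc = 1)
    compressed_hi_lz4 <- lz4_compress(input_ints, use_hc = TRUE, hc_level = 12)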
    library(zstdlite)

    res <- bench::mark(
      zstd_compress(input_ints, level = -5),
      zstd_compress(input_ints, level =  1),
      zstd_compress(input_ints, level =  3),
      zstd_compress(input_ints, level = 10),
      zstd_compress(input_ints, level = 22),
      lz4_compress(input_ints, acc = 1),
      lz4_compress(input_ints, use_hc = TRUE, hc_level = 12),
      check = FALSE
    )
| Package | Expression | Median time | Iterations/sec | MB/s | Compression ratio |
|---|---|---|---|---|---|
| zstdlite | zstd_compress(input_ints, level = -5) | 14.29ms | 69 | 266.9 | 0.150 |
| zstdlite | zstd_compress(input_ints, level = 1) | 14.5ms | 69 | 263.1 | 0.131 |
| zstdlite | zstd_compress(input_ints, level = 3) | 14.28ms | 70 | 267.1 | 0.131 |
| zstdlite | zstd_compress(input_ints, level = 10) | 87.92ms | 11 | 43.4 | 0.106 |
| zstdlite | zstd_compress(input_ints, level = 22) | 2.33s | 0 | 1.6 | 0.075 |
| lz4lite | lz4_compress(input_ints, acc = 1) | 6.4ms | 158 | 596.0 | 0.306 |
| lz4lite | lz4_compress(input_ints, use_hc = TRUE, hc_level = 12) | 10.87s | 0 | 0.4 | 0.122 |
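The throughput figures in the table (for example 266.9 for `level = -5`) appear consistent with dividing the uncompressed input size by the median time. The arithmetic below is a sketch of that reading, assuming 4-byte integers and 1 MB = 2^20 bytes; it is an inference from the numbers, not the benchmark's own code.

```r
# 1 million integers at 4 bytes each
input_bytes <- 1e6 * 4

# Median time for zstd_compress(input_ints, level = -5), in seconds
median_secs <- 14.29 / 1000

# Uncompressed megabytes processed per second
throughput <- input_bytes / 2^20 / median_secs
round(throughput, 1)
```

Under the same reading, the last column would be the compressed size divided by the uncompressed size, so smaller values mean better compression.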
Decompressing 1 million integers
    res <- bench::mark(
      zstd_decompress(compressed_lo),
      zstd_decompress(compressed_hi),
      lz4_decompress(compressed_lo_lz4),
      lz4_decompress(compressed_hi_lz4),
      check = FALSE
    )
Why only vectors of raw, integer, real, complex or logical?
R objects can be considered to consist of:
- a header - giving information like length and information for the garbage collector
- a body - data of some kind.
The vectors supported by zstdlite are those vectors whose body consists of
data that is directly interpretable as a contiguous sequence of bytes representing
numeric values.
Other R objects (like lists or character vectors) are really collections of pointers to other objects, and do not live in memory as a contiguous sequence of byte data.
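The distinction can be illustrated from base R. This sketch uses `writeBin()` purely as a way to look at a vector's payload bytes; it is not how zstdlite reads the data, which happens in C.

```r
x <- c(1L, 2L, 3L)

# The payload of an integer vector is a flat run of 4-byte values
payload <- writeBin(x, raw())
length(payload)   # 12 bytes: 3 integers x 4 bytes each

# The bytes can be read straight back into the same values
roundtrip <- readBin(payload, what = "integer", n = length(x))

# A list has no such flat payload, and writeBin() refuses it
list_ok <- tryCatch({
  writeBin(list(1L, "a"), raw())
  TRUE
}, error = function(e) FALSE)
```

Here `list_ok` ends up `FALSE`, since a list would need each element to be serialized separately.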
How it works.
Compression
- Take a pointer to a standard numeric vector from R (i.e. an SEXP pointer).
- Ignore any attributes or dimension information; just compress the data payload within the object.
- Prefix the compressed data with a 4-byte header giving the SEXP type.
- Return a raw vector to the user containing the compressed bytes.
Decompression
- Strip off the 4 bytes of header information.
- Feed the remaining bytes into the zstd decompression function written in C.
- Use the header to decide what sort of R object this is.
- Decompress the data into an R object of the correct type.
- Return the R object to the user.
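The steps above can be mimicked in base R to make the idea concrete. This is only an illustrative sketch: `memCompress(type = "gzip")` stands in for zstd, the functions `sketch_compress()` and `sketch_decompress()` are hypothetical, and the header layout (type code in the first of 4 bytes, the rest zero) is an assumption rather than zstdlite's actual format. The type codes are R's internal SEXP type numbers.

```r
# SEXP type codes from R's internals, and bytes per element
sexp_codes <- c(logical = 10L, integer = 13L, double = 14L, complex = 15L, raw = 24L)
elem_size  <- c(logical = 4L,  integer = 4L,  double = 8L,  complex = 16L, raw = 1L)

sketch_compress <- function(x) {
  type <- typeof(x)
  stopifnot(type %in% names(sexp_codes))
  # as.vector() drops dims and other attributes; writeBin() yields the payload bytes
  payload <- writeBin(as.vector(x), raw())
  # Assumed layout: 4-byte header, type code first, remaining bytes unused
  header <- as.raw(c(sexp_codes[[type]], 0L, 0L, 0L))
  # gzip is a stand-in here; zstdlite uses zstd
  c(header, memCompress(payload, type = "gzip"))
}

sketch_decompress <- function(buf) {
  # Use the header to decide which type to rebuild
  type <- names(sexp_codes)[match(as.integer(buf[1]), sexp_codes)]
  # Strip the header and decompress the remaining bytes
  payload <- memDecompress(buf[-(1:4)], type = "gzip")
  readBin(payload, what = type, n = length(payload) / elem_size[[type]])
}

ints <- c(1L, 5L, NA, 3L)
ints_back <- sketch_decompress(sketch_compress(ints))

mat <- matrix(c(1.5, 2.5, 3.5, 4.5), nrow = 2)
mat_back <- sketch_decompress(sketch_compress(mat))   # plain vector, dims lost
```

The real package does this work in C directly on the vector's memory, so it avoids the intermediate copy that `writeBin()` makes here.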
Note: matrices and arrays may also be passed to
zstd_compress(), but since
no attributes are retained (e.g. dims), the object returned by
zstd_decompress() can only be a simple vector.
Framing of the compressed data
zstdlite prefixes the standard Zstandard frame with some extra bytes.
- The compressed representation is the compressed data prefixed with
  a custom 8 byte header consisting of
  - 1-byte for SEXP type i.e. INTSXP, RAWSXP, REALSXP or LGLSXP
- This data representation
  - is compatible with the standard Zstandard frame format if the leading bytes are removed.
  - is likely to evolve (so for now, do not plan on compressing something in
    one version of zstdlite and decompressing it in another version.)
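Since the header layout may change between versions, one way to recover the standard frame without hard-coding a header length is to look for the Zstandard magic number (0xFD2FB528, stored little-endian) near the start of the buffer. The helper `find_zstd_frame()` below is hypothetical and not part of zstdlite, and the `fake` buffer is synthetic data used only to exercise it.

```r
# Zstandard frame magic number 0xFD2FB528 as little-endian bytes
zstd_magic <- as.raw(c(0x28, 0xB5, 0x2F, 0xFD))

# Hypothetical helper: 1-based index where the Zstandard frame starts, or NA
find_zstd_frame <- function(buf, max_header = 16L) {
  last <- min(max_header + 1L, length(buf) - 3L)
  for (i in seq_len(max(last, 0L))) {
    if (identical(buf[i:(i + 3L)], zstd_magic)) return(i)
  }
  NA_integer_
}

# Synthetic buffer: 4 made-up header bytes, the magic number, then filler bytes
fake  <- c(as.raw(c(13, 0, 0, 0)), zstd_magic, as.raw(1:10))
start <- find_zstd_frame(fake)

# Everything from `start` onwards would be the standard frame
frame <- fake[start:length(fake)]
```

With output from the package itself, `frame` is what you would hand to another Zstandard implementation.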