lz4lite
lz4lite provides access to the extremely fast compression in lz4 for performing in-memory compression.
The scope of this package is limited: it aims to provide functions for direct compression of vectors
which contain raw, integer, real or logical values. To compress arbitrary R objects, you must first
convert them into a raw vector representation using base::serialize().
For a more general solution to fast serialization of R objects, see the fst or qs packages.
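As a minimal sketch of the serialize-then-compress route described above, the following turns an arbitrary R object into a raw vector with base::serialize() and restores it with base::unserialize(). The lz4_compress() and lz4_decompress() calls are the ones used in the examples in this README; the requireNamespace() guard is only an assumption added so the sketch still runs when lz4lite is not installed.

```r
# Any R object (here a list with mixed contents) can be turned into raw bytes.
obj <- list(a = 1:10, b = letters, c = matrix(runif(6), nrow = 2))
raw_bytes <- serialize(obj, connection = NULL)

if (requireNamespace("lz4lite", quietly = TRUE)) {
  # Compress the raw vector, then reverse both steps.
  compressed <- lz4lite::lz4_compress(raw_bytes)
  restored   <- unserialize(lz4lite::lz4_decompress(compressed))
} else {
  # Fallback without lz4lite: only the serialize/unserialize round trip.
  restored <- unserialize(raw_bytes)
}

# The restored object should match the original, attributes included.
identical(obj, restored)
```

Because the attributes travel inside the serialized bytes, this route keeps names and dimensions, at the cost of the extra serialization step.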
The lz4 code currently bundled with this package is v1.9.3.
Design Choices
lz4lite will compress the data payload within a numeric-ish vector, and not the R object itself.
Limitations
- Only the data payload of the vector is compressed. The container for that data is not included, i.e. dimensions and other attributes are not compressed with the data.
- Values must be of type: raw, integer, real or logical.
- Decompressed values are always returned as a vector i.e. all dimensional information is lost during compression.
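One way to work around the loss of dimensions is to record them before compression and re-attach them afterwards. The sketch below assumes that approach; the variable names are illustrative, and the fallback branch merely mimics the attribute-free result so the example runs without lz4lite installed.

```r
m <- matrix(1:12, nrow = 3)

# Record the attribute that the compressed payload will not carry.
saved_dim <- dim(m)

if (requireNamespace("lz4lite", quietly = TRUE)) {
  flat <- lz4lite::lz4_decompress(lz4lite::lz4_compress(m))
} else {
  # Stand-in for the round trip: a plain vector with no attributes.
  flat <- as.vector(m)
}

# Re-attach the dimensions by hand.
restored <- flat
dim(restored) <- saved_dim

identical(m, restored)
```

The same pattern extends to other attributes (for example names), which would need to be stored alongside the compressed bytes by the caller.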
What’s in the box
lz4_compress()
- compress the data within a vector of raw, integer, real or logical values
- set use_hc = TRUE to use the High Compression variant of LZ4. This variant can be slow to compress, but it achieves higher compression ratios and retains the fast decompression speed, i.e. multiple gigabytes per second.
lz4_decompress()
- decompress a compressed representation that was created with lz4_compress()
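A short round-trip sketch of the two functions, using only the arguments that appear in this README (use_hc). The round_trip_ok flag and the requireNamespace() guard are illustrative additions, not part of the package.

```r
# Repetitive real-valued data, including missing values.
x <- rep(c(1.5, 2.5, NA), times = 1000)

round_trip_ok <- NA  # stays NA when lz4lite is not installed

if (requireNamespace("lz4lite", quietly = TRUE)) {
  fast <- lz4lite::lz4_compress(x)                 # default LZ4
  hc   <- lz4lite::lz4_compress(x, use_hc = TRUE)  # High Compression variant

  # Both variants are read back with the same decompression function.
  round_trip_ok <- identical(x, lz4lite::lz4_decompress(fast)) &&
    identical(x, lz4lite::lz4_decompress(hc))
}

round_trip_ok
```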
Installation
You can install from GitHub with:
# install.packages('remotes')
remotes::install_github('coolbutuseless/lz4lite')
Compressing 1 million integers
lz4lite supports the direct compression of raw, integer, real and logical vectors.
On this test data, compression speed is ~600 MB/s, and decompression speed is ~3 GB/s.
library(lz4lite)
N <- 1e6
input_ints <- sample(1:5, N, prob = (1:5)^2, replace = TRUE)
compressed_lo <- lz4_compress(input_ints)
compressed_hi <- lz4_compress(input_ints, use_hc = TRUE, hc_level = 12)
Benchmark code:
library(lz4lite)
res <- bench::mark(
lz4_compress(input_ints, acc = 1),
lz4_compress(input_ints, acc = 10),
lz4_compress(input_ints, acc = 20),
lz4_compress(input_ints, acc = 50),
lz4_compress(input_ints, acc = 100),
lz4_compress(input_ints, use_hc = TRUE, hc_level = 1),
lz4_compress(input_ints, use_hc = TRUE, hc_level = 2),
lz4_compress(input_ints, use_hc = TRUE, hc_level = 4),
lz4_compress(input_ints, use_hc = TRUE, hc_level = 8),
lz4_compress(input_ints, use_hc = TRUE, hc_level = 12),
check = FALSE
)
expression | median | itr/sec | MB/s | compression_ratio |
---|---|---|---|---|
lz4_compress(input_ints, acc = 1) | 6.36ms | 157 | 599.7 | 0.306 |
lz4_compress(input_ints, acc = 10) | 6.19ms | 162 | 616.6 | 0.306 |
lz4_compress(input_ints, acc = 20) | 6.11ms | 163 | 624.5 | 0.306 |
lz4_compress(input_ints, acc = 50) | 6.13ms | 162 | 622.1 | 0.306 |
lz4_compress(input_ints, acc = 100) | 6.19ms | 163 | 616.7 | 0.306 |
lz4_compress(input_ints, use_hc = TRUE, hc_level = 1) | 34.38ms | 29 | 110.9 | 0.294 |
lz4_compress(input_ints, use_hc = TRUE, hc_level = 2) | 33.76ms | 29 | 113.0 | 0.294 |
lz4_compress(input_ints, use_hc = TRUE, hc_level = 4) | 67.82ms | 15 | 56.3 | 0.233 |
lz4_compress(input_ints, use_hc = TRUE, hc_level = 8) | 453.89ms | 2 | 8.4 | 0.167 |
lz4_compress(input_ints, use_hc = TRUE, hc_level = 12) | 11.34s | 0 | 0.3 | 0.122 |
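The MB/s and compression_ratio columns are not produced by bench::mark() itself. The figures above appear consistent with throughput computed as uncompressed bytes (4 bytes per integer) divided by the median time, with 1 MB taken as 1024^2 bytes, and with the ratio taken as compressed size over uncompressed size. Both are assumptions inferred from the numbers, not taken from the benchmark source. A sketch in base R, with the median copied from the acc = 1 row and a hypothetical compression_ratio() helper:

```r
N          <- 1e6
bytes_in   <- N * 4      # an R integer occupies 4 bytes
median_sec <- 6.36e-3    # median from the acc = 1 row above

# Throughput in MB/s, taking 1 MB = 1024^2 bytes (assumed).
mb_per_sec <- bytes_in / 1024^2 / median_sec
mb_per_sec

# Hypothetical helper: compressed bytes relative to uncompressed bytes.
compression_ratio <- function(compressed, n_bytes) {
  length(compressed) / n_bytes
}

# Placeholder raw vector of the size implied by a ratio of 0.306.
compression_ratio(raw(1224000), bytes_in)
```

The computed throughput lands close to the tabulated 599.7; the small difference is what you would expect from the median being rounded to three significant figures in the table.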
Decompressing 1 million integers
Decompression speed varies slightly depending upon the compressed size.
Benchmark code:
res <- bench::mark(
lz4_decompress(compressed_lo),
lz4_decompress(compressed_hi)
)
expression | median | itr/sec | MB/s |
---|---|---|---|
lz4_decompress(compressed_lo) | 1.52ms | 633 | 2504.7 |
lz4_decompress(compressed_hi) | 1.14ms | 897 | 3349.9 |
Technical bits
How it works
Compression
- Take a pointer to a standard numeric vector from R (an SEXP).
- Ignoring any attributes or dimensions, compress the data payload within the object.
- Prefix the compressed data with an 8-byte header giving the size and SEXP type.
- Return a raw vector to the user containing the compressed bytes.
Decompression
- Strip off the header information
- Feed the raw bytes into the C LZ4 decompression function
- Use the header to decide what sort of R object this is
- Decompress the data into an R object of the correct type.
- Return the R object to the user
Note: matrices and arrays may also be passed to lz4_compress(), but since no attributes are retained (e.g. dims), the uncompressed object returned by lz4_decompress() can only be a simple vector.
Framing of the compressed data
lz4lite does not use the standard LZ4 frame to store data.
- The compressed representation is the compressed data prefixed with a custom 8-byte header consisting of
  - 3 bytes: the characters ‘LZ4’
  - 1 byte for the SEXP type, i.e. INTSXP, RAWSXP, REALSXP or LGLSXP
  - 4 bytes representing an integer, i.e. the number of bytes in the original uncompressed data
- This data representation
  - is not compatible with the standard LZ4 frame format.
  - is likely to evolve (so currently do not plan on compressing something in one version of lz4lite and decompressing in another version).
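To make the header layout concrete, here is a base R sketch that builds and parses an 8-byte header of the shape described above. It is an illustration only: make_header() and parse_header() are hypothetical helpers, the little-endian byte order is an assumption, and while the SEXP type codes are the values defined in R's C headers, whether lz4lite stores exactly those values in the type byte is also an assumption.

```r
# SEXP type codes as defined in R's C headers (Rinternals.h).
sexp_codes <- c(logical = 10L, integer = 13L, double = 14L, raw = 24L)

# Hypothetical helper: build an 8-byte header for a vector `x`.
make_header <- function(x) {
  bytes_per_element <- switch(typeof(x),
    logical = 4L, integer = 4L, double = 8L, raw = 1L,
    stop("unsupported type: ", typeof(x))
  )
  n_bytes <- bytes_per_element * length(x)
  c(
    charToRaw("LZ4"),                       # 3 bytes: magic characters
    as.raw(sexp_codes[[typeof(x)]]),        # 1 byte: SEXP type
    writeBin(as.integer(n_bytes), raw(),    # 4 bytes: uncompressed size
             size = 4L, endian = "little")
  )
}

# Hypothetical helper: read the fields back out of a header.
parse_header <- function(hdr) {
  stopifnot(is.raw(hdr), length(hdr) >= 8L,
            identical(rawToChar(hdr[1:3]), "LZ4"))
  list(
    type    = names(sexp_codes)[match(as.integer(hdr[4]), sexp_codes)],
    n_bytes = readBin(hdr[5:8], what = "integer", size = 4L,
                      endian = "little")
  )
}

hdr <- make_header(c(1.5, 2.5, 3.5))
length(hdr)
parse_header(hdr)
```

A 4-byte signed size field caps the uncompressed payload at roughly 2 GB, which is one reason a format like this might change between versions.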