xxhashr
xxhashr
provides simple access to the very fast hashing functions
in xxHash.
The package provides support for hashing of vectors, matrices and arrays
which contain raw, integer, real or logical values. To hash arbitrary R
objects, use base::serialize()
to first turn the object into
a vector of raw bytes.
Limitations
- It is the underlying data of the vector or matrix that is being hashed, and this does not include any notion of the container for that data. This means that a vector and array which contain the same data will hash to the same value - regardless of the dimensions.
xxHash v0.7.x
includes the experimentalxxh3
andxxh128
hash functionx. From the documentation: “The algorithm is currently in development, meaning its return values might still change in future versions. However, the API is stable, and can be used in production, typically for generation of ephemeral hashes (produced and consumed in same session)”.xxHash
is a non-cryptographic hash.
Installation
You can install from GitHub with:
# install.package('remotes')
remotes::install_github('coolbutuseless/xxhashr)
Optimisation via Makevars
To get the most out of what xxHash
offers, it’s important to set
some optimisation flags for your machine. The important compiler flags to set
are -O3
and -march=native
.
Here are 2 possible ways to do this:
- Copy
src/Makevars.custom
tosrc/Makevars
re-build package. - Edit your
~/.R/Makevars
to includeCFLAGS = -O3 -march=native
(this will change flags for all future compilation, and should probably be used with caution)
Simple hashing
library(xxhashr)
vec <- raw(1e6)
xxhashr::xxhash32(vec)
[1] "5fff0bcb"
xxhashr::xxhash64(vec)
[1] "8a76d36d39caaecc"
xxhashr::xxhash128(vec)
[1] "118fde282639e0a3b5ce7c14a206fb68"
xxhashr::xxh3_64bits(vec)
[1] "b5ce7c14a206fb68"
Hashing 1 million raw bytes
xxhashr::xxh3_64bits()
hashes data at around 25 GB/s for this input data size.
Since digest::digest()
can hash raw bytes like xxhashr
the speeds for xxhashr::xxhash64()
and digest(algo = 'xxhash64')
are about the same.
Click here to show/hide benchmark code
library(digest)
N <- 1e6
vec <- raw(N)
res <- bench::mark(
xxhash32(vec),
xxhash64(vec),
xxhash128(vec),
xxh3_64bits(vec),
digest(vec, algo = 'xxhash32', serialize = FALSE),
digest(vec, algo = 'xxhash64', serialize = FALSE),
check = FALSE
)
expression | median | itr/sec | GB/s |
---|---|---|---|
xxhash32(vec) | 158.7µs | 6080 | 5.9 |
xxhash64(vec) | 79.5µs | 11952 | 11.7 |
xxhash128(vec) | 34µs | 27687 | 27.4 |
xxh3_64bits(vec) | 32µs | 29544 | 29.1 |
digest(vec, algo = “xxhash32”, serialize = FALSE) | 310.7µs | 3093 | 3.0 |
digest(vec, algo = “xxhash64”, serialize = FALSE) | 114µs | 8394 | 8.2 |
Hashing 1 thousand raw bytes
At smaller data sizes, the call overhead becomes more significant, and the overall throughput drops drastically.
xxhashr
is about 20x faster than digest::digest()
in this case.
Click here to show/hide benchmark code
N <- 1024L
vec <- raw(N)
res <- bench::mark(
xxhash32(vec),
xxhash64(vec),
xxhash128(vec),
xxh3_64bits(vec),
digest(vec, algo = 'xxhash32', serialize = FALSE),
digest(vec, algo = 'xxhash64', serialize = FALSE),
check = FALSE
)
expression | median | itr/sec | GB/s |
---|---|---|---|
xxhash32(vec) | 1.06µs | 872697 | 0.898 |
xxhash64(vec) | 1.01µs | 912220 | 0.946 |
xxhash128(vec) | 1.02µs | 908029 | 0.936 |
xxh3_64bits(vec) | 1.04µs | 885477 | 0.915 |
digest(vec, algo = “xxhash32”, serialize = FALSE) | 31.87µs | 29611 | 0.030 |
digest(vec, algo = “xxhash64”, serialize = FALSE) | 31.44µs | 29787 | 0.030 |
Hashing 1 million numeric values
In the case of integers, doubles and logical values, xxhashr
will hash the
data values directly, while digest::digest()
must first serialize them to
raw bytes.
As a result, xxhashr::xxhash128()
is about 100x faster than digest::digest()
,
and operates at about 20 GB/s (on my machine).
Click here to show/hide benchmark code
N <- 1e6
vec <- numeric(N)
res <- bench::mark(
xxhash32(vec),
xxhash64(vec),
xxhash128(vec),
xxh3_64bits(vec),
digest(vec, algo = 'xxhash32', serialize = TRUE),
digest(vec, algo = 'xxhash64', serialize = TRUE),
check = FALSE
)
expression | median | itr/sec | GB/s |
---|---|---|---|
xxhash32(vec) | 1.33ms | 734 | 5.6 |
xxhash64(vec) | 679.23µs | 1437 | 11.0 |
xxhash128(vec) | 313.31µs | 2960 | 23.8 |
xxh3_64bits(vec) | 295.4µs | 3116 | 25.2 |
digest(vec, algo = “xxhash32”, serialize = TRUE) | 36.94ms | 27 | 0.2 |
digest(vec, algo = “xxhash64”, serialize = TRUE) | 32.09ms | 31 | 0.2 |
Hashing 1 million integer values
In the case of integers, doubles and logical values, xxhashr
will hash the
data values directly, while digest::digest()
must first serialize them to
raw bytes.
The hashFunction
package offers a few hashes that work directly with integer values.
xxhashr::xxhash128()
is about 100x faster than digest::digest()
, and about
5x faster than hashFunction
hashes.
Click here to show/hide benchmark code
library(hashFunction)
N <- 1e6
vec <- integer(N)
res <- bench::mark(
xxhash32(vec),
xxhash64(vec),
xxhash128(vec),
xxh3_64bits(vec),
digest(vec, algo = 'xxhash32', serialize = TRUE),
digest(vec, algo = 'xxhash64', serialize = TRUE),
murmur3.32(vec),
cityhash.64(vec),
spooky.32(vec),
check = FALSE
)
expression | median | itr/sec | GB/s |
---|---|---|---|
xxhash32(vec) | 640.3µs | 1499 | 5.8 |
xxhash64(vec) | 325.7µs | 2916 | 11.4 |
xxhash128(vec) | 140.2µs | 6710 | 26.6 |
xxh3_64bits(vec) | 128.1µs | 7071 | 29.1 |
digest(vec, algo = “xxhash32”, serialize = TRUE) | 13.8ms | 71 | 0.3 |
digest(vec, algo = “xxhash64”, serialize = TRUE) | 12.7ms | 78 | 0.3 |
murmur3.32(vec) | 563.2µs | 1586 | 6.6 |
cityhash.64(vec) | 636.8µs | 947 | 5.9 |
spooky.32(vec) | 602.3µs | 1134 | 6.2 |
Acknowledgements
- Yann Collett for releasing, maintaining and advancing xxHash
- R Core for developing and maintaining such a great language.
- CRAN maintainers, for patiently shepherding packages onto CRAN and maintaining the repository