xxhashr

xxhashr provides simple access to the very fast hashing functions in xxHash.

The package provides support for hashing of vectors, matrices and arrays which contain raw, integer, real or logical values. To hash arbitrary R objects, use base::serialize() to first turn the object into a vector of raw bytes.

Limitations

It is the underlying data of the vector or matrix that is being hashed, and this does not include any notion of the container for that data. This means that a vector and array which contain the same data will hash to the same value - regardless of the dimensions.
xxHash v0.7.x includes the experimental xxh3 and xxh128 hash functionx. From the documentation: “The algorithm is currently in development, meaning its return values might still change in future versions. However, the API is stable, and can be used in production, typically for generation of ephemeral hashes (produced and consumed in same session)”.
xxHash is a non-cryptographic hash.

Installation

You can install from GitHub with:

# install.package('remotes')
remotes::install_github('coolbutuseless/xxhashr)

Optimisation via `Makevars`

To get the most out of what xxHash offers, it’s important to set some optimisation flags for your machine. The important compiler flags to set are -O3 and -march=native.

Here are 2 possible ways to do this:

Copy src/Makevars.custom to src/Makevars re-build package.
Edit your ~/.R/Makevars to include CFLAGS = -O3 -march=native (this will change flags for all future compilation, and should probably be used with caution)

Simple hashing

library(xxhashr)

vec <- raw(1e6)

xxhashr::xxhash32(vec)
[1] "5fff0bcb"

xxhashr::xxhash64(vec)
[1] "8a76d36d39caaecc"

xxhashr::xxhash128(vec)
[1] "118fde282639e0a3b5ce7c14a206fb68"

xxhashr::xxh3_64bits(vec)
[1] "b5ce7c14a206fb68"

Hashing 1 million raw bytes

xxhashr::xxh3_64bits() hashes data at around 25 GB/s for this input data size.

Since digest::digest() can hash raw bytes like xxhashr the speeds for xxhashr::xxhash64() and digest(algo = 'xxhash64') are about the same.

Click here to show/hide benchmark code

library(digest)

N   <- 1e6
vec <- raw(N)

res <- bench::mark(
  xxhash32(vec),
  xxhash64(vec),
  xxhash128(vec),
  xxh3_64bits(vec),
  digest(vec, algo = 'xxhash32', serialize = FALSE),
  digest(vec, algo = 'xxhash64', serialize = FALSE),
  check = FALSE
)

Table 1: Hashing 1 million raw bytes
expression	median	itr/sec	GB/s
xxhash32(vec)	158.7µs	6080	5.9
xxhash64(vec)	79.5µs	11952	11.7
xxhash128(vec)	34µs	27687	27.4
xxh3_64bits(vec)	32µs	29544	29.1
digest(vec, algo = “xxhash32”, serialize = FALSE)	310.7µs	3093	3.0
digest(vec, algo = “xxhash64”, serialize = FALSE)	114µs	8394	8.2

Hashing 1 thousand raw bytes

At smaller data sizes, the call overhead becomes more significant, and the overall throughput drops drastically.

xxhashr is about 20x faster than digest::digest() in this case.

Click here to show/hide benchmark code

N <- 1024L
vec <- raw(N)

res <- bench::mark(
  xxhash32(vec),
  xxhash64(vec),
  xxhash128(vec),
  xxh3_64bits(vec),
  digest(vec, algo = 'xxhash32', serialize = FALSE),
  digest(vec, algo = 'xxhash64', serialize = FALSE),
  check = FALSE
)

Table 2: Hashing 1000 raw bytes
expression	median	itr/sec	GB/s
xxhash32(vec)	1.06µs	872697	0.898
xxhash64(vec)	1.01µs	912220	0.946
xxhash128(vec)	1.02µs	908029	0.936
xxh3_64bits(vec)	1.04µs	885477	0.915
digest(vec, algo = “xxhash32”, serialize = FALSE)	31.87µs	29611	0.030
digest(vec, algo = “xxhash64”, serialize = FALSE)	31.44µs	29787	0.030

Hashing 1 million numeric values

In the case of integers, doubles and logical values, xxhashr will hash the data values directly, while digest::digest() must first serialize them to raw bytes.

As a result, xxhashr::xxhash128() is about 100x faster than digest::digest(), and operates at about 20 GB/s (on my machine).

Click here to show/hide benchmark code

N   <- 1e6
vec <- numeric(N)

res <- bench::mark(
  xxhash32(vec),
  xxhash64(vec),
  xxhash128(vec),
  xxh3_64bits(vec),
  digest(vec, algo = 'xxhash32', serialize = TRUE),
  digest(vec, algo = 'xxhash64', serialize = TRUE),
  check = FALSE
)

Table 3: Hashing 1 million numeric values
expression	median	itr/sec	GB/s
xxhash32(vec)	1.33ms	734	5.6
xxhash64(vec)	679.23µs	1437	11.0
xxhash128(vec)	313.31µs	2960	23.8
xxh3_64bits(vec)	295.4µs	3116	25.2
digest(vec, algo = “xxhash32”, serialize = TRUE)	36.94ms	27	0.2
digest(vec, algo = “xxhash64”, serialize = TRUE)	32.09ms	31	0.2

Hashing 1 million integer values

In the case of integers, doubles and logical values, xxhashr will hash the data values directly, while digest::digest() must first serialize them to raw bytes.

The hashFunction package offers a few hashes that work directly with integer values.

xxhashr::xxhash128() is about 100x faster than digest::digest(), and about 5x faster than hashFunction hashes.

Click here to show/hide benchmark code

library(hashFunction)
N <- 1e6
vec <- integer(N)

res <- bench::mark(
  xxhash32(vec),
  xxhash64(vec),
  xxhash128(vec),
  xxh3_64bits(vec),
  digest(vec, algo = 'xxhash32', serialize = TRUE),
  digest(vec, algo = 'xxhash64', serialize = TRUE),
  murmur3.32(vec),
  cityhash.64(vec),
  spooky.32(vec),
  check = FALSE
)

Table 4: Hashing 1 million integer values
expression	median	itr/sec	GB/s
xxhash32(vec)	640.3µs	1499	5.8
xxhash64(vec)	325.7µs	2916	11.4
xxhash128(vec)	140.2µs	6710	26.6
xxh3_64bits(vec)	128.1µs	7071	29.1
digest(vec, algo = “xxhash32”, serialize = TRUE)	13.8ms	71	0.3
digest(vec, algo = “xxhash64”, serialize = TRUE)	12.7ms	78	0.3
murmur3.32(vec)	563.2µs	1586	6.6
cityhash.64(vec)	636.8µs	947	5.9
spooky.32(vec)	602.3µs	1134	6.2

Acknowledgements

Yann Collett for releasing, maintaining and advancing xxHash
R Core for developing and maintaining such a great language.
CRAN maintainers, for patiently shepherding packages onto CRAN and maintaining the repository

Introducing 'xxhashr' - extremely fast hashing with xxHash

xxhashr

Limitations

Installation

Optimisation via Makevars

Simple hashing

Hashing 1 million raw bytes

Hashing 1 thousand raw bytes

Hashing 1 million numeric values

Hashing 1 million integer values

Related Software

Acknowledgements

Optimisation via `Makevars`