Introducing 'xxhashr' - extremely fast hashing with xxHash

xxhashr

xxhashr provides simple access to the very fast hashing functions in xxHash.

The package provides support for hashing of vectors, matrices and arrays which contain raw, integer, real or logical values. To hash arbitrary R objects, use base::serialize() to first turn the object into a vector of raw bytes.

Limitations

  • It is the underlying data of the vector or matrix that is being hashed, and this does not include any notion of the container for that data. This means that a vector and array which contain the same data will hash to the same value - regardless of the dimensions.
  • xxHash v0.7.x includes the experimental xxh3 and xxh128 hash functionx. From the documentation: “The algorithm is currently in development, meaning its return values might still change in future versions. However, the API is stable, and can be used in production, typically for generation of ephemeral hashes (produced and consumed in same session)”.
  • xxHash is a non-cryptographic hash.

Installation

You can install from GitHub with:

# install.package('remotes')
remotes::install_github('coolbutuseless/xxhashr)

Optimisation via Makevars

To get the most out of what xxHash offers, it’s important to set some optimisation flags for your machine. The important compiler flags to set are -O3 and -march=native.

Here are 2 possible ways to do this:

  1. Copy src/Makevars.custom to src/Makevars re-build package.
  2. Edit your ~/.R/Makevars to include CFLAGS = -O3 -march=native (this will change flags for all future compilation, and should probably be used with caution)

Simple hashing

library(xxhashr)

vec <- raw(1e6)

xxhashr::xxhash32(vec)
[1] "5fff0bcb"

xxhashr::xxhash64(vec)
[1] "8a76d36d39caaecc"

xxhashr::xxhash128(vec)
[1] "118fde282639e0a3b5ce7c14a206fb68"

xxhashr::xxh3_64bits(vec)
[1] "b5ce7c14a206fb68"

Hashing 1 million raw bytes

xxhashr::xxh3_64bits() hashes data at around 25 GB/s for this input data size.

Since digest::digest() can hash raw bytes like xxhashr the speeds for xxhashr::xxhash64() and digest(algo = 'xxhash64') are about the same.

Click here to show/hide benchmark code

library(digest)

N   <- 1e6
vec <- raw(N)

res <- bench::mark(
  xxhash32(vec),
  xxhash64(vec),
  xxhash128(vec),
  xxh3_64bits(vec),
  digest(vec, algo = 'xxhash32', serialize = FALSE),
  digest(vec, algo = 'xxhash64', serialize = FALSE),
  check = FALSE
)
Table 1: Hashing 1 million raw bytes
expression median itr/sec GB/s
xxhash32(vec) 158.7µs 6080 5.9
xxhash64(vec) 79.5µs 11952 11.7
xxhash128(vec) 34µs 27687 27.4
xxh3_64bits(vec) 32µs 29544 29.1
digest(vec, algo = “xxhash32”, serialize = FALSE) 310.7µs 3093 3.0
digest(vec, algo = “xxhash64”, serialize = FALSE) 114µs 8394 8.2

Hashing 1 thousand raw bytes

At smaller data sizes, the call overhead becomes more significant, and the overall throughput drops drastically.

xxhashr is about 20x faster than digest::digest() in this case.

Click here to show/hide benchmark code

N <- 1024L
vec <- raw(N)

res <- bench::mark(
  xxhash32(vec),
  xxhash64(vec),
  xxhash128(vec),
  xxh3_64bits(vec),
  digest(vec, algo = 'xxhash32', serialize = FALSE),
  digest(vec, algo = 'xxhash64', serialize = FALSE),
  check = FALSE
)
Table 2: Hashing 1000 raw bytes
expression median itr/sec GB/s
xxhash32(vec) 1.06µs 872697 0.898
xxhash64(vec) 1.01µs 912220 0.946
xxhash128(vec) 1.02µs 908029 0.936
xxh3_64bits(vec) 1.04µs 885477 0.915
digest(vec, algo = “xxhash32”, serialize = FALSE) 31.87µs 29611 0.030
digest(vec, algo = “xxhash64”, serialize = FALSE) 31.44µs 29787 0.030

Hashing 1 million numeric values

In the case of integers, doubles and logical values, xxhashr will hash the data values directly, while digest::digest() must first serialize them to raw bytes.

As a result, xxhashr::xxhash128() is about 100x faster than digest::digest(), and operates at about 20 GB/s (on my machine).

Click here to show/hide benchmark code

N   <- 1e6
vec <- numeric(N)

res <- bench::mark(
  xxhash32(vec),
  xxhash64(vec),
  xxhash128(vec),
  xxh3_64bits(vec),
  digest(vec, algo = 'xxhash32', serialize = TRUE),
  digest(vec, algo = 'xxhash64', serialize = TRUE),
  check = FALSE
)
Table 3: Hashing 1 million numeric values
expression median itr/sec GB/s
xxhash32(vec) 1.33ms 734 5.6
xxhash64(vec) 679.23µs 1437 11.0
xxhash128(vec) 313.31µs 2960 23.8
xxh3_64bits(vec) 295.4µs 3116 25.2
digest(vec, algo = “xxhash32”, serialize = TRUE) 36.94ms 27 0.2
digest(vec, algo = “xxhash64”, serialize = TRUE) 32.09ms 31 0.2

Hashing 1 million integer values

In the case of integers, doubles and logical values, xxhashr will hash the data values directly, while digest::digest() must first serialize them to raw bytes.

The hashFunction package offers a few hashes that work directly with integer values.

xxhashr::xxhash128() is about 100x faster than digest::digest(), and about 5x faster than hashFunction hashes.

Click here to show/hide benchmark code

library(hashFunction)
N <- 1e6
vec <- integer(N)

res <- bench::mark(
  xxhash32(vec),
  xxhash64(vec),
  xxhash128(vec),
  xxh3_64bits(vec),
  digest(vec, algo = 'xxhash32', serialize = TRUE),
  digest(vec, algo = 'xxhash64', serialize = TRUE),
  murmur3.32(vec),
  cityhash.64(vec),
  spooky.32(vec),
  check = FALSE
)
Table 4: Hashing 1 million integer values
expression median itr/sec GB/s
xxhash32(vec) 640.3µs 1499 5.8
xxhash64(vec) 325.7µs 2916 11.4
xxhash128(vec) 140.2µs 6710 26.6
xxh3_64bits(vec) 128.1µs 7071 29.1
digest(vec, algo = “xxhash32”, serialize = TRUE) 13.8ms 71 0.3
digest(vec, algo = “xxhash64”, serialize = TRUE) 12.7ms 78 0.3
murmur3.32(vec) 563.2µs 1586 6.6
cityhash.64(vec) 636.8µs 947 5.9
spooky.32(vec) 602.3µs 1134 6.2

Acknowledgements

  • Yann Collett for releasing, maintaining and advancing xxHash
  • R Core for developing and maintaining such a great language.
  • CRAN maintainers, for patiently shepherding packages onto CRAN and maintaining the repository