mikefc

Problem - Finding the intersection of multiple integer vectors

  • I want to find the intersection of multiple integer vectors
  • Unbounded number of vectors, but usually 4-10
  • Unbounded size, but usually 10 to a million elements
  • Duplicated elements allowed in output
  • Order not important
  • Submissions welcomed! (including alternate test sets that demonstrate a different dominant solution)

Test set

  • Benchmarking data:
    • 10 vectors
    • Varying lengths
    • Elements sampled from range (1, 1e5)
  • I realise that this benchmarking data ignores the following cases
    • a few short vectors
    • very long vectors
    • vectors already sorted by size
    • vectors with large overlap
    • vectors with no overlap
set.seed(1)

source <- seq.int(1e5)
a <- sample(source, size=1.0e5, replace=TRUE)
b <- sample(source, size=0.9e5, replace=TRUE)
c <- sample(source, size=0.8e5, replace=TRUE)
d <- sample(source, size=0.7e5, replace=TRUE)
e <- sample(source, size=0.6e5, replace=TRUE)
f <- sample(source, size=0.5e5, replace=TRUE)
g <- sample(source, size=0.4e5, replace=TRUE)
h <- sample(source, size=0.3e5, replace=TRUE)
i <- sample(source, size=0.2e5, replace=TRUE)
j <- sample(source, size=0.1e5, replace=TRUE)

for loop

intersection_0_forloop <- function(...) {
  ll <- list(...)
  working_set <- ll[[1]]
  for (l in ll[-1]) {
    working_set <- intersect(working_set, l)
  }
  
  working_set
}
sort(unique(intersection_0_forloop(a, b, c, d, e, f, g, h, i)))
 [1]  2142  3076  3165  7103 17374 19906 20411 26568 32501 49176 49769
[12] 53436 58906 63001 64146 66826 67446 68860 71632 71789 73612 76367
[23] 78014 81028 86947 91082 98504

Reduce()

  • Acculmulate intersect() vector by vector using Reduce()
  • Under the hood, this works identical to the for loop example above…
  • … but it’s much better looking :)
intersection_1_reduce <- function(...) {
  Reduce(intersect, list(...))
}
sort(unique(intersection_1_reduce(a, b, c, d, e, f, g, h, i)))
 [1]  2142  3076  3165  7103 17374 19906 20411 26568 32501 49176 49769
[12] 53436 58906 63001 64146 66826 67446 68860 71632 71789 73612 76367
[23] 78014 81028 86947 91082 98504

Sort by length and then Reduce()

  • Acculmulate intersect() vector by vector using Reduce()
  • Start with the shortest vector first.
  • By ensuring that the intermediate result is always as short as possible, it should reduce the memory allocations.
intersection_2_reduce_sorted <- function(...) {
  ll   <- list(...)
  lens <- vapply(ll, length, integer(1))
  ll   <- ll[order(lens)]
  Reduce(intersect, ll)
}
sort(unique(intersection_2_reduce_sorted(a, b, c, d, e, f, g, h, i)))
 [1]  2142  3076  3165  7103 17374 19906 20411 26568 32501 49176 49769
[12] 53436 58906 63001 64146 66826 67446 68860 71632 71789 73612 76367
[23] 78014 81028 86947 91082 98504

Tally vector

  • Initialise a tally vector wtih length equal to the maximum integer across all the vectors.
  • For each element in each vector, increase the tally for that element by 1.
  • Find all counts in the tally vector which match the number of vectors given.
  • This will be increasingly inefficient as maxval gets larger and larger.
intersection_3_tally <- function(...) {
  ll     <- list(...)
  maxval <- max(vapply(ll, max, integer(1)))
  tally  <- integer(maxval)
  for (elements in ll) {
    tally[elements] <- tally[elements] + 1L
  }
  which(tally == length(ll))
}
sort(unique(intersection_3_tally(a, b, c, d, e, f, g, h, i)))
 [1]  2142  3076  3165  7103 17374 19906 20411 26568 32501 49176 49769
[12] 53436 58906 63001 64146 66826 67446 68860 71632 71789 73612 76367
[23] 78014 81028 86947 91082 98504

Rcpp function

  • An Rcpp integer-only version of intersect()
  • An R wrapper to go with it.
  • side effect: original vectors are sorted numerically.
Rcpp::cppFunction(
"std::vector<int> intersection_rcpp_core(IntegerVector v1, IntegerVector v2) {
  std::sort(v1.begin(), v1.end());
  std::sort(v2.begin(), v2.end());
  std::vector<int> v_intersection;

  std::set_intersection(v1.begin(), v1.end(),
                        v2.begin(), v2.end(),
                        std::back_inserter(v_intersection));
                        
  return v_intersection;
}
")


intersection_4_rcpp <- function(...) {
  Reduce(intersection_rcpp_core, list(...))
}
sort(unique(intersection_4_rcpp(a, b, c, d, e, f, g, h, i)))
 [1]  2142  3076  3165  7103 17374 19906 20411 26568 32501 49176 49769
[12] 53436 58906 63001 64146 66826 67446 68860 71632 71789 73612 76367
[23] 78014 81028 86947 91082 98504

Benchmark

bm <- bench::mark(
  intersection_0_forloop      (a, b, c, d, e, f, g, h, i, j),
  intersection_1_reduce       (a, b, c, d, e, f, g, h, i, j),
  intersection_2_reduce_sorted(a, b, c, d, e, f, g, h, i, j),
  intersection_3_tally        (a, b, c, d, e, f, g, h, i, j),
  intersection_4_rcpp         (a, b, c, d, e, f, g, h, i, j)
)
expression median itr/sec mem_alloc
intersection_0_forloop(a, b, c, d, e, f, g, h, i, j) 65.83ms 15.17716 10.55MB
intersection_1_reduce(a, b, c, d, e, f, g, h, i, j) 65.86ms 15.18267 10.56MB
intersection_2_reduce_sorted(a, b, c, d, e, f, g, h, i, j) 36.5ms 27.27522 7.24MB
intersection_3_tally(a, b, c, d, e, f, g, h, i, j) 7.55ms 132.92267 3.24MB
intersection_4_rcpp(a, b, c, d, e, f, g, h, i, j) 2.93ms 338.85640 357.12KB

Summary

  • Under-the-hood, the for loop and the Reduce() methods are identical and this is evidenced by their identical runtimes.
  • Sorting input vectors by length results in a 2x speed-up over the base case for this benchmark example (which is pretty pathological with the vectors in order of decreasing size)
  • Using a tally vector instead of the builtin intersect() results in a ~9x speedup over the base case.
  • Rcpp is clearly the fastest with least memory use.
    • This could be extended to be a pure Rcpp function with the functionality of the Reduce() done in C++
    • Could also implement the tally/memory technique in Rcpp
  • Contributions welcomed!