Finding a length n needle in a haystack

Problem: Find index of a short vector (needle) inside a larger vector (haystack)

What’s a good way to find the location of a needle in a haystack?

haystack <- sample(0:12, size = 2000, replace = TRUE)
needle   <- c(2L, 10L, 8L)

The haystack may be large.
The needle is definitely a vector
Only the index for the first occurrence of needle within haystack is needed
Assuming that there is always at least 1 occurrence

Solution 1: `for` loop

Naively test every position in a for loop
Advantage of this method over a vectorised approach, is that it can terminate early i.e. as soon as a value is found

forloop_find <- function(needle, haystack) {
  n <- length(needle) - 1L
  for (i in seq(haystack)) {
    if (identical(haystack[i:(i+n)], needle)) {
      return(i)
    }
  }
}

forloop_find(needle, haystack)

## [1] 931

Solution 2: `vapply`

vapply_find <- function(needle, haystack) {
  n <- length(needle) - 1L
  which(vapply(seq(haystack), function(i) {identical(haystack[i:(i+n)], needle)}, FUN.VALUE = logical(1L)))[1L]
}

vapply_find(needle, haystack)

## [1] 931

Solution 3: `zoo::rollapply`

rollapply_find <- function(needle, haystack) {
  which(zoo::rollapply(haystack, length(needle), identical, needle))[1L]
}

rollapply_find(needle, haystack)

## [1] 931

Solution 4: stacked `lead()` finding

for each item in the needle
- shift the haystack and check for the needle element
- stack the shifted haystacks and sum them
- check where sum == length(needle)

lead_find <- function(needle, haystack) {
  v  <- haystack == needle[1]
  for (i in seq(2, length(needle))) {
    v <- v + (lead(haystack, i-1L) == needle[i])
  }
  
  which(v == length(needle))[1L]
}

lead_find(needle, haystack)

## [1] 931

Solution 5: `shift`

data.table::shift is a vectorised version of dplyr::lead()
Adapted from user “Jaap” on stackoverflow

shift_find <- function(needle, haystack) {
  shifted_haystack <- data.table::shift(haystack, type='lead', 0:(length(needle)-1))
  
  v <- Map('==', shifted_haystack, needle)
  v <- Reduce('+', v)
  
  which(v == length(needle))[1]
}

shift_find(needle, haystack)

## [1] 931

Solution 5: Rcpp `for` loop

update 2018-04-04
- Shaved a few % points off the runtime by using the value of j instead of an explicit match count.

Rcpp::cppFunction('int rcpp_find(NumericVector needle, NumericVector haystack) {
    int needle_length   = needle.size();
    int haystack_length = haystack.size();

    for (int i = 0; i < (haystack_length - needle_length); i++ ) {
        int j;
        for (j = 0; j < needle_length; j++ ) {
                if (needle[j] != haystack[i + j]) {
                    break;
                }
        }

        if (j == needle_length) {
            return(i+1);
        }
    }

    return(0);

    }')


rcpp_find(needle, haystack)

## [1] 931

Solution 6: `pile` find (by dmytro)

dmytro on twitter offered this one.

pile_find <- function(needle, haystack) { 
  index <- seq_along(haystack) 
  for (i in seq_along(needle)) { 
    pile <- haystack[index] 
    index <- index[pile==needle[i]] + 1L
  } 
  (index-length(needle))[1]
} 

pile_find(needle, haystack)

## [1] 931

Solution 7: Sieved find (by carroll_jono)

carroll_jono on twitter offered this one.
This is the fastest of the pure-R methods.

sieved_find <- function(needle, haystack) { 
  sieved <- which(haystack == needle[1L]) 
  for (i in seq.int(1L, length(needle)-1L)) {
    sieved <- sieved[haystack[sieved + i] == needle[i + 1L]]
  }
  sieved[1L] 
}

sieved_find(needle, haystack)

## [1] 931

Solution 8: Coerce to string and use grep

Convert each integer into a string e.g. c(1, 2, 3) becomes “001.002.003”
Current tests are with integers less than 1000, so I’m safe to use %03i as a format string.

grep_find <- function(needle, haystack) {
  expanded_width <- 4  # 3 digits plus a "."
  res <- stringr::str_locate_all(paste(sprintf("%03i", haystack), collapse="."), paste(sprintf("%03i", needle), collapse="."))[[1]][,1][[1]]
  
  (res-1)/expanded_width + 1
}

grep_find(needle, haystack)

## [1] 931

Solution 9: Idomatic C++ solution hrbrmstr

Rcpp::cppFunction('int idiomaticrcpp_find(NumericVector needle, NumericVector haystack) {
  NumericVector::iterator it;
  it = std::search(haystack.begin(), haystack.end(), needle.begin(), needle.end());
  int pos = it - haystack.begin() + 1;
  if (pos > haystack.size()) pos = -1;
  return(pos);
}')

idiomaticrcpp_find(needle, haystack)

## [1] 931

Solution 10: Rcpp boyer-moore jimhester

Boyer-Moore implementation in Rcpp by jimhester

# https://gist.github.com/jimhester/2d02d63d0459d9fdadcf15e08c3afc0a
Rcpp::sourceCpp("cache/2018-04-03-boyer-moore.cpp")
boyermoorecpp_find(needle, haystack)

## [1] 931

`needle` is in the middle of the `haystack`

Table 1: `needle` in middle of `haystack`. microbenchmark() timings for 1000 runs - times are in microseconds
expr	min	lq	mean	median	uq	max	neval
rollapply_find(needle, haystack)	64622.974	67946.554	78132.17233	80610.9880	83993.7715	261446.362	1000
vapply_find(needle, haystack)	15304.346	16611.288	20492.53295	17070.3485	18881.6200	47565.512	1000
lead_find(needle, haystack)	6057.255	7833.419	14920.53701	8950.0515	18850.4280	199884.346	1000
shift_find(needle, haystack)	2809.894	3108.799	6250.15134	3670.6565	4361.3965	183883.048	1000
forloop_find(needle, haystack)	4768.181	5269.440	6790.98958	5457.1060	5717.7130	29957.481	1000
grep_find(needle, haystack)	3844.324	4002.460	4402.40560	4065.0250	4173.2350	184611.679	1000
pile_find(needle, haystack)	86.056	97.890	289.28966	102.9325	109.6500	21766.544	1000
sieved_find(needle, haystack)	41.337	53.283	75.26908	57.4095	63.0490	7256.632	1000
rcpp_find(needle, haystack)	16.063	22.517	38.00685	27.7005	31.9475	5173.280	1000
idiomaticrcpp_find(needle, haystack)	12.401	20.561	98.40539	25.3500	29.7495	20635.513	1000
boyermoorecpp_find(needle, haystack)	13.654	23.306	86.19733	28.1015	32.6330	23749.599	1000

`needle` is very close to beginning of `haystack`

The for loop and rcpp are the only methods to benefit from a needle that is thought to be at the beginning of the haystack i.e. they can both return early as soon as they find match.

When the needle is close to the beginning of the haystack, then the searching time is very short, and the for loop is quite competitive with custom Rcpp code.

Table 2: `needle` near start of `haystack`. microbenchmark() timings for 1000 runs - times are in microseconds
expr	min	lq	mean	median	uq	max	neval
pile_find(needle, haystack)	71.184	126.9605	177.69453	138.3465	163.8710	13695.14	1000
sieved_find(needle, haystack)	46.268	69.6190	83.41948	76.2080	89.6610	223.01	1000
forloop_find(needle, haystack)	23.804	30.7580	53.60288	34.7525	45.3105	11729.11	1000
rcpp_find(needle, haystack)	14.483	47.9385	72.01494	53.3860	62.0890	12677.34	1000
boyermoorecpp_find(needle, haystack)	16.315	51.0210	97.70492	56.7190	67.5945	13618.62	1000

Conclusion

Summary

Rcpp is the fastest way. No real difference between my C-style of C++ and idiomatic C++
Boyer-Moore in Rcpp no faster than a vanilla solution for current benchmark sizes
For a pure R solution, the sieve_find() method is fastest and very competitive with Rcpp

Finding a length n needle in a haystack

Problem: Find index of a short vector (needle) inside a larger vector (haystack)

Solution 1: for loop

Solution 2: vapply

Solution 3: zoo::rollapply

Solution 4: stacked lead() finding

Solution 5: shift

Solution 5: Rcpp for loop

Solution 6: pile find (by dmytro)