Shortest unique prefix - hack part #2
- I want to find the shortest unique prefix for each string in a vector (N < 20)
- i.e. in a vector of strings, what’s the shortest a string can be and still match only itself
- This isn’t a computational bottleneck in the overall process, so there’s no need for heroic/drastic solutions.
- I wasn’t going to write a trie implementation just to solve this problem.
Yesterday’s solution got a bit ugly.
Today I remembered there’s always the function you need in base R, and found pmatch()
and charmatch()
which
do partial string matching.
charmatch
seeks matches for the elements of its first argument among those of its second.
If there is a single exact match or a unique partial match, then the index of the matching value is returned.
charmatch
does all the heavy lifting of matching a full set of prefixes against the full set of strings. Since
the creation of all the prefixes is done from shortest to longest, then the first prefix that matches only a single
string is the shortest unique prefix for that string.
library(tidyverse)
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Shortest Unique Prefix
#
# - Number of strings is always < 20
# - 'strings' input does not contain duplicates
# - this isn't a computational bottleneck in the process, so any neat solution
# is acceptable
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
find_shortest_unique_prefix <- function(strings) {
stopifnot(!any(duplicated(strings)))
# Calculate all possible prefixes of all strings
all_prefixes <- strings %>%
map(str_sub, end=seq(max(nchar(strings)))) %>%
flatten_chr()
# find 'charmatch'es
matches <- charmatch(all_prefixes, strings)
idx <- !duplicated(matches) & matches!=0
all_prefixes[idx]
}
find_shortest_unique_prefix(c("ab", "apple", "apart", 'b', 'ag'))
[1] "ab" "app" "apa" "b" "ag"
find_shortest_unique_prefix(c('blue', 'black', 'bold'))
[1] "blu" "bla" "bo"