Shortest unique prefix - second attempt

Shortest unique prefix - hack part #2

  • I want to find the shortest unique prefix for each string in a vector (N < 20)
    • i.e. in a vector of strings, what’s the shortest a string can be and still match only itself
  • This isn’t a computational bottleneck in the overall process, so there’s no need for heroic/drastic solutions.
  • I wasn’t going to write a trie implementation just to solve this problem.

Yesterday’s solution got a bit ugly.

Today I remembered there’s always the function you need in base R, and found pmatch() and charmatch() which do partial string matching.

charmatch seeks matches for the elements of its first argument among those of its second. If there is a single exact match or a unique partial match, then the index of the matching value is returned.

charmatch does all the heavy lifting of matching a full set of prefixes against the full set of strings. Since the creation of all the prefixes is done from shortest to longest, then the first prefix that matches only a single string is the shortest unique prefix for that string.


# Shortest Unique Prefix
# - Number of strings is always < 20
# - 'strings' input does not contain duplicates
# - this isn't a computational bottleneck in the process, so any neat solution
#   is acceptable
find_shortest_unique_prefix <- function(strings) {
  # Calculate all possible prefixes of all strings
  all_prefixes <- strings %>% 
    map(str_sub, end=seq(max(nchar(strings)))) %>%
  # find 'charmatch'es
  matches <- charmatch(all_prefixes, strings) 
  idx     <- !duplicated(matches) & matches!=0
find_shortest_unique_prefix(c("ab", "apple", "apart", 'b', 'ag'))
[1] "ab"  "app" "apa" "b"   "ag" 
find_shortest_unique_prefix(c('blue', 'black', 'bold'))
[1] "blu" "bla" "bo"