Find sets of five words in the wordle dictionary that contain 25 unique letters

While thinking about wordle I wondered if there existed any sets of five words in the wordle dictionary that would cover 25 unique letters of the alphabet.

Given there are roughly twelve thousand words in the wordle dictionary, a brute-force search would have to consider about 12000^5 five-word sequences. This is approx 2.5x10^20 - which means that if I could test a million sequences every second it would take well over a million years.
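
As a quick sanity check on those numbers (assuming a rate of one million tests per second):

12000^5
#> [1] 2.48832e+20
12000^5 / 1e6 / (60 * 60 * 24 * 365)   # years of compute at one million tests per second
#> [1] 7890411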

So I thought about how I might practically do this with recursion and vicious pruning of candidates in the search tree.

All in base R!

General solution process

  • Develop data structures which are easy for R to compute on and help with pruning candidate words
  • At each step of choosing a new word, quickly prune all the words that could no longer be part of a possible 5-word set
    • e.g. if abers is the first word, then the pool of remaining words to choose from should immediately discard any word with an a, b, e, r or s (a rough sketch of this pruning follows after this list).
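
As a rough sketch of that pruning step (this uses words, the dictionary vector loaded in the next section, and is for illustration only - the real search uses the pre-built indices described below):

# Sketch only: assumes `words` holds the wordle dictionary (loaded below).
# Any word containing a, b, e, r or s can never appear in the same
# 25-unique-letter set as 'abers', so all of them can be discarded at once.
pool <- words[!grepl("[abers]", words)]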

Overview of solution technique

  • Remove all words with repeated letters.
  • Build a list of the letters in each word: the letters_in_word data structure.
  • Build letter membership indices: the words_with_letter data structure.
  • Recursive searching (with early termination) with vicious pruning of the remaining candidate words at each stage.

Start with the wordle dictionary

The wordle dictionary contains 12972 five-letter words.

words <- wordle::wordle_dict
length(words)
#> [1] 12972
head(words)
#> [1] "aahed" "aalii" "aargh" "aarti" "abaca" "abaci"

Remove words with repeated letters

If I need five words covering 25 letters, then no word should have two of the same letter.

Once these are filtered out, 8322 words remain.

#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#' Determine if a word has any repeated letters
#' @param words character vector of words
#' @return logical vector
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
has_doubles <- function(words) {
  vapply(strsplit(words, ''), function(lets) {
    any(table(lets) > 1)
  }, logical(1))
}

words <- words[!has_doubles(words)]
length(words)
#> [1] 8322
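
A quick spot check on a few example strings:

has_doubles(c("panic", "apple", "crane", "llama"))
#> [1] FALSE  TRUE FALSE  TRUE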

Pre-calc strsplit

Pre-calculate a data structure with the letters in each word.

Letters are represented by their index within the alphabet (a = 1, b = 2, ..., z = 26).

letters_in_word <- lapply(words, function(word) {
  utf8ToInt(word) - 96L
})

test_idx <- which(words == 'panic')
letters_in_word[[test_idx]]
#> [1] 16  1 14  9  3
letters[letters_in_word[[test_idx]]]
#> [1] "p" "a" "n" "i" "c"

Build letter membership indices

Build a data structure mapping each letter of the alphabet to which words contain that letter.

#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Membership indices
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
words_with_letter <- list()
for (let in letters) {
  words_with_letter[[let]] <- grep(let, words)
}

For instance, we can now look up all the words which contain ‘q’.

words_with_letter[['q']]
#>  [1]  984 1268 1437 2196 2197 2198 2289 2418 2419 3099 4370 4870 5391 5723 5724
#> [16] 5725 5726 5727 5728 5729 5730 5731 5732 5733 5734 5735 5736 5737 5738 5739
#> [31] 5740 5741 5742 5743 5744 5745 5746 5747 5748 5749 5750 5751 5752 5753 5754
#> [46] 5755 5756 5757 5758 5759 5760 5761 5762 5763 5764 5765 5766 5767 5768 5769
#> [61] 5770 5771 5772 5773 5774 5775 5776 5777 5778 5779 5780 5781 5782 5783 5784
#> [76] 5785 5786 5787 6079 6798 6799 6800 6801 6802 6803 6804 6805 6806 7307 7359
#> [91] 7545 7896
words[words_with_letter[['q']]]
#>  [1] "burqa" "cinqs" "coqui" "equal" "equid" "equip" "faqir" "fiqhs" "fique"
#> [10] "guqin" "maqui" "niqab" "pique" "qadis" "qaids" "qapik" "qibla" "qophs"
#> [19] "qorma" "quack" "quads" "quags" "quail" "quair" "quais" "quake" "quaky"
#> [28] "quale" "qualm" "quant" "quare" "quark" "quart" "quash" "quasi" "quate"
#> [37] "quats" "quayd" "quays" "qubit" "quean" "quena" "quern" "query" "quest"
#> [46] "queyn" "queys" "quich" "quick" "quids" "quiet" "quilt" "quims" "quina"
#> [55] "quine" "quino" "quins" "quint" "quipo" "quips" "quire" "quirk" "quirt"
#> [64] "quist" "quite" "quits" "quoad" "quods" "quoif" "quoin" "quoit" "quonk"
#> [73] "quops" "quota" "quote" "quoth" "qursh" "quyte" "roque" "squab" "squad"
#> [82] "squat" "squaw" "squeg" "squib" "squid" "squit" "squiz" "toque" "tranq"
#> [91] "umiaq" "waqfs"

Pick a word and recursively build word sets

This function bails out early if the set of remaining candidate words in any branch of the search drops to zero.

If a five-word sequence is found then print it.

#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#' Recursively filter a set of words
#'
#' @param word_idx an index in [1, length(words)]
#' @param remaining_idx all other word indexes remaining at this stage
#' @param word_seq integer vector of the word indices up to this point
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
filter_words <- function(word_idx,
                         remaining_idx,
                         word_seq   = c()) {

  # In the remaining list of words, filter out any words that contain
  # any of the letters of the current word
  for (letter_idx in letters_in_word[[word_idx]]) {
    remaining_idx <- setdiff(remaining_idx, words_with_letter[[letter_idx]])
  }

  # Bail early if no words available after exclusion
  if (length(remaining_idx) == 0L) return()

  # Update the word sequence 
  word_seq <- c(word_seq, word_idx)

  if (length(word_seq) == 4) {
    # terminal condition
    for (j in remaining_idx) {
      cat(words[c(word_seq, j)], "\n")
    }
  } else {
    for (jj in seq_along(remaining_idx)) {
      filter_words(
        word_idx      = remaining_idx[[jj]],
        remaining_idx = remaining_idx[-seq(jj)],
        word_seq      = word_seq
      )
    }
  }

}

Example starting at jumby

Picking jumby as the first word and trying all word sets which start with this word takes about a minute on my old laptop.

However, different starting words will take different times because:

  • candidate words are only those which come alphabetically after the starting word, so starting words later in the dictionary are faster because fewer words remain as possible candidates in the search tree.
  • different starting words can more rapidly prune the search space, meaning that there are fewer possible candidate words in the search tree
i <- 3666  # 'jumby'
filter_words(
  word_idx = i,
  remaining_idx = seq.int(i+1L, length(words))
)
#> jumby pling treck vozhd waqfs

Search the entire space of five-word sets

I’ve run this sequentially in a for loop and it took about 8 hours to find the word sets.

for (i in seq_len(length(words) - 1L)) {  # last word has no later words to pair with
  filter_words(
    word_idx = i,
    remaining_idx = seq.int(i + 1L, length(words))
  )
}
bemix clunk grypt vozhd waqfs
bling jumpy treck vozhd waqfs
blunk cimex grypt vozhd waqfs
brick glent jumpy vozhd waqfs
brung cylix kempt vozhd waqfs
brung kempt vozhd waqfs xylic
chunk fjord gymps vibex waltz
clipt jumby kreng vozhd waqfs
fjord gucks nymph vibex waltz
glent jumby prick vozhd waqfs
jumby pling treck vozhd waqfs