simpletrie

simpletrie is a simple trie implementation in plain R code.

A trie is a structure for representing certain data as a tree of nodes. This data representation allows for fast queries of certain types.

simpletrie can build a trie from 175k words in about 3s.

What’s in the box

trie_create() - create a trie, and optionally initialise with a dictionary of words
trie_insert() - insert words into a trie
trie_find_subwords() - find all sub-words in the trie using the characters in the given word

Installation

You can install from GitHub with:

# install.package('remotes')
remotes::install_github('coolbutuseless/simpletrie')

Example: Finding sub-words from a given word

My main use-case for simpletrie is for looking up scrabble words i.e. to find all possible sub-words of a given word with:

the characters in any order
being able to specify a wildcard to indicate that any letter can be used i.e. the . character

library(simpletrie)

#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Read the ENABLE word list
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
words <- tolower(readLines("enable1.txt"))
length(words)
[1] 172821

#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Create trie strcture
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
system.time({
  trie <- trie_create(words)
})
   user  system elapsed 
  1.762   0.049   1.820 

#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Find all words which can be made with 'hi.rstats'
# '.' is a wildcard, and can match any letter. (Like a blank tile in scrabble)
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
system.time({
  subwords <- trie_find_subwords(trie, 'hi.rstats')
})
   user  system elapsed 
  0.109   0.003   0.113 

length(subwords)
[1] 2304
head(subwords, 100)
  [1] "hi"       "hire"     "hires"    "his"      "hist"     "hists"   
  [7] "hists"    "hiss"     "hisser"   "hissy"    "hisn"     "hiss"    
 [13] "hist"     "hists"    "hit"      "hits"     "hitter"   "hitters" 
 [19] "hits"     "hiatus"   "hic"      "hid"      "hie"      "hies"    
 [25] "hilt"     "hilts"    "hila"     "hilar"    "him"      "hin"     
 [31] "hins"     "hint"     "hints"    "hip"      "hips"     "his"     
 [37] "hiss"     "hist"     "hists"    "hit"      "hits"     "ha"      
 [43] "hair"     "hairs"    "hairs"    "hairy"    "haik"     "haiks"   
 [49] "hail"     "hails"    "hair"     "hairs"    "harsh"    "hart"    
 [55] "harts"    "harts"    "harass"   "hard"     "hards"    "hare"    
 [61] "hares"    "hark"     "harks"    "harl"     "harls"    "harm"    
 [67] "harms"    "harp"     "harpist"  "harpists" "harps"    "hart"    
 [73] "harts"    "has"      "hast"     "hastier"  "hastiest" "haste"   
 [79] "hastes"   "hasty"    "hash"     "hasp"     "hasps"    "hast"    
 [85] "hat"      "hats"     "hatter"   "hatters"  "hate"     "hater"   
 [91] "haters"   "hates"    "hath"     "hats"     "haar"     "haars"   
 [97] "habit"    "habits"   "had"      "hadst"

What’s a trie?

A trie is an ordered data structure represented as a tree of nodes.

The image below shows a very small trie which represents the words: ‘cat’, ‘cog’, ‘dog’ and ‘dote.’

This representation allows for very fast queries of certain types.

Technical bits

The trie is implemented here as nested environments. Environments are created with new.env(parent = emptyenv(), hash = FALSE).

Using the emptyenv() as parent saves some memory, as the parent environment is never needed in this process.

hash is set to FALSE as using a hash table expands the memory usage by quite a bit - even if size is set to limit the initial hash table size.

Using a hash table on the environment doesn’t appreciably change the speed of trie creation or preforming look-ups.

Environments are used instead of lists as I make use of the reference semantics of environment objects when building the trie.

Acknowledgements

R Core for developing and maintaining such a wonderful language.
CRAN maintainers, for patiently shepherding packages onto CRAN and maintaining the repository

Introducing 'simpletrie' - a simple trie written in pure R (a.k.a. a tool for cheating at scrabble)