I need to read some foreign data formats into R data structures for manipulation. If the data was csv or yaml or json then I’d just use a package that someone else had written to read in the data. In my case, the data format doesn’t have its own package, so I need to write some parsing code from scratch.

To properly parse text into a structured format I’m going to need a lexer (or tokenizer) and a parser. The following succinct definitions are from StackOverflow:

  • A tokenizer breaks a stream of text into tokens
  • A lexer is basically a tokenizer, but it usually attaches extra context to the tokens – this token is a number, that token is a string literal, this other token is an equality operator.
  • A parser takes the stream of tokens from the lexer and turns it into a (hopefully) tidy data structure that can be manipulated programmatically. For parsing computer programs, the final data structure is most likely an abstract syntax tree, but for parsing data, the output can be whatever data structure makes sense.
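To make the three roles concrete, here is a minimal sketch in base R only (no minilexer). The `key = value` input, the pattern names and the `token_type()` helper are all made up for illustration, and the pattern matching is deliberately naive.

```r
# Hypothetical input: a single line of 'key = value' data
text <- "width = 42"

# Token types and the regular expression for each (illustrative only)
patterns <- c(
  number     = "[0-9]+",
  name       = "[A-Za-z_]+",
  equals     = "=",
  whitespace = "[ \t]+"
)

# Tokenizer: split the text into pieces that match any of the patterns
combined <- paste(patterns, collapse = "|")
tokens   <- regmatches(text, gregexpr(combined, text))[[1]]

# Lexer: label each token with the first pattern that matches it in full
token_type <- function(token) {
  for (type in names(patterns)) {
    if (grepl(paste0("^(", patterns[[type]], ")$"), token)) return(type)
  }
  NA_character_
}
names(tokens) <- vapply(tokens, token_type, character(1))

# Parser: drop whitespace, check the token sequence, build a named list
tokens <- tokens[names(tokens) != "whitespace"]
stopifnot(identical(names(tokens), c("name", "equals", "number")))
result <- setNames(list(as.numeric(tokens[["number"]])), tokens[["name"]])
result
```

The parser step here only understands one shape of input (`name`, `equals`, `number`); a real parser would walk the tokens one at a time and decide what to do based on each token's type.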

In this post I’ll show some example usage of a new package called minilexer which can be used to help write parsers for simple text formats.

In future posts I’ll show how to write parsers which actually do something interesting with this package.

Introducing the minilexer package

minilexer provides some tools for simple tokenising/lexing and parsing of text files.

I will emphasise the mini in minilexer as this is not a rigorous or formally complete lexer, but it suits 90% of my needs for turning data text formats into tokens.

For complicated parsing (especially of computer programs) you’ll probably want to use the more formally correct lexing/parsing provided by the rly package or the dparser package.



Package Overview

Currently the package provides one function and one R6 class:

  • minilexer::lex(text, patterns) for splitting the text into tokens.
    • This function uses the user-defined regular expressions (patterns) to split text into a character vector of tokens.
    • The patterns argument is a named vector of character strings representing regular expressions for elements to match within the text.
  • minilexer::TokenStream is a class to handle manipulation/interrogation of the stream of tokens to make it easier to write parsers.

Example: Use lex() to split a sentence into tokens

sentence_patterns <- c(
  word        = "\\w+",
  whitespace  = "\\s+",
  fullstop    = "\\.",
  comma       = "\\,"
)

sentence <- "Hello there, Rstats."

lex(sentence, sentence_patterns)
##       word whitespace       word      comma whitespace       word 
##    "Hello"        " "    "there"        ","        " "   "Rstats" 
##   fullstop 
##        "."
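Because the result is a plain named character vector, ordinary base R subsetting is enough to filter tokens by type. The sketch below writes the vector shown above out by hand so that it runs without minilexer installed; nothing in it is part of the package's API.

```r
# The named character vector from the output above, written out by hand
tokens <- c(
  word = "Hello", whitespace = " ", word = "there", comma = ",",
  whitespace = " ", word = "Rstats", fullstop = "."
)

# Drop every whitespace token, keeping the type labels on the rest
no_space <- tokens[names(tokens) != "whitespace"]
no_space

# Keep only the words and discard the type labels
words <- unname(tokens[names(tokens) == "word"])
words

# Count how many tokens of each type were found
table(names(tokens))
```

Dropping whitespace up front like this is often the first thing a parser does, since whitespace tokens rarely carry meaning in simple data formats.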

Example: Use lex() to split some simplified R code into tokens

R_patterns <- c(
  number      = "-?\\d*\\.?\\d+",
  name        = "\\w+",
  equals      = "==",
  assign      = "<-|=",
  plus        = "\\+",
  lbracket    = "\\(",
  rbracket    = "\\)",
  newline     = "\n",
  whitespace  = "\\s+"
)

R_code <- "x <- 3 + 4.2 + rnorm(1)"

R_tokens <- lex(R_code, R_patterns)
R_tokens
##       name whitespace     assign whitespace     number whitespace 
##        "x"        " "       "<-"        " "        "3"        " " 
##       plus whitespace     number whitespace       plus whitespace 
##        "+"        " "      "4.2"        " "        "+"        " " 
##       name   lbracket     number   rbracket 
##    "rnorm"        "("        "1"        ")"
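Note that in R_patterns the equals pattern (`==`) is listed before the assign pattern (`<-|=`). The sketch below uses only base R regular expressions, not minilexer, to show why the order of overlapping alternatives can matter. Whether lex() resolves overlapping patterns in exactly this way is an assumption that should be checked against the package itself.

```r
code <- "a == b"

# With '=' tried first, '==' is split into two separate '=' matches
loose <- regmatches(code, gregexpr("=|==", code, perl = TRUE))[[1]]
loose

# With '==' tried first, it is matched as a single token
strict <- regmatches(code, gregexpr("==|=", code, perl = TRUE))[[1]]
strict
```

The same reasoning applies to number and name: `\w+` also matches digits, so a more specific pattern generally wants to sit ahead of a more general one.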

Example: Use TokenStream to interrogate/manipulate the tokens

The TokenStream class is a way of manipulating a stream of tokens to make it easier(*) to write parsers. It is a way of keeping track of which token we are currently looking at, and making assertions about the current token’s value and type.

In the following examples, I’ll be using the R_tokens I extracted above.

# create the stream to handle the tokens
stream <- minilexer::TokenStream$new(R_tokens)

# What position are we at?
## [1] 1
# Assert that the first token is a name and has the value 'x'

# Show what happens if the current token isn't what we expect
## Error: Expected ["+"] at position 1 but found [name]: "x"
# Try and consume this token and move onto the next one, but
# because the 'type' is incorrect, will result in failure
## Error: Expected ["number"] at position 1 but found [name]: "x"
# Unconditionally consume this token without regard to 
# its value or type. This returns the value at the 
# current position, and then increments the position
## [1] "x"
# Stream position should have moved on to the second value
## [1] 2
# Get the current value, but without advancing the position
## [1] " "
# consume it. i.e. return current value and increment position
## [1] " "
# Stream position should have moved on to the third value
## [1] 3
# Get the current value
## [1] "<-"
## [1] "assign"
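To illustrate the idea of tracking a position and making assertions about the current token, here is a minimal closure-based sketch in base R. It is not the minilexer::TokenStream class or its API: `make_stream()` and every function it returns are hypothetical names invented for this example.

```r
# Hypothetical helper, NOT the minilexer::TokenStream API: just the idea
make_stream <- function(tokens) {
  position <- 1L

  current_value <- function() unname(tokens[position])
  current_type  <- function() names(tokens)[position]

  # Stop with an error unless the current token has the expected type
  expect_type <- function(type) {
    if (!identical(current_type(), type)) {
      stop(sprintf("Expected [%s] at position %d but found [%s]: \"%s\"",
                   type, position, current_type(), current_value()),
           call. = FALSE)
    }
    invisible(TRUE)
  }

  # Return the current value and advance; optionally check the type first
  consume <- function(type = NULL) {
    if (!is.null(type)) expect_type(type)
    value <- current_value()
    position <<- position + 1L
    value
  }

  list(
    position      = function() position,
    current_value = current_value,
    current_type  = current_type,
    expect_type   = expect_type,
    consume       = consume
  )
}

# The first three tokens of the R code example, written out by hand
s <- make_stream(c(name = "x", whitespace = " ", assign = "<-"))
s$consume("name")   # checks the type, returns the value, advances
s$consume()         # unconditional consume
s$current_type()    # type of the token now under the cursor
```

A failed check stops before the position is incremented, so the stream stays on the offending token, which makes error messages easy to relate back to the input.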