Simple Tools for Lexing/Parsing Text Data • flexo

flexo provides tools for simple tokenising/lexing/parsing of text files.

flexo aims to be useful in getting otherwise unsupported text data formats into R.

What’s in the box

lex(text, regexes) break a text string into tokens using the supplied regular expressions
TokenStream is an R6 class for manipulating a stream of tokens - a first step for parsing the data into a more useful format
create_stream() is a base R version of TokenStream which uses environments directly

Installation

You can install flexo from github coolbutuseless/flexo with

# install.packages('remotes')
remotes::install_github('coolbutuseless/flexo', ref='main')

Usage Overview

Define a set of regular expressions (regexes) that define the tokens in the data
Call lex() to use these regexes to split data into tokens i.e. lex(text, regexes)
lex() returns a named character vector of tokens. The names of the tokens correspond to the respective regex which captured it.
Optionally use the TokenStream R6 class to aid in the manipulation of the raw tokens into more structured data.
New in v0.2.5 use a base R environment-based token stream iniated with create_stream(tokens)

I often do not import the flexo package for projects, but instead copy the lex.R and stream.R files into the new package. This avoids having a non-CRAN dependency on an otherwise simple project.

Limitations

For complicated parsing (e.g. programming languages) you’ll want to use the more formally correct lexing/parsing provided by the rly package or the dparser package.

Vignettes

Vignettes are available to read online

Example: Using `lex()` to split sentence into tokens

sentence_regexes <- c(
  word        = "\\w+", 
  whitespace  = "\\s+",
  fullstop    = "\\.",
  comma       = ","
)

sentence = "Hello there, Rstats."

flexo::lex(sentence, sentence_regexes)

##       word whitespace       word      comma whitespace       word   fullstop 
##    "Hello"        " "    "there"        ","        " "   "Rstats"        "."

Example: Using `lex()` to split some simplified R code into tokens

R_regexes <- c(
  number      = "-?\\d*\\.?\\d+",
  name        = "\\w+",
  equals      = "==",
  assign      = "<-|=",
  plus        = "\\+",
  lbracket    = "\\(",
  rbracket    = "\\)",
  newline     = "\n",
  whitespace  = "\\s+"
)

R_code <- "x <- 3 + 4.2 + rnorm(1)"

R_tokens <- flexo::lex(R_code, R_regexes)
R_tokens

##       name whitespace     assign whitespace     number whitespace       plus 
##        "x"        " "       "<-"        " "        "3"        " "        "+" 
## whitespace     number whitespace       plus whitespace       name   lbracket 
##        " "      "4.2"        " "        "+"        " "    "rnorm"        "(" 
##     number   rbracket 
##        "1"        ")"

Example: Using `lex()` with `TokenStream`

Once lex() is used to create the separate tokens, the next step is to intepret the token sequence into something much more structured e.g. a data.frame or matrix.

The example below shows flexo being used to to parse a hypothetical tic-tac-toe game format into a matrix.

#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# A tough game between myself and Kasparov (my pet rock)
# The comment line denotes who the game was between
# Each square is marked with an 'X' or 'O'
# After each X and O is a number indicating the order in which the mark
# appeared on the board.
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
game <- "
# Kasparov(X) vs Coolbutuseless(O)
  X2 | O1 | O5 
  O3 | X4 | X6
  X7 | O8 | X9
"

#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Define all the regexes to split the game into tokens
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
game_regexes <- c(
  comment     = "(#.*?)\n", 
  whitespace  = "\\s+",
  sep         = "\\|",
  mark        = "X|O",
  order       = flexo::re$number
)

#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Use flexo::lex() to break game into tokens with these regexes
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
tokens <- flexo::lex(game, game_regexes)

#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Remove some tokens that don't contain actual information
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
tokens <- tokens[!names(tokens) %in% c('whitespace', 'comment', 'sep')]

tokens

##  mark order  mark order  mark order  mark order  mark order  mark order  mark 
##   "X"   "2"   "O"   "1"   "O"   "5"   "O"   "3"   "X"   "4"   "X"   "6"   "X" 
## order  mark order  mark order 
##   "7"   "O"   "8"   "X"   "9"

#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Create a TokenStream object for help in manipulating the tokens
# Obviously there are easier ways to do this on such a simple example, but 
# the below example is hopefully illustrative of the technique
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
stream <- TokenStream$new(tokens)

mark  <- c()
order <- c()

#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Keep processing all the tokens until done.  
# The tokens should exist in pairs of 'mark' and 'order', so assert that pairing
# Consume and store the values from the stream into 'mark' and 'order' vectors
# Bind all the information into a matrix
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
while (!stream$end_of_stream()) {
  stream$assert_name_seq(c('mark', 'order'))
  mark  <- c(mark , stream$consume(1))
  order <- c(order, stream$consume(1))
}

cbind(mark, order)

##      mark order
## mark "X"  "2"  
## mark "O"  "1"  
## mark "O"  "5"  
## mark "O"  "3"  
## mark "X"  "4"  
## mark "X"  "6"  
## mark "X"  "7"  
## mark "O"  "8"  
## mark "X"  "9"

flexo: Simple Lex/Parse Tools in R

What’s in the box

Installation

Usage Overview

Limitations

Vignettes

Example: Using `lex()` to split sentence into tokens

Example: Using `lex()` to split some simplified R code into tokens

Example: Using `lex()` with `TokenStream`

Links

License

Developers

Dev status

flexo: Simple Lex/Parse Tools in R

What’s in the box

Installation

Usage Overview

Limitations

Vignettes

Example: Using lex() to split sentence into tokens

Example: Using lex() to split some simplified R code into tokens

Example: Using lex() with TokenStream

Links

License

Developers

Dev status

Example: Using `lex()` to split sentence into tokens

Example: Using `lex()` to split some simplified R code into tokens

Example: Using `lex()` with `TokenStream`