# flexo: Simple Lex/Parse Tools in R

`flexo` provides tools for simple tokenising/lexing/parsing of text files, and aims to be useful in getting otherwise unsupported text data formats into R.

For complicated parsing (e.g. programming languages) you'll want to use the more formally correct lexing/parsing provided by the `rly` package or the `dparser` package.

`flexo` is a replacement for `minilexer` (which should now be considered abandoned).
What’s in the box
lex(text, regexes)
break a text string into tokens using the supplied regular expressionsTokenStream
is an R6 class for manipulating a stream of tokens - a first step for parsing the data into a more useful format
## Installation

You can install `flexo` from GitHub (`coolbutuseless/flexo`) with:

``` r
# install.packages('remotes')
remotes::install_github('coolbutuseless/flexo', ref = 'main')
```
## Usage Overview

* Define a set of regular expressions (`regexes`) that define the tokens in the data.
* Call `lex()` to use these `regexes` to split data into tokens, i.e. `lex(text, regexes)`.
* `lex()` returns a named character vector of tokens. The name of each token corresponds to the regex which captured it (see the sketch below).
* Optionally use the `TokenStream` R6 class to aid in the manipulation of the raw tokens into more structured data.
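A minimal sketch of this naming behaviour (the regexes here are illustrative only, not part of the package):

``` r
regexes <- c(num = "\\d+", word = "[a-zA-Z]+", ws = "\\s+")
tokens  <- flexo::lex("abc 123", regexes)

# Each token is named after the regex that captured it
names(tokens)    # expected: "word" "ws" "num"
unname(tokens)   # expected: "abc"  " "  "123"
```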
## Example: Using `lex()` to split a sentence into tokens

``` r
sentence_regexes <- c(
  word       = "\\w+",
  whitespace = "\\s+",
  fullstop   = "\\.",
  comma      = ","
)

sentence <- "Hello there, Rstats."

flexo::lex(sentence, sentence_regexes)
```

```
      word whitespace       word      comma whitespace       word   fullstop 
   "Hello"        " "    "there"        ","        " "   "Rstats"        "." 
```
## Example: Using `lex()` to split some simplified R code into tokens

``` r
R_regexes <- c(
  number     = "-?\\d*\\.?\\d+",
  name       = "\\w+",
  equals     = "==",
  assign     = "<-|=",
  plus       = "\\+",
  lbracket   = "\\(",
  rbracket   = "\\)",
  newline    = "\n",
  whitespace = "\\s+"
)

R_code <- "x <- 3 + 4.2 + rnorm(1)"

R_tokens <- flexo::lex(R_code, R_regexes)
R_tokens
```

```
      name whitespace     assign whitespace     number whitespace       plus 
       "x"        " "       "<-"        " "        "3"        " "        "+" 
whitespace     number whitespace       plus whitespace       name   lbracket 
       " "      "4.2"        " "        "+"        " "    "rnorm"        "(" 
    number   rbracket 
       "1"        ")" 
```
## Example: Using `lex()` with `TokenStream`

Once `lex()` has been used to create the separate tokens, the next step is to interpret the token sequence into something much more structured, e.g. a data.frame or matrix.

The example below shows `flexo` being used to parse a hypothetical tic-tac-toe game format into a matrix.
``` r
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# A tough game between myself and Kasparov (my pet rock)
# The comment line denotes who the game was between
# Each square is marked with an 'X' or 'O'
# After each X and O is a number indicating the order in which the mark
# appeared on the board.
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
game <- "
# Kasparov(X) vs Coolbutuseless(O)
X2 | O1 | O5
O3 | X4 | X6
X7 | O8 | X9
"

#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Define all the regexes to split the game into tokens
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
game_regexes <- c(
  comment    = "(#.*?)\n",
  whitespace = "\\s+",
  sep        = "\\|",
  mark       = "X|O",
  order      = flexo::regex$number
)

#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Use flexo::lex() to break the game into tokens with these regexes
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
tokens <- flexo::lex(game, game_regexes)

#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Remove some tokens that don't contain actual information
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
tokens <- tokens[!names(tokens) %in% c('whitespace', 'comment', 'sep')]
tokens
```

```
 mark order  mark order  mark order  mark order  mark order  mark order  mark 
  "X"   "2"   "O"   "1"   "O"   "5"   "O"   "3"   "X"   "4"   "X"   "6"   "X" 
order  mark order  mark order 
  "7"   "O"   "8"   "X"   "9" 
```
``` r
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Create a TokenStream object to help in manipulating the tokens.
# Obviously there are easier ways to do this for such a simple example, but
# the code below is hopefully illustrative of the technique.
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
stream <- flexo::TokenStream$new(tokens)

mark  <- c()
order <- c()

#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Keep processing the tokens until done.
# The tokens should exist in pairs of 'mark' and 'order', so assert that
# pairing, then consume and store the values from the stream into the
# 'mark' and 'order' vectors. Finally, bind all the information into a matrix.
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
while (!stream$end_of_stream()) {
  stream$assert_name_seq(c('mark', 'order'))
  mark  <- c(mark , stream$consume(1))
  order <- c(order, stream$consume(1))
}

cbind(mark, order)
```
```
     mark order
mark "X"  "2" 
mark "O"  "1" 
mark "O"  "5" 
mark "O"  "3" 
mark "X"  "4" 
mark "X"  "6" 
mark "X"  "7" 
mark "O"  "8" 
mark "X"  "9" 
```