# flexo

`flexo` provides tools for simple tokenising/lexing/parsing of text files.

`flexo` aims to be useful in getting otherwise unsupported text data formats into R.
## What's in the box

* `lex(text, regexes)` - break a text string into tokens using the supplied regular expressions.
* `TokenStream` - an R6 class for manipulating a stream of tokens; a first step for parsing the data into a more useful format.
* `create_stream()` - a base R version of `TokenStream` which uses environments directly.

## Installation

You can install `flexo` from [GitHub](https://github.com/coolbutuseless/flexo) with:
```r
# install.packages('remotes')
remotes::install_github('coolbutuseless/flexo', ref = 'main')
```
## Using `flexo`

1. Define a set of regular expressions (`regexes`) that define the tokens in the data.
2. Use `lex()` to apply these regexes and split the data into tokens, i.e. `lex(text, regexes)`.
3. `lex()` returns a named character vector of tokens. The name of each token corresponds to the regex which captured it.
4. Optionally, use the `TokenStream` R6 class (or its base R counterpart, `create_stream(tokens)`) to aid in manipulating the raw tokens into more structured data.
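As a minimal illustration of the call shape (the regexes and input here are invented for this sketch, not taken from the package docs):

```r
library(flexo)

# Split a trivial assignment into named tokens
lex("a=1", c(name = "[a-z]+", equals = "=", number = "\\d+"))
```

```
## name equals number 
##  "a"    "="    "1" 
```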
I often do not import the `flexo` package in projects, but instead copy the `lex.R` and `stream.R` files into the new package. This avoids having a non-CRAN dependency in an otherwise simple project.
For complicated parsing (e.g. programming languages) you'll want to use the more formally correct lexing/parsing provided by the `rly` package or the `dparser` package.
Vignettes are available to read online.
## Example: using `lex()` to split a sentence into tokens
```r
sentence_regexes <- c(
  word       = "\\w+",
  whitespace = "\\s+",
  fullstop   = "\\.",
  comma      = ","
)

sentence <- "Hello there, Rstats."

flexo::lex(sentence, sentence_regexes)
```

```
##    word whitespace    word comma whitespace     word fullstop 
## "Hello"        " " "there"   ","        " " "Rstats"      "." 
```
## Example: using `lex()` to split some simplified R code into tokens
```r
R_regexes <- c(
  number     = "-?\\d*\\.?\\d+",
  name       = "\\w+",
  equals     = "==",
  assign     = "<-|=",
  plus       = "\\+",
  lbracket   = "\\(",
  rbracket   = "\\)",
  newline    = "\n",
  whitespace = "\\s+"
)

R_code <- "x <- 3 + 4.2 + rnorm(1)"

R_tokens <- flexo::lex(R_code, R_regexes)
R_tokens
```

```
## name whitespace assign whitespace number whitespace plus 
##  "x"        " "   "<-"        " "    "3"        " "  "+" 
## whitespace number whitespace plus whitespace    name lbracket 
##        " "  "4.2"        " "  "+"        " " "rnorm"      "(" 
## number rbracket 
##    "1"      ")" 
```
## Example: using `lex()` with `TokenStream`

Once `lex()` has been used to create the separate tokens, the next step is to interpret the token sequence into something much more structured, e.g. a data.frame or matrix.

The example below shows `flexo` being used to parse a hypothetical tic-tac-toe game format into a matrix.
```r
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# A tough game between myself and Kasparov (my pet rock)
# The comment line denotes who the game was between
# Each square is marked with an 'X' or 'O'
# After each X and O is a number indicating the order in which the mark
# appeared on the board.
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
game <- "
# Kasparov(X) vs Coolbutuseless(O)
X2 | O1 | O5
O3 | X4 | X6
X7 | O8 | X9
"
```
```r
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Define all the regexes to split the game into tokens
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
game_regexes <- c(
  comment    = "(#.*?)\n",
  whitespace = "\\s+",
  sep        = "\\|",
  mark       = "X|O",
  order      = flexo::re$number   # a pre-defined number regex shipped with flexo
)

#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Use flexo::lex() to break the game into tokens with these regexes
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
tokens <- flexo::lex(game, game_regexes)
```
```r
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Remove some tokens that don't contain actual information
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
tokens <- tokens[!names(tokens) %in% c('whitespace', 'comment', 'sep')]
tokens
```

```
## mark order mark order mark order mark order mark order mark order mark 
##  "X"   "2"  "O"   "1"  "O"   "5"  "O"   "3"  "X"   "4"  "X"   "6"  "X" 
## order mark order mark order 
##   "7"  "O"   "8"  "X"   "9" 
```
```r
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Create a TokenStream object to help in manipulating the tokens.
# Obviously there are easier ways to do this for such a simple example, but
# the code below is hopefully illustrative of the technique.
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
stream <- flexo::TokenStream$new(tokens)

mark  <- c()
order <- c()

#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Keep processing the tokens until done.
# The tokens should exist in pairs of 'mark' and 'order', so assert that pairing.
# Consume and store the values from the stream into the 'mark' and 'order' vectors.
# Bind all the information into a matrix.
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
while (!stream$end_of_stream()) {
  stream$assert_name_seq(c('mark', 'order'))
  mark  <- c(mark,  stream$consume(1))
  order <- c(order, stream$consume(1))
}
```
```r
cbind(mark, order)
```

```
##      mark order
## mark "X"  "2" 
## mark "O"  "1" 
## mark "O"  "5" 
## mark "O"  "3" 
## mark "X"  "4" 
## mark "X"  "6" 
## mark "X"  "7" 
## mark "O"  "8" 
## mark "X"  "9" 
```
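As the comment above concedes, there are easier ways for data this regular. One alternative (my own sketch, not from the package docs) skips the stream entirely and reshapes the alternating tokens with base R:

```r
# 18 alternating tokens -> 9 rows of (mark, order)
matrix(tokens, ncol = 2, byrow = TRUE, dimnames = list(NULL, c("mark", "order")))
```

```
##       mark order
##  [1,] "X"  "2" 
##  [2,] "O"  "1" 
##  [3,] "O"  "5" 
##  [4,] "O"  "3" 
##  [5,] "X"  "4" 
##  [6,] "X"  "6" 
##  [7,] "X"  "7" 
##  [8,] "O"  "8" 
##  [9,] "X"  "9" 
```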