Break a string into labelled tokens based upon a set of patterns

lex(text, regexes, verbose = FALSE, ...)

Arguments

text

a single character string

regexes

a named vector of regex strings. Each string is a regex that matches one kind of token, and the name of the string is the label for that token. Each regex can contain an explicit capture group using the standard () brackets. If a regex does not define a capture group, then the entire match is captured. The regexes are processed in order, so an earlier regex takes precedence over any later regex that could match the same text.

verbose

print more information about the matching process. default: FALSE

...

further arguments passed to stringi::stri_match_all(), e.g. multiline = TRUE

Value

a named character vector. Each name is the token type, and each value is the text extracted by the corresponding regular expression.

Examples

lex("hello there 123.45", regexes=c(number=re$number, word="(\\w+)", whitespace="(\\s+)"))
#>       word whitespace       word whitespace     number
#>    "hello"        " "    "there"        " "   "123.45"
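
The ordering and capture-group rules described for the regexes argument can be illustrated with a small base R sketch. This is not the implementation of lex(): sketch_lex() is a hypothetical helper, the number pattern is a simplified stand-in for re$number, and silently skipping unmatched characters is an assumption made only to keep the example short.

```r
# Hypothetical helper, NOT the package's lex(): a base R illustration of
# "first pattern wins" token labelling.
sketch_lex <- function(text, regexes) {
  tokens <- character(0)
  while (nchar(text) > 0) {
    matched <- FALSE
    for (label in names(regexes)) {
      # Anchor the pattern at the start of the remaining text
      rx <- paste0("^(?:", regexes[[label]], ")")
      m  <- regmatches(text, regexec(rx, text, perl = TRUE))[[1]]
      if (length(m) > 0 && nchar(m[1]) > 0) {
        # Use the first capture group if the pattern defines one,
        # otherwise the whole match
        value  <- if (length(m) > 1) m[2] else m[1]
        tokens <- c(tokens, setNames(value, label))
        text   <- substring(text, nchar(m[1]) + 1)
        matched <- TRUE
        break  # earlier patterns take precedence over later ones
      }
    }
    # Assumption for this sketch: skip a character that no pattern matches
    if (!matched) text <- substring(text, 2)
  }
  tokens
}

# Simplified stand-in for re$number (assumption)
number_rx <- "\\d+\\.?\\d*"

in_order <- sketch_lex(
  "hello there 123.45",
  c(number = number_rx, word = "(\\w+)", whitespace = "(\\s+)")
)

# With word listed first, "123" is claimed as a word before number is tried
swapped <- sketch_lex(
  "hello there 123.45",
  c(word = "(\\w+)", number = number_rx, whitespace = "(\\s+)")
)
```

In this sketch, in_order labels "123.45" as a single number token, while swapped splits it into the two word tokens "123" and "45". That is why the more specific pattern is listed first in the example call above.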