Parsing Chess PGN files in RStats

The Queen’s Gambit - Post 3

Inspired by The Queen’s Gambit on Netflix, I’m doing a few posts on Chess in R.

This screenshot from the show explains everything:

[Screenshot from The Queen’s Gambit]

Chess game format: pgn

The PGN (Portable Game Notation) file format is a human-readable, plain-text representation of a chess game.

In its most basic form, it consists of

  • a sequence of tags (metadata such as the event, date and players) surrounded by []
  • a sequence of numbers and events representing the moves taken by the players
  • comments, which can be interspersed between/within the moves and are surrounded by {}
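
As a quick illustration of these three elements, here is a rough base R sketch that pulls them apart with regular expressions. The `mini_pgn` string and all variable names are made up for this example, and this is not the approach used in the rest of the post.

```r
# A tiny, made-up PGN fragment used only for illustration
mini_pgn <- '[White "Alexander Alekhine"] [Result "0-1"] 1. e4 {A comment.} 1... e5 0-1'

# Tags: a name followed by a quoted value, inside square brackets
tag_regex   <- '\\[(\\w+) "([^"]*)"\\]'
tag_matches <- regmatches(mini_pgn, gregexpr(tag_regex, mini_pgn, perl = TRUE))[[1]]
tag_df <- data.frame(
  Tag   = sub(tag_regex, '\\1', tag_matches, perl = TRUE),
  Value = sub(tag_regex, '\\2', tag_matches, perl = TRUE),
  stringsAsFactors = FALSE
)

# Comments: anything between curly braces
comments <- regmatches(mini_pgn, gregexpr('\\{[^}]*\\}', mini_pgn, perl = TRUE))[[1]]

# Movetext: whatever is left once tags and comments are removed
movetext <- gsub('\\[[^]]*\\]|\\{[^}]*\\}', '', mini_pgn, perl = TRUE)
movetext <- trimws(gsub('\\s+', ' ', movetext, perl = TRUE))

print(tag_df)
print(comments)
print(movetext)
```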

An example PGN file is shown below:

alekhine_pgn <- r'{[Event "Vilnius All-Russian Masters"]
[Site "Vilna (Vilnius) RUE"]
[Date "1912.08.23"]
[EventDate "1912.08.19"]
[Round "5"]
[Result "0-1"]
[White "Alexander Alekhine"]
[Black "Akiba Rubinstein"]
[ECO "C83"]
[WhiteElo "?"]
[BlackElo "?"]
[PlyCount "54"]

1. e4 {Notes by Dr. Savielly Tartakower.} 1... e5 2. Nf3 Nc6 3. Bb5 a6 4. Ba4
Nf6 5. O-O Nxe4 6. d4 b5 7. Bb3 d5 8. dxe5 Be6 9. c3 Be7 10. Nbd2 Nc5 11. Bc2
Bg4 12. h3 {The most reasonable course here is 12.Re1, guarding the e-pawn.}
12... Bh5 13. Qe1 $6 {Here again 13. Re1 ensured a very good game for White.}
13... Ne6 14. Nh2 $6 Bg6 $1 15. Bxg6 fxg6 {! Far seeing strategy! Black
recognizes that the f-file and not the e-file will be needed as a base for
action.} 16. Nb3 {Or 16.f4 d4!.} 16... g5 $1 17. Be3 O-O 18. Nf3 Qd7 19. Qd2
{White pays insufficient attention to the scope of his opponent's threats. A
better course is 19.Nfd4 (19...Nxe5 20.Bxg5) seeking to establish equality.}
19... Rxf3 $1 20. gxf3 Nxe5 21. Qe2 Rf8 22. Nd2 Ng6 23. Rfe1 Bd6 24. f4 Nexf4
25. Qf1 Nxh3+ 26. Kh1 g4 27. Qe2 Qf5 0-1}'
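
The `$1` and `$6` markers in the movetext are Numeric Annotation Glyphs (NAGs), which stand in for the usual annotation symbols. As a small sketch of how they could be decoded, the lookup below covers the first six codes from the PGN standard. The `nag_symbols` and `fragment` names are my own, and the fragment is a shortened piece of the game above.

```r
# Meanings of the first six NAG codes, as defined in the PGN standard
nag_symbols <- c('$1' = '!', '$2' = '?', '$3' = '!!',
                 '$4' = '??', '$5' = '!?', '$6' = '?!')

# A shortened fragment of the game above, used only for illustration
fragment <- '13. Qe1 $6 Ne6 14. Nh2 $6 Bg6 $1'

# Look up each NAG found in the fragment
nags    <- regmatches(fragment, gregexpr('\\$\\d+', fragment, perl = TRUE))[[1]]
symbols <- unname(nag_symbols[nags])

# Replace each NAG with its symbol, attached to the preceding move.
# The (?!\\d) lookahead stops '$1' from matching the start of '$10'.
annotated <- fragment
for (nag in names(nag_symbols)) {
  pattern   <- paste0(' \\', nag, '(?!\\d)')
  annotated <- gsub(pattern, nag_symbols[[nag]], annotated, perl = TRUE)
}

print(symbols)
print(annotated)
```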

Use lex() to turn the text into tokens

  1. Start by defining the regular expression patterns for each element in the pgn file.
  2. Use minilexer::lex() to turn the pgn text into tokens
  3. Set the tags aside and throw away whitespace and newlines, since I’m only interested in the moves and comments.

#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Use the mini-lexer to break text into labelled tokens
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# remotes::install_github('coolbutuseless/minilexer')
library(minilexer)

#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Define all the patterns to match as regular expressions.
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
pgn_patterns <- c(
  tag           = '\\[.*?\\]',     # Capture tags as a unit
  comment       = "\\{.*?\\}",     # Capture comments as a unit
  resumption    = "\\d+\\.\\.\\.", # Resume moves after comment
  move_number   = "\\d+\\.",
  end_of_game   = '1-0|0-1|1/2-1/2|\\*', # Valid PGN results ('*' = unfinished/unknown)
  nag           = '\\$\\d+',             # Numeric annotation glyph
  move          = '[-+#=\\w\\./]+',      # Anything else is a move (incl. '#' mate, '=' promotion)
  newline       = '\n',
  whitespace    = '\\s+'
)

#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Define some different sets of tokens
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
chaff     <- c('whitespace', 'newline', 'tag')
non_moves <- c('comment', 'resumption', 'nag', 'end_of_game')

#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Parse a PGN file to tokens
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
pgn_text <- alekhine_pgn
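# Replace newlines with spaces so that comments wrapping across lines
# are matched as a single token
pgn_text <- gsub("\n", ' ', pgn_text)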
tokens   <- minilexer::lex(pgn_text, pgn_patterns)

#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Tidy the tokens
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
tags   <- tokens[names(tokens) == 'tag']
tokens <- tokens[!(names(tokens) %in% chaff)]

head(tokens, 20)
##                           move_number                                  move 
##                                  "1."                                  "e4" 
##                               comment                            resumption 
## "{Notes by Dr. Savielly Tartakower.}"                                "1..." 
##                                  move                           move_number 
##                                  "e5"                                  "2." 
##                                  move                                  move 
##                                 "Nf3"                                 "Nc6" 
##                           move_number                                  move 
##                                  "3."                                 "Bb5" 
##                                  move                           move_number 
##                                  "a6"                                  "4." 
##                                  move                                  move 
##                                 "Ba4"                                 "Nf6" 
##                           move_number                                  move 
##                                  "5."                                 "O-O" 
##                                  move                           move_number 
##                                "Nxe4"                                  "6." 
##                                  move                                  move 
##                                  "d4"                                  "b5"
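
For readers without minilexer installed, the general idea behind a lexer like this can be sketched in base R: repeatedly try each named pattern at the start of the remaining text and take the first one that matches. The `lex_sketch()` function below is a hypothetical helper written for this example. It is not minilexer's actual implementation, which may work differently. The sample movetext is also made up.

```r
# Hypothetical helper: a minimal lexer in base R (not minilexer's actual code)
lex_sketch <- function(text, patterns) {
  tokens <- character(0)
  labels <- character(0)
  while (nchar(text) > 0) {
    matched <- FALSE
    for (name in names(patterns)) {
      # Anchor each pattern at the start of the remaining text
      m   <- regexpr(paste0('^(?:', patterns[[name]], ')'), text, perl = TRUE)
      len <- attr(m, 'match.length')
      if (m[1] == 1 && len > 0) {
        tokens  <- c(tokens, substr(text, 1, len))
        labels  <- c(labels, name)
        text    <- substr(text, len + 1, nchar(text))
        matched <- TRUE
        break  # earlier patterns take priority over later ones
      }
    }
    if (!matched) stop('No pattern matches at: ', substr(text, 1, 20))
  }
  setNames(tokens, labels)
}

# A subset of the patterns above; order matters
sketch_patterns <- c(
  comment     = '\\{.*?\\}',
  resumption  = '\\d+\\.\\.\\.',
  move_number = '\\d+\\.',
  end_of_game = '1-0|0-1|1/2-1/2',
  move        = '[-+\\w\\./]+',
  whitespace  = '\\s+'
)

toks <- lex_sketch('1. e4 {Best by test.} 1... e5 2. Nf3 0-1', sketch_patterns)
toks <- toks[names(toks) != 'whitespace']
print(toks)
```

Because the first matching pattern wins, `resumption` has to be listed before `move_number`, and `end_of_game` before `move`. Otherwise "1..." would be split up and "0-1" would be labelled as a move.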

Final Game Record (after some manual tidying)

Tag        Value
Event      Vilnius All-Russian Masters
Site       Vilna (Vilnius) RUE
Date       1912.08.23
EventDate  1912.08.19
Round      5
Result     0-1
White      Alexander Alekhine
Black      Akiba Rubinstein
ECO        C83
WhiteElo   ?
BlackElo   ?
PlyCount   54

N    White  Black  Comment
1.   e4     e5     Notes by Dr. Savielly Tartakower.
2.   Nf3    Nc6
3.   Bb5    a6
4.   Ba4    Nf6
5.   O-O    Nxe4
6.   d4     b5
7.   Bb3    d5
8.   dxe5   Be6
9.   c3     Be7
10.  Nbd2   Nc5
11.  Bc2    Bg4
12.  h3     Bh5    The most reasonable course here is 12.Re1, guarding the e-pawn.
13.  Qe1    Ne6    Here again 13. Re1 ensured a very good game for White.
14.  Nh2    Bg6
15.  Bxg6   fxg6   ! Far seeing strategy! Black recognizes that the f-file and not the e-file will be needed as a base for action.
16.  Nb3    g5     Or 16.f4 d4!.
17.  Be3    O-O
18.  Nf3    Qd7
19.  Qd2    Rxf3   White pays insufficient attention to the scope of his opponent’s threats. A better course is 19.Nfd4 (19...Nxe5 20.Bxg5) seeking to establish equality.
20.  gxf3   Nxe5
21.  Qe2    Rf8
22.  Nd2    Ng6
23.  Rfe1   Bd6
24.  f4     Nexf4
25.  Qf1    Nxh3+
26.  Kh1    g4
27.  Qe2    Qf5
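
The tidying step itself isn't shown above. As a rough sketch of how it might be scripted, the code below pairs up the `move` tokens into White/Black columns and attaches each comment to the row of the move it follows. The `tokens` vector here is a hand-written stand-in shaped like the lexer output, and the code assumes every comment follows at least one move and that there is at most one comment per row.

```r
# Hand-written stand-in for the first few tokens produced by the lexer
tokens <- c(
  move_number = '1.', move = 'e4',
  comment     = '{Notes by Dr. Savielly Tartakower.}',
  resumption  = '1...', move = 'e5',
  move_number = '2.', move = 'Nf3', move = 'Nc6',
  end_of_game = '0-1'
)

# Moves alternate White, Black, White, ...
moves    <- unname(tokens[names(tokens) == 'move'])
is_white <- seq_along(moves) %% 2 == 1
white    <- moves[is_white]
black    <- moves[!is_white]
length(black) <- length(white)  # pad with NA if White made the last move

game <- data.frame(
  N       = seq_along(white),
  White   = white,
  Black   = black,
  Comment = '',
  stringsAsFactors = FALSE
)

# Attach each comment to the row of the move it follows
is_comment   <- names(tokens) == 'comment'
ply_before   <- cumsum(names(tokens) == 'move')[is_comment]
comment_row  <- ceiling(ply_before / 2)
comment_text <- gsub('^\\{|\\}$', '', unname(tokens[is_comment]), perl = TRUE)
game$Comment[comment_row] <- comment_text

print(game)
```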