Handwritten Parsers & Lexers in Go
In these days of web apps and REST APIs it seems that writing parsers is a dying art. You may think parsers are a complex undertaking only reserved for programming language designers but I’d like to dispel this idea. Over the past few years I’ve written parsers for JSON, CSS3, and database query languages and the more that I write parsers the more that I love them.
The Basics
Let’s start off with the basics: what is a lexer and what is a parser? When we parse a language (or, technically, a “formal grammar”) we do it in two phases. First we break up a series of characters into tokens. For a SQL-like language these tokens may be “whitespace”, “number”, “SELECT”, etc. This process is called lexing (or tokenizing or scanning).
Take this simple SQL SELECT statement as an example:
```sql
SELECT * FROM mytable
```
When we tokenize this string we’d see it as:
```
SELECT • WS • ASTERISK • WS • FROM • WS • STRING<"mytable">
```
This process, called lexical analysis, is similar to how we break up words in a sentence when we read. These tokens then get fed to a parser which performs syntactic analysis.
The parser’s job is to make sense of these tokens and make sure they’re in the right order. This is similar to how we derive meaning from combining words in a sentence. Our parser will construct an abstract syntax tree (AST) from our series of tokens and the AST is what our application will use.
In our SQL SELECT example, our AST may look like:
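The exact shape is up to you; a minimal sketch in Go might be a struct holding the selected fields and the table name (the field names here are illustrative):

```go
// SelectStatement represents the AST for a parsed SQL SELECT statement.
type SelectStatement struct {
	Fields    []string // selected field names, or "*"
	TableName string   // table being selected from
}
```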
Parser Generators
Many people use parser generators to automatically write a parser and lexer for them. There are many tools made to do this: lex, yacc, ragel. There’s even a Go implementation of yacc built into the go toolchain.
However, after using parser generators many times I’ve found them to be problematic. First, they involve learning a new language to declare your language format. Second, they’re difficult to debug. For example, try reading the Ruby language’s yacc file. Eek!
After watching a talk by Rob Pike on lexical scanning and reading the implementation of the go standard library package, I realized how much easier and simpler it is to hand write your parser and lexer. Let’s walk through the process with a simple example.
Writing a Lexer in Go
Defining our tokens
Let’s start by writing a simple parser and lexer for SQL SELECT statements. First, we need to define what tokens we’ll allow in our language. We’ll only allow a small subset of the SQL language:
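A minimal token set for this subset of SQL might look like the sketch below. Beyond WS, IDENT, ASTERISK, SELECT, and FROM, the names ILLEGAL, EOF, and COMMA are assumptions added so the later sketches have something to return:

```go
// Token represents a lexical token type.
type Token int

const (
	// Special tokens
	ILLEGAL Token = iota
	EOF
	WS

	// Literals
	IDENT // field names, table names

	// Misc characters
	ASTERISK // *
	COMMA    // ,

	// Keywords
	SELECT
	FROM
)
```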
We’ll use these tokens to represent a series of characters. For example, WS will represent one or more whitespace characters and IDENT will represent an identifier such as a field name or a table name.
Defining character classes
It’s useful to define functions that will let us check the type of a character. Here we’ll define two functions: one to check if a character is whitespace and one to check if the character is a letter.
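A simple sketch of those two helpers (the names isWhitespace and isLetter are illustrative):

```go
// isWhitespace reports whether ch is a space, tab, or newline.
func isWhitespace(ch rune) bool {
	return ch == ' ' || ch == '\t' || ch == '\n'
}

// isLetter reports whether ch is an ASCII letter.
func isLetter(ch rune) bool {
	return (ch >= 'a' && ch <= 'z') || (ch >= 'A' && ch <= 'Z')
}
```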
It’s also useful to define an “EOF” rune so that we can treat EOF like any other character:
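One way to do that is a sentinel rune value:

```go
// eof is returned by the scanner's read() when the input is exhausted.
var eof = rune(0)
```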
Scanning our input
Next we’ll want to define our Scanner type. This type will wrap our input reader with a bufio.Reader so we can peek ahead at characters. We’ll also add helper functions for reading and unreading characters from our underlying reader.
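A minimal sketch of such a Scanner, assuming the bufio and io packages are imported (the read and unread helper names are illustrative):

```go
// Scanner represents a lexical scanner.
type Scanner struct {
	r *bufio.Reader
}

// NewScanner returns a new instance of Scanner reading from r.
func NewScanner(r io.Reader) *Scanner {
	return &Scanner{r: bufio.NewReader(r)}
}

// read reads the next rune from the buffered reader.
// It returns the eof rune if the reader is exhausted or errors.
func (s *Scanner) read() rune {
	ch, _, err := s.r.ReadRune()
	if err != nil {
		return eof
	}
	return ch
}

// unread places the previously read rune back on the reader.
func (s *Scanner) unread() { _ = s.r.UnreadRune() }
```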
The entry function into Scanner will be the Scan() method, which returns the next token and the literal string that it represents:
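A sketch of Scan() along those lines:

```go
// Scan returns the next token and the literal string it represents.
func (s *Scanner) Scan() (tok Token, lit string) {
	// Read the next rune.
	ch := s.read()

	// If we see whitespace then consume all contiguous whitespace.
	// If we see a letter then consume as an ident or reserved word.
	if isWhitespace(ch) {
		s.unread()
		return s.scanWhitespace()
	} else if isLetter(ch) {
		s.unread()
		return s.scanIdent()
	}

	// Otherwise handle individual characters.
	switch ch {
	case eof:
		return EOF, ""
	case '*':
		return ASTERISK, string(ch)
	case ',':
		return COMMA, string(ch)
	}

	return ILLEGAL, string(ch)
}
```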
This entry function starts by reading the first character. If the character is whitespace then it is consumed with all contiguous whitespace characters. If it’s a letter then it’s treated as the start of an identifier or keyword. Otherwise we’ll check to see if it’s one of our single character tokens.
Scanning contiguous characters
When we want to consume multiple characters in a row we can do this in a simple loop. Here in scanWhitespace() we’ll consume whitespace characters until we hit a non-whitespace character:
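A sketch of scanWhitespace(), using a bytes.Buffer to collect the run (the bytes package is assumed to be imported):

```go
// scanWhitespace consumes the current rune and all contiguous whitespace.
func (s *Scanner) scanWhitespace() (tok Token, lit string) {
	// Create a buffer and read the current character into it.
	var buf bytes.Buffer
	buf.WriteRune(s.read())

	// Read every subsequent whitespace character into the buffer.
	// Non-whitespace characters and EOF will cause the loop to exit.
	for {
		if ch := s.read(); ch == eof {
			break
		} else if !isWhitespace(ch) {
			s.unread()
			break
		} else {
			buf.WriteRune(ch)
		}
	}

	return WS, buf.String()
}
```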
The same logic can be applied to scanning our identifiers. Here in scanIdent() we’ll read all letters and underscores until we hit a different character:
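And a matching sketch of scanIdent(), with the keyword check at the end (the strings package is assumed for the case-insensitive comparison):

```go
// scanIdent consumes the current rune and all contiguous letters/underscores.
func (s *Scanner) scanIdent() (tok Token, lit string) {
	// Create a buffer and read the current character into it.
	var buf bytes.Buffer
	buf.WriteRune(s.read())

	// Read every subsequent letter or underscore into the buffer.
	for {
		if ch := s.read(); ch == eof {
			break
		} else if !isLetter(ch) && ch != '_' {
			s.unread()
			break
		} else {
			buf.WriteRune(ch)
		}
	}

	// If the string matches a keyword then return its keyword token.
	switch strings.ToUpper(buf.String()) {
	case "SELECT":
		return SELECT, buf.String()
	case "FROM":
		return FROM, buf.String()
	}

	// Otherwise return it as a regular identifier.
	return IDENT, buf.String()
}
```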
This function also checks at the end if the literal string is a reserved word. If so then a specialized token is returned.
Writing a Parser in Go
Setting up the parser
Once we have our lexer ready, parsing a SQL statement becomes easier. First let’s define our Parser:
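A minimal sketch of the Parser; the single-token buffer shown here is just one way to lay it out:

```go
// Parser represents a parser for SELECT statements.
type Parser struct {
	s   *Scanner
	buf struct {
		tok Token  // last read token
		lit string // last read literal
		n   int    // buffer size (max=1)
	}
}

// NewParser returns a new instance of Parser reading from r.
func NewParser(r io.Reader) *Parser {
	return &Parser{s: NewScanner(r)}
}
```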
Our parser simply wraps our scanner but also adds a buffer for the last read token. We’ll define helper functions for scanning and unscanning so we can use this buffer:
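Sketches of those scan() and unscan() helpers:

```go
// scan returns the next token from the underlying scanner.
// If a token has been unscanned then read that instead.
func (p *Parser) scan() (tok Token, lit string) {
	// If we have a token on the buffer, return it.
	if p.buf.n != 0 {
		p.buf.n = 0
		return p.buf.tok, p.buf.lit
	}

	// Otherwise read the next token from the scanner
	// and save it to the buffer in case we unscan later.
	tok, lit = p.s.Scan()
	p.buf.tok, p.buf.lit = tok, lit

	return
}

// unscan pushes the previously read token back onto the buffer.
func (p *Parser) unscan() { p.buf.n = 1 }
```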
Our parser also doesn’t care about whitespace at this point so we’ll define a helper function to find the next non-whitespace token:
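A sketch of that helper:

```go
// scanIgnoreWhitespace scans the next non-whitespace token.
func (p *Parser) scanIgnoreWhitespace() (tok Token, lit string) {
	tok, lit = p.scan()
	if tok == WS {
		tok, lit = p.scan()
	}
	return
}
```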
Parsing the input
Our parser’s entry function will be the Parse() method. This function will parse the next SELECT statement from the reader. If we had multiple statements in our reader then we could call this function repeatedly.
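Putting the pieces together, Parse() might look roughly like the sketch below (the fmt package is assumed for error messages):

```go
// Parse parses a SQL SELECT statement.
func (p *Parser) Parse() (*SelectStatement, error) {
	stmt := &SelectStatement{}

	// First token should be a "SELECT" keyword.
	if tok, lit := p.scanIgnoreWhitespace(); tok != SELECT {
		return nil, fmt.Errorf("found %q, expected SELECT", lit)
	}

	// Next we should loop over all our comma-delimited fields.
	for {
		// Read a field.
		tok, lit := p.scanIgnoreWhitespace()
		if tok != IDENT && tok != ASTERISK {
			return nil, fmt.Errorf("found %q, expected field", lit)
		}
		stmt.Fields = append(stmt.Fields, lit)

		// If the next token is not a comma then break the loop.
		if tok, _ := p.scanIgnoreWhitespace(); tok != COMMA {
			p.unscan()
			break
		}
	}

	// Next we should see the "FROM" keyword.
	if tok, lit := p.scanIgnoreWhitespace(); tok != FROM {
		return nil, fmt.Errorf("found %q, expected FROM", lit)
	}

	// Finally we should read the table name.
	tok, lit := p.scanIgnoreWhitespace()
	if tok != IDENT {
		return nil, fmt.Errorf("found %q, expected table name", lit)
	}
	stmt.TableName = lit

	// Return the successfully parsed statement.
	return stmt, nil
}
```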
Let’s break this function down into small parts. First we’ll define the AST structure we want to return from our function:
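This is the same structure sketched in the introduction:

```go
// SelectStatement represents the AST for a parsed SQL SELECT statement.
type SelectStatement struct {
	Fields    []string
	TableName string
}
```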
Then we’ll make sure there’s a SELECT token. If we don’t see the token we expect then we’ll return an error to report the string we found instead.
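From the Parse() sketch above, that check looks like:

```go
// First token should be a "SELECT" keyword.
if tok, lit := p.scanIgnoreWhitespace(); tok != SELECT {
	return nil, fmt.Errorf("found %q, expected SELECT", lit)
}
```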
Next we want to parse a comma-delimited list of fields. In our parser we’re just considering identifiers and an asterisk as possible fields:
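The field-parsing loop from the sketch above:

```go
// Loop over all our comma-delimited fields.
for {
	// Read a field (an identifier or an asterisk).
	tok, lit := p.scanIgnoreWhitespace()
	if tok != IDENT && tok != ASTERISK {
		return nil, fmt.Errorf("found %q, expected field", lit)
	}
	stmt.Fields = append(stmt.Fields, lit)

	// If the next token is not a comma then break the loop.
	if tok, _ := p.scanIgnoreWhitespace(); tok != COMMA {
		p.unscan()
		break
	}
}
```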
After our field list we want to see a FROM keyword:
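Again from the sketch above:

```go
// Next we should see the "FROM" keyword.
if tok, lit := p.scanIgnoreWhitespace(); tok != FROM {
	return nil, fmt.Errorf("found %q, expected FROM", lit)
}
```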
Then we want to see the name of the table we’re selecting from. This should be an identifier token:
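And the table name step:

```go
// Read the table name.
tok, lit := p.scanIgnoreWhitespace()
if tok != IDENT {
	return nil, fmt.Errorf("found %q, expected table name", lit)
}
stmt.TableName = lit
```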
If we’ve gotten this far then we’ve successfully parsed a simple SQL SELECT statement so we can return our AST structure:
```go
return stmt, nil
```
Congrats! You’ve just built a working parser!
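As a quick sanity check, here’s a small (hypothetical) program showing how the Scanner and Parser sketched above could be used:

```go
package main

import (
	"fmt"
	"log"
	"strings"
)

func main() {
	// Parse a simple SELECT statement using the Parser defined above.
	stmt, err := NewParser(strings.NewReader("SELECT first_name, last_name FROM my_table")).Parse()
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(stmt.Fields)    // [first_name last_name]
	fmt.Println(stmt.TableName) // my_table
}
```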
Diving in deeper
You can find the full source of this example (with tests) at:
This parser example was heavily influenced by the InfluxQL parser. If you’re interested in diving deeper and understanding multiple statement parsing, expression parsing, or operator precedence then I encourage you to check out the repository:
If you have any questions or just love chatting about parsers, please find me on Twitter at @benbjohnson.