Parsing with ANTLR 4 and Go
What is ANTLR?
ANTLR (ANother Tool for Language Recognition), is an ALL(*) parser generator. In layman’s terms, Antlr, creates parsers in a number of languages (Go, Java, C, C#, Javascript), that can process text or binary input. The generated parser provides a callback interface to parse the input in an event-driven manner, which can be used as-is, or used to build parse trees (a data structure representing the input).
ANTLR is used by a number of popular projects, e.g Hive and Pig use it to parse Hadoop queries, Oracle and NetBeans uses it for their IDEs, and Twitter even uses it to understand search queries. Support was recently added so that ANTLR 4 can be used to generate parsers in pure Go. This article will explain some of the benefits of ANTLR, and walk us through a simple example.
Why use it?
It is possible to hand write a parser, but this process can be complex, error prone, and hard to change. Instead there are many parser generators that take a grammar expressed in an domain- specific way, and generates code to parse that language. Popular parser generates include bison and yacc. In fact, there is a version of yacc, goyacc, which is written in Go and was part of the main go repo until it was moved to golang.org/x/tools last year.
So why use ANTLR over these?
ANTLR has a suite of tools, and GUIs, that makes writing and debugging grammars easy.
It uses a simple EBNF syntax to define the grammar, instead of a bespoke configuration language.
ANTLR is an Adaptive LL(*) parser, ALL(*) for short, whereas most other parser generators (e.g Bison and Yacc) are LALR. The difference between LL(*) and LALR is out of scope for this article, but simply LALR works bottom-up, and LL(*) works top-down. This has a bearing on how the grammar is written, making some languages easier or harder to express.
The generated code for a LL(*) parser is more understandable than a LALR parser. This is because LALR parsers are commonly table driven, whereas LL(*) parsers encode the logic in its control flow, making it more comprehensible.
Finally ANTLR is agnostic to the target language. A single grammar can be used to generate parsers in Java, Go, C, etc. Unlike Bison/Yacc which typically embeds target language code into the grammar, making it harder to port.
Installing ANTLR v4
ANTLR is a Java 1.7 application, that generates the Go code needed to parse your language. During development Java is needed, but once the parser is built only Go and the ANTLR runtime library is required. The ANTLR site has documentation on how to install this on multiple platforms, but in brief, you can do the following:
|
|
The antlr
command is now available in your shell. If you prefer, the
.jar file can be placed into a ~/bin
directory, and the alias can be
stored in your ~/.bash_profile
.
Classic calculator example
Let’s start with the “hello world” for parsers, the calculator example.
We want to build a parser that handles simple mathematical expressions
such as 1 + 2 * 3
. The focus of this article is on how to use Go with
ANTLR, so the syntax of the ANTLR language won’t be explained in
detail, but the ANTLR site has compressive documentation.
As we go along, the source is available to all examples.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 |
// Calc.g4 grammar Calc; // Tokens MUL: '*'; DIV: '/'; ADD: '+'; SUB: '-'; NUMBER: [0-9]+; WHITESPACE: [ \r\n\t]+ -> skip; // Rules start : expression EOF; expression : expression op=('*'|'/') expression # MulDiv | expression op=('+'|'-') expression # AddSub | NUMBER # Number ; |
The above is a simple grammar split into two sections, tokens, and rules. The tokens are terminal symbols in the grammar, that is, they are made up of nothing but literal characters. Whereas rules are non- terminal states made up of tokens and/or other rules.
By convention this grammar must be saved with a filename that matches
the name of the grammar, in this case “Calc.g4” . To process this file,
and generate the Go parser, we run the antlr
command like so:
|
|
This will generate a set of Go files in the “parser” package and
subdirectory. It is possible to place the generated code in a different
package by using the -package <name>
argument. This is useful if your
project has multiple parsers, or you just want a more descriptive
package name for the parser. The generated files will look like the
following:
|
|
The generated files consist of three main components, the Lexer, Parser, and Listener.
The Lexer takes arbitrary input and returns a stream of tokens. For
input such as 1 + 2 * 3
, the Lexer would return the following tokens:
NUMBER (1), ADD (+), NUMBER (2), MUL (*), NUMBER (3), EOF
.
The Parser uses the Lexer’s output and applies the Grammar’s rules. Building higher level constructs, such as expressions that can be used to calculate the result.
The Listener then allows us to make use of the the parsed input. As mentioned earlier, yacc requires language specific code to be embedded with the grammar. However, ANTLR separates this concern, allowing the grammar to be agnostic to the target programming language. It does this through use of listeners, which effectively allows hooks to be placed before and after every rule is encountered in the parsed input.
Using the Lexer
Let’s move onto an example of using this generated code, starting with the Lexer.
|
|
To begin with, the generated parser is imported from the local
subdirectory import "./parser"
. Next the Lexer is created with some
input:
|
|
In this example the input is a simple string, "1 + 2 * 3"
but there
are other antlr.InputStream
s, for example, the antlr.FileStream
type can read directly from a file. The InputStream
is then passed to
a newly created Lexer. Note the name of the Lexer is CalcLexer
which
matches the grammar’s name defined in the Calc.g4.
The lexer is then used to consume all the tokens from the input, printing them one by one. This wouldn’t normally be necessary but we do this for demonstrative purposes.
|
|
Each token has two main components, the TokenType, and the Text. The TokenType is a simple integer representing the type of token, while the Text is literally the text that made up this token. All the TokenTypes are defined at the end of calc_lexer.go, with their string names stored in the SymbolicNames slice:
|
|
You may also note, that the Whitespace token is not printed, even though
the input clearly had whitespace. This is because the grammar was
designed to skip (i.e. discard) the whitespace WHITESPACE: [ \r\n\t]+
-> skip;
.
Using the Parser
The Lexer on its own is not very useful, so the example can be modified to also use the Parser and Listener:
|
|
This is very similar to before, but instead of manually iterating over
the tokens, the lexer is used to create a CommonTokenStream
,
which in turn is used to create a new CalcParser
. This CalcParser
is
then “walked”, which is ANTLR’s event-driven API for receiving the
results of parsing the rules.
Note, the Walk
function does not return anything. Some may
have expected a parsed form of the expression to be returned, such as
some kind of AST
(abstract syntax tree), but instead the Listener receives event as the
parsing occurs. This is similar in concept to
SAX style parsers
for XML. Event-based parsing can sometimes be harder to use, but it has
many advantages. For example, the parser can be very memory efficient as
previously parsed rules can be discarded once they are no longer needed.
The parser can also be aborted early if the programmer wishes to.
But so far, this example doesn’t do anything beyond ensuring the input
can be parsed without error. To add logic, we must extend the
calcListener
type. The calcListener
has an embedded
BaseCalcListener
, which is a helper type, that provides empty methods
for all those defined in in the CalcListener
interface. That interface
looks like:
|
|
There is an Enter and Exit function for each rule found in the grammar. As the input is walked, the Parser calls the appropriate function on the listener, to indicate when the rule starts and finishes being evaluated.
Adding the logic
A simple calculator can be constructed from this event driven parser by using a stack of values. Every time a number is found, it is added to a stack. Everytime an expression (add/multiple/etc) is found, the last two numbers on the stack are popped, and the appropriate operation is carried out. The result is then placed back on the stack.
Take the expression 1 + 2 * 3
, the result could be either (1 + 2) *
3 = 9
, or 1 + (2 * 3) = 7
. Those that recall the order of
operations, will
know that multiplication should always be carried out before addition,
thus the correct result is 7. However, without the parentheses there
could be some ambiguity on how this should be parsed. Luckily the
ambiguity is resolved by the grammar. The precedence of multiplication
over addition was subtly implied within Calc.g4, by placing the MulDiv
expressed before the AddSub
expression.
The code for a listener that implements this stack of value implementation is relatively simple:
|
|
Finally this listener would be used like so:
|
|
Following the algorithm, the parsing of 1 + 2 * 3
would work like so.
- The numbers 2 and 3 would be visited first (and placed on the stack),
- Then the MulDiv expression would be visited, taking the values 2 and 3, multiplying them, and placing the result, 6, back on the stack.
- Then the number 1 would visited and pushed onto the stack.
- Finally AddSub would be visited, popping the 1 and the 6 from the stack, placing the result 7 back.
The order the rules are visited is completely driven by the Parser, and thus the grammar.
More grammars
Learning how to write a grammar may be daunting, but there are many resources for help. The author of ANTLR, Terence Parr, has published a book, with some of the content freely available on antlr.org.
If you don’t want to write your own grammar, there are many pre-written grammars available. Including grammars for CSS, HTML, SQL, etc, as well many popular programming languages. To make it easier, I have generated parsers for all those available grammars, making them as easy to use just by importing.
A quick example of using one of the pre-generated grammars:
|
|
Conclusion
Hopefully this article has given you a taste of how to use ANTLR with Go. The examples for this article are found here, and the godoc for the ANTLR library is here which explains the various InputStream, Lexer, Parser, etc interfaces.
If you have any questions or comments, please reach out to me at @TheBramp or visit my website and blog, bramp.net for more articles.