Parsing with ANTLR 4 and Go
What is ANTLR?
ANTLR is used by a number of popular projects, e.g Hive and Pig use it to parse Hadoop queries, Oracle and NetBeans uses it for their IDEs, and Twitter even uses it to understand search queries. Support was recently added so that ANTLR 4 can be used to generate parsers in pure Go. This article will explain some of the benefits of ANTLR, and walk us through a simple example.
Why use it?
It is possible to hand write a parser, but this process can be complex, error prone, and hard to change. Instead there are many parser generators that take a grammar expressed in an domain- specific way, and generates code to parse that language. Popular parser generates include bison and yacc. In fact, there is a version of yacc, goyacc, which is written in Go and was part of the main go repo until it was moved to golang.org/x/tools last year.
So why use ANTLR over these?
It uses a simple EBNF syntax to define the grammar, instead of a bespoke configuration language.
ANTLR is an Adaptive LL(*) parser, ALL(*) for short, whereas most other parser generators (e.g Bison and Yacc) are LALR. The difference between LL(*) and LALR is out of scope for this article, but simply LALR works bottom-up, and LL(*) works top-down. This has a bearing on how the grammar is written, making some languages easier or harder to express.
The generated code for a LL(*) parser is more understandable than a LALR parser. This is because LALR parsers are commonly table driven, whereas LL(*) parsers encode the logic in its control flow, making it more comprehensible.
Finally ANTLR is agnostic to the target language. A single grammar can be used to generate parsers in Java, Go, C, etc. Unlike Bison/Yacc which typically embeds target language code into the grammar, making it harder to port.
Installing ANTLR v4
ANTLR is a Java 1.7 application, that generates the Go code needed to parse your language. During development Java is needed, but once the parser is built only Go and the ANTLR runtime library is required. The ANTLR site has documentation on how to install this on multiple platforms, but in brief, you can do the following:
antlr command is now available in your shell. If you prefer, the
.jar file can be placed into a
~/bin directory, and the alias can be
stored in your
Classic calculator example
Let’s start with the “hello world” for parsers, the calculator example.
We want to build a parser that handles simple mathematical expressions
1 + 2 * 3. The focus of this article is on how to use Go with
ANTLR, so the syntax of the ANTLR language won’t be explained in
detail, but the ANTLR site has compressive documentation.
As we go along, the source is available to all examples.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
// Calc.g4 grammar Calc; // Tokens MUL: '*'; DIV: '/'; ADD: '+'; SUB: '-'; NUMBER: [0-9]+; WHITESPACE: [ \r\n\t]+ -> skip; // Rules start : expression EOF; expression : expression op=('*'|'/') expression # MulDiv | expression op=('+'|'-') expression # AddSub | NUMBER # Number ;
The above is a simple grammar split into two sections, tokens, and rules. The tokens are terminal symbols in the grammar, that is, they are made up of nothing but literal characters. Whereas rules are non- terminal states made up of tokens and/or other rules.
By convention this grammar must be saved with a filename that matches
the name of the grammar, in this case “Calc.g4” . To process this file,
and generate the Go parser, we run the
antlr command like so:
This will generate a set of Go files in the “parser” package and
subdirectory. It is possible to place the generated code in a different
package by using the
-package <name> argument. This is useful if your
project has multiple parsers, or you just want a more descriptive
package name for the parser. The generated files will look like the
The generated files consist of three main components, the Lexer, Parser, and Listener.
The Lexer takes arbitrary input and returns a stream of tokens. For
input such as
1 + 2 * 3, the Lexer would return the following tokens:
NUMBER (1), ADD (+), NUMBER (2), MUL (*), NUMBER (3), EOF.
The Parser uses the Lexer’s output and applies the Grammar’s rules. Building higher level constructs, such as expressions that can be used to calculate the result.
The Listener then allows us to make use of the the parsed input. As mentioned earlier, yacc requires language specific code to be embedded with the grammar. However, ANTLR separates this concern, allowing the grammar to be agnostic to the target programming language. It does this through use of listeners, which effectively allows hooks to be placed before and after every rule is encountered in the parsed input.
Using the Lexer
Let’s move onto an example of using this generated code, starting with the Lexer.
To begin with, the generated parser is imported from the local
import "./parser". Next the Lexer is created with some
In this example the input is a simple string,
"1 + 2 * 3" but there
antlr.InputStreams, for example, the
type can read directly from a file. The
InputStream is then passed to
a newly created Lexer. Note the name of the Lexer is
matches the grammar’s name defined in the Calc.g4.
The lexer is then used to consume all the tokens from the input, printing them one by one. This wouldn’t normally be necessary but we do this for demonstrative purposes.
Each token has two main components, the TokenType, and the Text. The TokenType is a simple integer representing the type of token, while the Text is literally the text that made up this token. All the TokenTypes are defined at the end of calc_lexer.go, with their string names stored in the SymbolicNames slice:
You may also note, that the Whitespace token is not printed, even though
the input clearly had whitespace. This is because the grammar was
designed to skip (i.e. discard) the whitespace
WHITESPACE: [ \r\n\t]+
Using the Parser
The Lexer on its own is not very useful, so the example can be modified to also use the Parser and Listener:
This is very similar to before, but instead of manually iterating over
the tokens, the lexer is used to create a
which in turn is used to create a new
then “walked”, which is ANTLR’s event-driven API for receiving the
results of parsing the rules.
Walk function does not return anything. Some may
have expected a parsed form of the expression to be returned, such as
some kind of AST
(abstract syntax tree), but instead the Listener receives event as the
parsing occurs. This is similar in concept to
SAX style parsers
for XML. Event-based parsing can sometimes be harder to use, but it has
many advantages. For example, the parser can be very memory efficient as
previously parsed rules can be discarded once they are no longer needed.
The parser can also be aborted early if the programmer wishes to.
But so far, this example doesn’t do anything beyond ensuring the input
can be parsed without error. To add logic, we must extend the
calcListener type. The
calcListener has an embedded
BaseCalcListener, which is a helper type, that provides empty methods
for all those defined in in the
CalcListener interface. That interface
There is an Enter and Exit function for each rule found in the grammar. As the input is walked, the Parser calls the appropriate function on the listener, to indicate when the rule starts and finishes being evaluated.
Adding the logic
A simple calculator can be constructed from this event driven parser by using a stack of values. Every time a number is found, it is added to a stack. Everytime an expression (add/multiple/etc) is found, the last two numbers on the stack are popped, and the appropriate operation is carried out. The result is then placed back on the stack.
Take the expression
1 + 2 * 3, the result could be either
(1 + 2) *
3 = 9, or
1 + (2 * 3) = 7. Those that recall the order of
know that multiplication should always be carried out before addition,
thus the correct result is 7. However, without the parentheses there
could be some ambiguity on how this should be parsed. Luckily the
ambiguity is resolved by the grammar. The precedence of multiplication
over addition was subtly implied within Calc.g4, by placing the
expressed before the
The code for a listener that implements this stack of value implementation is relatively simple:
Finally this listener would be used like so:
Following the algorithm, the parsing of
1 + 2 * 3 would work like so.
- The numbers 2 and 3 would be visited first (and placed on the stack),
- Then the MulDiv expression would be visited, taking the values 2 and 3, multiplying them, and placing the result, 6, back on the stack.
- Then the number 1 would visited and pushed onto the stack.
- Finally AddSub would be visited, popping the 1 and the 6 from the stack, placing the result 7 back.
The order the rules are visited is completely driven by the Parser, and thus the grammar.
Learning how to write a grammar may be daunting, but there are many resources for help. The author of ANTLR, Terence Parr, has published a book, with some of the content freely available on antlr.org.
If you don’t want to write your own grammar, there are many pre-written grammars available. Including grammars for CSS, HTML, SQL, etc, as well many popular programming languages. To make it easier, I have generated parsers for all those available grammars, making them as easy to use just by importing.
A quick example of using one of the pre-generated grammars:
Hopefully this article has given you a taste of how to use ANTLR with Go. The examples for this article are found here, and the godoc for the ANTLR library is here which explains the various InputStream, Lexer, Parser, etc interfaces.