How to Tokenize Complex Strings with Lexmachine
This article is about using lexmachine to tokenize strings (split them up into their component parts) in the Go (golang) programming language. If you find yourself processing a complex file format or network protocol, this article will walk you through how to use lexmachine to process it both accurately and quickly. If you need more help after reading this article, take a look at the documentation, a tutorial, or an explainer article on how it all works.
What is string tokenization?
String tokenization takes a piece of text and splits it into categorized substrings. When conducted as part of parsing or compiling a programming language it may also be called lexical analysis. Depending on the kind of text being processed, splitting it up might be very simple or extremely complicated. For instance, some simple text formats (such as basic CSV or TSV files) separate each record by newline characters (`\n` or ASCII `0x0A`). At the other end of the spectrum, natural language (such as English text) may be very difficult to tokenize correctly.
Let’s look at a quick example of tokenizing an English sentence.
Mary had a little lamb and its fleece was white as snow.
Each word will be a separate token. A token is a pair made up of the lexeme (or substring) and the token type (or category). When tokenizing English, often the token type will be the part of speech.
```
Mary had a little lamb and ...
<Proper Noun> <Verb> <Article> <Adjective> <Noun> <Conjunction> ...
```
Tokenizing English is very difficult in general because determining the part of speech of a particular word is not always obvious. The same word can play multiple semantic roles in a sentence. Humans consider the sentence in its entirety when determining the parts of speech for each word. This allows us to “noun verbs” and “verb nouns.”
String tokenization as part of parsing
There are many common applications of tokenization which are not as difficult as assigning parts of speech to words in English. For instance, computer languages, configuration files, log files, network protocols, data interchange formats, etc… are all easy to tokenize. In these applications (as with natural language), tokenization is the first step in parsing or compilation and is called lexical analysis.
Modern compilers are designed as a “pipeline” or series of steps (called passes) which operate over a program. They take the source code and transform it step by step into the desired output (machine code, assembly, or another programming language). Breaking compilation into steps simplifies each stage and makes the individual passes reusable. The start of the pipeline is shown in Figure 1.
Figure 1: A compiler starts with source code and performs lexical analysis to split the code up into tokens. Then, the parser transforms that stream or list of tokens into a structured format (often an Abstract Syntax Tree).
But why tokenize before parsing? Surely, the parser could operate on the bytes of the source code rather than on a list of tokens. Some parsers do operate directly on the bytes. However, the parser can be simplified and made shorter and often more robust by defining it over tokens rather than bytes.
Consider the following: programming languages often have keywords such as `func`, `return`, and `if`. They also have “identifiers” or names. The keywords are set aside as special names and are not treated by a parser the same way other names are, as they often denote the start of special syntactic regions. In this way, they really operate in a similar manner to punctuation in written English. However, since keywords would be valid names if they were not reserved, special care must be taken in a parser which operates on bytes directly not to accidentally classify a keyword as a name. Parsers which work on tokens rather than bytes can ignore this problem, as it is handled at the lexical analysis stage.
Processing a custom configuration file format
As a motivating example for why you might need to perform a tokenization
operation in your daily life as a programmer let’s take a look at processing a
customer configuration file format. Linux has a library libsensor
which allows
it to read and understand the output of hardware sensors (such as CPU
temperature probes). It has a custom configuration file format,
sensors.conf, which describes how
Linux should translate the raw readings hardware monitoring chips to real-world
values – such as voltage and temperature. The first part of my laptop’s
sensors.conf
file begins with the following:
|
|
Let’s pretend we want to extract some information from sensors.conf files. For instance, maybe we want to know, for each chip, what labels have been defined. Straightforward enough, and definitely doable “by hand” if necessary, but much easier if the file is preprocessed to pull out the tokens:
```
Type    | Lexeme
-------------------
COMMENT | # It is recommended not to modify this file, but to drop your local
COMMENT | # changes in /etc/sensors.d/. File with names that start with a dot
COMMENT | # are ignored.
CHIP    | chip
NAME    | "lm78-*"
NAME    | "lm79-*"
NAME    | "lm80-*"
NAME    | "lm96080-*"
LABEL   | label
NAME    | temp1
NAME    | "M/B Temp"
CHIP    | chip
NAME    | "w83792d-*"
...
SET     | set
NAME    | in8_max
NUMBER  | 3.0
STAR    | *
NUMBER  | 1.10
```
According to the man page, the `chip` and `label` statements have the following structure (other bits left out):

```
chip NAME-LIST
label NAME NAME
```

where a NAME-LIST is one or more whitespace-separated NAMEs.
So to find all of the labeled “features” of a chip we need to do two things. First, find a `chip` statement and collect all the names associated with it. (These names are actually patterns which match multiple chips, but we will ignore that for this example.) Second, collect all the `label` statements (and the two associated names). The labels go with the chip statement which immediately precedes them. To implement this idea we are going to use lexmachine to construct a lexer (or tokenizer) which splits the file up into the appropriate tokens.
Defining a lexer with lexmachine
A lexer (or tokenizer) is defined by a set of categories (token types) which are specified by patterns. The patterns (in a traditional lexer) are expressed as regular expressions. As a quick and incomplete review, a regular expression (regex) is a “pattern” which describes a set of strings. A regex is made up of characters (`a`, `b`, `0`, `@`, …) which combine with the operators: concatenation (`abc`), alternation (`a|b`), grouping (`a(b|c)d`), and repetition (`a*`). Some examples:
- `abc` matches {"abc"}
- `a|b` matches {"a", "b"}
- `a(b|c)d` matches {"abd", "acd"}
- `a*` matches {"", "a", "aa", …}
- `a(b(c|d))*` matches {"a", "abc", "abd", "abcbc", "abcbd", …}
A fairly close reading of the man page for sensors.conf gives the following regular expressions to tokenize the file. (Note: `[]` defines a character class, which expresses a number of alternative characters to match.)
```
AT        : @
PLUS      : \+
STAR      : \*
DASH      : -
SLASH     : /
BACKSLASH : \\
CARROT    : \^
BACKTICK  : `
COMMA     : ,
LPAREN    : \(
RPAREN    : \)
BUS       : bus
CHIP      : chip
LABEL     : label
COMPUTE   : compute
IGNORE    : ignore
SET       : set
NUMBER    : [0-9]*\.?[0-9]+
COMMENT   : #[^\n]*
SPACE     : \s+
NAME      : [a-zA-Z_][a-zA-Z0-9_]*
NAME      : "((\\\\)|(\\.)|([^"\\]))*"
```
With these regular expressions in hand, let's define a lexer using lexmachine and use it to tokenize an example sensors.conf.
Aside: can’t we just use regexp?
A lexical analysis engine is very similar to a standard regular expression engine. However, the problem which regular expression engines solve (matching whole strings or finding patterns inside of strings) is slightly different from the lexical analysis problem. You could implement a lexical analyzer with the regexp package, but it would be much slower than lexmachine. Lexical analysis engines are fast because they use the same theoretical framework as regular expression engines, with the following adjustments:
- Prefixes of a string are matched
- All patterns are matched at the same time
- The pattern which matches the longest prefix wins
- In case of ties, the pattern with the highest precedence wins
Constructing a *lexmachine.Lexer
Importing the packages
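A minimal set of imports for the examples in this article might look like the following (the `lex` alias for the top-level package follows the lexmachine documentation; the `machines` subpackage provides the low-level match type used by actions):

```go
import (
	"fmt"
	"log"
	"strings"

	lex "github.com/timtadh/lexmachine"
	"github.com/timtadh/lexmachine/machines"
)
```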
Defining the tokens types
There are many ways to represent the token types (categories). In this example,
the names of the types are defined as a list of strings. The types themselves
(id
s) will be the index of the names. An init
function is used to construct
a reverse mapping from name to id
.
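A sketch of that setup:

```go
// Tokens lists the token type names; a token's id is its index.
var Tokens = []string{
	"AT", "PLUS", "STAR", "DASH", "SLASH", "BACKSLASH", "CARROT",
	"BACKTICK", "COMMA", "LPAREN", "RPAREN",
	"BUS", "CHIP", "LABEL", "COMPUTE", "IGNORE", "SET",
	"NUMBER", "COMMENT", "SPACE", "NAME",
}

// TokenIds maps a token name back to its id (its index in Tokens).
var TokenIds map[string]int

func init() {
	TokenIds = make(map[string]int)
	for i, name := range Tokens {
		TokenIds[name] = i
	}
}
```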
Constructing a new lexer object
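Creating the lexer object itself is one call:

```go
lexer := lex.NewLexer()
```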
Adding a single pattern
Patterns are added using the
lexer.Add method.
Let’s add the pattern for the chip
keyword:
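Using the token ids defined above (and the `getToken` helper defined in the next section), this might look like:

```go
lexer.Add([]byte(`chip`), getToken(TokenIds["CHIP"]))
```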
The Add method takes two arguments. The first is the regular expression and the second is the lexing action function, which is called when the pattern matches. The action typically takes the low-level *machines.Match object and turns it into a token representation of your (the user's) choice. For users who do not have strict requirements for how tokens are represented, the *lexmachine.Token object provides a useful implementation. *lexmachine.Scanner has a utility method for constructing these tokens, which helps us write a simple getToken action:
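A sketch of such an action factory built on the scanner's Token method:

```go
// getToken returns an action which constructs a *lexmachine.Token
// of the given token type from the match.
func getToken(tokenType int) lex.Action {
	return func(s *lex.Scanner, m *machines.Match) (interface{}, error) {
		return s.Token(tokenType, string(m.Bytes), m), nil
	}
}
```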
Adding all the patterns
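Repeating that for every pattern in the table above might look like the following. Note that the keywords are added before the `NAME` patterns: lexmachine gives earlier-added patterns higher precedence, so when a keyword and `NAME` match the same prefix the keyword wins.

```go
lexer.Add([]byte(`@`), getToken(TokenIds["AT"]))
lexer.Add([]byte(`\+`), getToken(TokenIds["PLUS"]))
lexer.Add([]byte(`\*`), getToken(TokenIds["STAR"]))
lexer.Add([]byte(`-`), getToken(TokenIds["DASH"]))
lexer.Add([]byte(`/`), getToken(TokenIds["SLASH"]))
lexer.Add([]byte(`\\`), getToken(TokenIds["BACKSLASH"]))
lexer.Add([]byte(`\^`), getToken(TokenIds["CARROT"]))
lexer.Add([]byte("`"), getToken(TokenIds["BACKTICK"]))
lexer.Add([]byte(`,`), getToken(TokenIds["COMMA"]))
lexer.Add([]byte(`\(`), getToken(TokenIds["LPAREN"]))
lexer.Add([]byte(`\)`), getToken(TokenIds["RPAREN"]))
lexer.Add([]byte(`bus`), getToken(TokenIds["BUS"]))
lexer.Add([]byte(`chip`), getToken(TokenIds["CHIP"]))
lexer.Add([]byte(`label`), getToken(TokenIds["LABEL"]))
lexer.Add([]byte(`compute`), getToken(TokenIds["COMPUTE"]))
lexer.Add([]byte(`ignore`), getToken(TokenIds["IGNORE"]))
lexer.Add([]byte(`set`), getToken(TokenIds["SET"]))
lexer.Add([]byte(`[0-9]*\.?[0-9]+`), getToken(TokenIds["NUMBER"]))
lexer.Add([]byte(`#[^\n]*`), getToken(TokenIds["COMMENT"]))
lexer.Add([]byte(`\s+`), getToken(TokenIds["SPACE"]))
lexer.Add([]byte(`[a-zA-Z_][a-zA-Z0-9_]*`), getToken(TokenIds["NAME"]))
lexer.Add([]byte(`"((\\\\)|(\\.)|([^"\\]))*"`), getToken(TokenIds["NAME"]))
```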
Skipping a pattern
Sometimes it is advantageous to not emit tokens for certain patterns and to
instead skip them. Commonly this occurs for whitespace and comments. To skip a
pattern simply have the action return nil, nil
:
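A sketch of such an action:

```go
// skip matches the lex.Action signature but produces no token.
func skip(s *lex.Scanner, m *machines.Match) (interface{}, error) {
	return nil, nil
}
```

For instance, to drop whitespace:

```go
lexer.Add([]byte(`\s+`), skip)
```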
In our example for sensors.conf we only care about the keywords and the NAME tokens. Let's skip the rest:
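Keeping the keyword and `NAME` rules from before, the remaining patterns can be registered with `skip` instead (a sketch):

```go
// Punctuation, numbers, comments, and whitespace produce no tokens.
lexer.Add([]byte(`@`), skip)
lexer.Add([]byte(`\+`), skip)
lexer.Add([]byte(`\*`), skip)
lexer.Add([]byte(`-`), skip)
lexer.Add([]byte(`/`), skip)
lexer.Add([]byte(`\\`), skip)
lexer.Add([]byte(`\^`), skip)
lexer.Add([]byte("`"), skip)
lexer.Add([]byte(`,`), skip)
lexer.Add([]byte(`\(`), skip)
lexer.Add([]byte(`\)`), skip)
lexer.Add([]byte(`[0-9]*\.?[0-9]+`), skip)
lexer.Add([]byte(`#[^\n]*`), skip)
lexer.Add([]byte(`\s+`), skip)
```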
Constructing the lexer exactly once
The lexer should be constructed and compiled exactly once: on program startup (unless the regular expressions are defined dynamically). This ensures that you pay the compilation cost only once and get the maximum benefit from the efficient DFA representation produced by compilation.
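A sketch of that structure: a newLexer function performs all of the Add calls (elided here) and compiles, and an init function runs it at startup:

```go
var lexer *lex.Lexer

func init() {
	var err error
	lexer, err = newLexer()
	if err != nil {
		panic(err)
	}
}

// newLexer constructs and compiles the lexer exactly once.
func newLexer() (*lex.Lexer, error) {
	lexer := lex.NewLexer()
	// ... all of the lexer.Add calls from the previous sections ...
	if err := lexer.Compile(); err != nil {
		return nil, err
	}
	return lexer, nil
}
```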
Example tokenization
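A sketch of a main function that tokenizes a short example string and prints each token with its position (the formatting below is approximate):

```go
func main() {
	scanner, err := lexer.Scanner([]byte(`chip chip1 chip2 label name value`))
	if err != nil {
		log.Fatal(err)
	}
	// Next returns (token, error, eof); iterate until eof.
	for tok, err, eof := scanner.Next(); !eof; tok, err, eof = scanner.Next() {
		if err != nil {
			log.Fatal(err)
		}
		token := tok.(*lex.Token)
		fmt.Printf("%-5v | %-7q | %v:%v-%v:%v\n",
			Tokens[token.Type], string(token.Lexeme),
			token.StartLine, token.StartColumn,
			token.EndLine, token.EndColumn)
	}
}
```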
Output
```
CHIP  | "chip"  | 1:1-1:4
NAME  | "chip1" | 1:6-1:10
NAME  | "chip2" | 1:12-1:16
LABEL | "label" | 1:18-1:22
NAME  | "name"  | 1:24-1:27
NAME  | "value" | 1:29-1:33
```
For more details on the scanner interface consult the narrative documentation.
Associating labels with chips
There are several ways we could approach taking the output of the scanner above and using it to find which labels go with which chip names. In this example, a very simple, short, but ugly method is going to be demonstrated. Inside of the scanning for-loop, a simple state machine will track which statement a particular NAME token belongs to. If it belongs to a chip statement or a label statement, it is appropriately associated. Finally, whenever a new CHIP token is encountered, the current buffer of labels is associated with the previous chip statement.
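A sketch of that loop (the printChipLabels name and the emit helper are illustrative, not part of lexmachine):

```go
// printChipLabels scans a sensors.conf and prints, for each chip
// name, the feature labels defined for it.
func printChipLabels(text []byte) error {
	scanner, err := lexer.Scanner(text)
	if err != nil {
		return err
	}
	var chips []string  // names from the current chip statement
	var labels []string // "feature:label" pairs for those chips
	var pair []string   // names seen so far in the current label statement
	statement := "none" // which statement the NAME tokens belong to
	emit := func() {
		for _, chip := range chips {
			fmt.Printf("chip %v: %v\n", chip, strings.Join(labels, ", "))
		}
	}
	for tok, err, eof := scanner.Next(); !eof; tok, err, eof = scanner.Next() {
		if err != nil {
			return err
		}
		token := tok.(*lex.Token)
		switch Tokens[token.Type] {
		case "CHIP":
			emit() // flush the labels for the previous chip statement
			chips, labels = nil, nil
			statement = "chip"
		case "LABEL":
			pair = nil
			statement = "label"
		case "NAME":
			name := strings.Trim(string(token.Lexeme), `"`)
			switch statement {
			case "chip":
				chips = append(chips, name)
			case "label":
				pair = append(pair, name)
				if len(pair) == 2 {
					labels = append(labels, pair[0]+":"+pair[1])
					statement = "none"
				}
			}
		default: // bus, compute, ignore, set
			statement = "none"
		}
	}
	emit() // flush the final chip statement
	return nil
}
```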
Output for the sensors.conf file on my machine:
```
chip lm78-*: temp1:M/B Temp
chip lm79-*: temp1:M/B Temp
chip lm80-*: temp1:M/B Temp
chip lm96080-*: temp1:M/B Temp
chip w83792d-*: in0:VcoreA, in1:VcoreB, in6:+5V, in7:5VSB, in8:Vbat
chip w83793-*: in0:VcoreA, in1:VcoreB, in7:+5V, in8:5VSB, in9:Vbat
chip w83795g-*: in12:+3.3V, in13:3VSB, in14:Vbat
chip w83795adg-*: in12:+3.3V, in13:3VSB, in14:Vbat
chip via686a-*: in0:Vcore, in2:+3.3V, in3:+5V, in4:+12V
chip adm1025-*: in1:Vcore, in2:+3.3V, in3:+5V, in4:+12V, in5:VCC, temp1:CPU Temp, temp2:M/B Temp
chip ne1619-*: in1:Vcore, in2:+3.3V, in3:+5V, in4:+12V, in5:VCC, temp1:CPU Temp, temp2:M/B Temp
chip lm87-*: in1:Vcore, in2:+3.3V, in3:+5V, in4:+12V, temp1:M/B Temp, temp2:CPU Temp
chip adm1024-*: in1:Vcore, in2:+3.3V, in3:+5V, in4:+12V, temp1:M/B Temp, temp2:CPU Temp
chip it87-*: in8:Vbat
chip it8712-*: in8:Vbat
chip it8716-*: in8:Vbat
chip it8718-*: in8:Vbat
chip it8720-*: in8:Vbat
chip fscpos-*: in0:+12V, in1:+5V, in2:Vbat, temp1:CPU Temp, temp2:M/B Temp, temp3:Aux Temp
chip fscher-*: in0:+12V, in1:+5V, in2:Vbat, temp1:CPU Temp, temp2:M/B Temp, temp3:Aux Temp
chip fscscy-*: in0:+12V, in1:+5V, in2:+3.3V, temp1:CPU0 Temp, temp2:CPU1 Temp, temp3:M/B Temp, temp4:Aux Temp
chip fschds-*: temp1:CPU Temp, temp2:Super I/O Temp, temp3:System Temp, an1:PSU Fan, an2:CPU Fan, an3:System FAN2, an4:System FAN3, an5:System FAN4, in0:+12V, in1:+5V, in2:Vbat
chip fscsyl-*: temp1:CPU Temp, temp4:Super I/O Temp, temp5:Northbridge Temp, an1:CPU Fan, an2:System FAN2, an3:System FAN3, an4:System FAN4, an7:PSU Fan, in0:+12V, in1:+5V, in2:Vbat, in3:+3.3V, in5:+3.3V-Aux
chip vt1211-*: in5:+3.3V, temp2:SIO Temp
chip vt8231-*: in5:+3.3V
chip smsc47m192-*: in1:Vcore, in2:+3.3V, in3:+5V, in4:+12V, in5:VCC, temp1:SIO Temp
chip lm85-*: in1:Vcore, in2:+3.3V, in3:+5V, in4:+12V, temp2:M/B Temp
chip lm85b-*: in1:Vcore, in2:+3.3V, in3:+5V, in4:+12V, temp2:M/B Temp
chip lm85c-*: in1:Vcore, in2:+3.3V, in3:+5V, in4:+12V, temp2:M/B Temp
chip adm1027-*: in1:Vcore, in2:+3.3V, in3:+5V, in4:+12V, temp2:M/B Temp
chip adt7463-*: in1:Vcore, in2:+3.3V, in3:+5V, in4:+12V, temp2:M/B Temp
chip adt7468-*: in1:Vcore, in2:+3.3V, in3:+5V, in4:+12V, temp2:M/B Temp
chip emc6d100-*: in1:Vcore, in2:+3.3V, in3:+5V, in4:+12V, temp2:M/B Temp
chip emc6d102-*: in1:Vcore, in2:+3.3V, in3:+5V, in4:+12V, temp2:M/B Temp
chip emc6d103-*: in1:Vcore, in2:+3.3V, in3:+5V, in4:+12V, temp2:M/B Temp
chip emc6d103s-*: in1:Vcore, in2:+3.3V, in3:+5V, in4:+12V, temp2:M/B Temp
chip emc6w201-*: in2:+3.3V, in3:+5V, temp6:M/B Temp
chip pc87365-*: in7:3VSB, in8:VDD, in9:Vbat, in10:AVDD, temp3:SIO Temp
chip pc87366-*: in7:3VSB, in8:VDD, in9:Vbat, in10:AVDD, temp3:SIO Temp
chip adm1030-*: temp1:M/B Temp
chip adm1031-*: temp1:M/B Temp
chip w83627thf-*: in3:+5V, in7:5VSB, in8:Vbat
chip w83627ehf-*: in0:Vcore, in2:AVCC, in3:+3.3V, in7:3VSB, in8:Vbat
chip w83627dhg-*: in0:Vcore, in2:AVCC, in3:+3.3V, in7:3VSB, in8:Vbat
chip w83667hg-*: in0:Vcore, in2:AVCC, in3:+3.3V, in7:3VSB, in8:Vbat
chip nct6775-*: in0:Vcore, in2:AVCC, in3:+3.3V, in7:3VSB, in8:Vbat
chip nct6776-*: in0:Vcore, in2:AVCC, in3:+3.3V, in7:3VSB, in8:Vbat
chip w83627uhg-*: in2:AVCC, in3:+5V, in7:5VSB, in8:Vbat
chip f71805f-*: in0:+3.3V
chip f71872f-*: in0:+3.3V, in9:Vbat, in10:3VSB
chip k8temp-*: temp1:Core0 Temp, temp2:Core0 Temp, temp3:Core1 Temp, temp4:Core1 Temp
chip dme1737-*: in0:5VSB, in1:Vcore, in2:+3.3V, in3:+5V, in4:+12V, in5:3VSB, in6:Vbat, temp2:SIO Temp
chip sch311x-*: in1:Vcore, in2:+3.3V, in3:+5V, in4:+12V, in5:3VSB, in6:Vbat, temp2:SIO Temp
chip sch5027-*: in0:5VSB, in1:Vcore, in2:+3.3V, in5:3VSB, in6:Vbat, temp2:SIO Temp
chip sch5127-*: in2:+3.3V, in5:3VSB, in6:Vbat
chip f71808e-*: in0:+3.3V, in7:3VSB, in8:Vbat
chip f71808a-*: in0:+3.3V, in7:3VSB, in8:Vbat
chip f71862fg-*: in0:+3.3V, in7:3VSB, in8:Vbat
chip f71869-*: in0:+3.3V, in7:3VSB, in8:Vbat
chip f71869a-*: in0:+3.3V, in7:3VSB, in8:Vbat
chip f71882fg-*: in0:+3.3V, in7:3VSB, in8:Vbat
chip f71889fg-*: in0:+3.3V, in7:3VSB, in8:Vbat
chip f71889ed-*: in0:+3.3V, in7:3VSB, in8:Vbat
chip f71889a-*: in0:+3.3V, in7:3VSB, in8:Vbat
chip f71858fg-*: in0:+3.3V, in1:3VSB, in2:Vbat
chip f8000-*: in0:+3.3V, in1:3VSB, in2:Vbat
chip f81865f-*: in0:+3.3V, in5:3VSB, in6:Vbat
chip adt7473-*: in2:+3.3V, temp2:Board Temp
chip adt7475-*: in2:+3.3V, temp2:Board Temp
chip adt7476-*: in1:Vcore, in2:+3.3V, in3:+5V, in4:+12V, temp2:M/B Temp
chip adt7490-*: in1:Vcore, in2:+3.3V, in3:+5V, in4:+12V, temp2:M/B Temp
```
Conclusion
This article presented lexmachine, a library that helps you break complex strings up into their component parts safely, reliably, and quickly. To learn more, visit the project page, browse the documentation, look at another tutorial, or read an explainer article on how it all works.