String Matching
How do you search for a string? If it’s just once, strings.Index(text, pattern) is probably your best option. The standard library currently uses Rabin-Karp to search the text for the pattern. However, there are lots of different cases for string searching, each of which has its own set of “best” algorithms. Fortunately for us, many of them already have implementations in Go.
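For a one-off search that really is all there is to it (the text and pattern below are just placeholders):

package main

import (
	"fmt"
	"strings"
)

func main() {
	text := "the quick brown fox"
	pattern := "brown"

	// Index returns the byte offset of the first match, or -1 if there is none.
	fmt.Println(strings.Index(text, pattern)) // 10
}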
index/suffixarray
What if you had a single text you wanted to do lots of searches through for different patterns? The index/suffixarray package implements a suffix array, which allows substring searches in O(log n) time on the length of the text. Creating the suffix array takes O(n log n) time, but you can quickly recoup that cost if you’re doing lots of searches over the same text. And unlike strings.Index, the suffix array can return all the locations that match the pattern, not just the first one.
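Basic usage is short: build the index once, then query it as often as you like. A minimal sketch (the sample text and query here are my own):

package main

import (
	"fmt"
	"index/suffixarray"
)

func main() {
	text := []byte("yellow mellow hello")

	// Building the index is the expensive part; do it once up front.
	sfx := suffixarray.New(text)

	// Lookup returns the offsets of every occurrence of the pattern;
	// -1 means no limit on the number of results.
	fmt.Println(sfx.Lookup([]byte("ello"), -1))
}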
The suffix array package is a hold-over from when godoc was in the standard library; it powers the search box. You can also use the suffix array to search across a large number of documents by joining all the texts into one big string and mapping the offsets returned by Lookup() back to the original documents the strings appeared in.
This example is a bit more involved:

- create the data array and list of offsets for each document
- create the suffix array from the combined data
- query the suffix array for all indexes where ello appears
- search the offsets list for the document covering that offset
- print out all the matched documents
package main

import (
	"fmt"
	"index/suffixarray"
	"sort"
)

func main() {
	docs := []string{
		"hello world",
		"worldly goods",
		"yello",
		"lowly",
	}

	// Concatenate the documents into one big byte slice, recording the
	// end offset of each document so matches can be mapped back later.
	var data []byte
	var offsets []int
	for _, d := range docs {
		data = append(data, []byte(d)...)
		offsets = append(offsets, len(data))
	}

	sfx := suffixarray.New(data)

	query := "ello"
	idxs := sfx.Lookup([]byte(query), -1)

	var results []int

	for _, idx := range idxs {
		// Find the first document whose end offset is past this match.
		i := sort.Search(len(offsets), func(i int) bool { return offsets[i] > idx })
		// Make sure the match doesn't straddle a document boundary.
		if idx+len(query) <= offsets[i] {
			results = append(results, i)
		}
	}

	fmt.Printf("%q is in documents %v\n", query, results)
}
Aho-Corasick: matching a large number of patterns
A suffix array lets us search for a pattern in a (preprocessed) document. Let’s flip this around and say we have a large number of patterns, and we want to know which ones appear in a given text. The Aho-Corasick string matching algorithm takes a set of patterns and creates a giant finite-state machine which, when the document is passed to it, matches against all the patterns at once. CloudFlare has released an implementation in cloudflare/ahocorasick.
Let’s use it to match a document against a list of planets. Note that it even finds the overlapping match for venusaturn:
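A sketch of what that looks like (the sample document is my own; it uses the package’s NewStringMatcher and Match calls, where Match returns the dictionary indexes of the patterns it found):

package main

import (
	"fmt"

	"github.com/cloudflare/ahocorasick"
)

func main() {
	planets := []string{
		"mercury", "venus", "earth", "mars",
		"jupiter", "saturn", "uranus", "neptune",
	}

	// Build the Aho-Corasick automaton once from the full pattern set.
	m := ahocorasick.NewStringMatcher(planets)

	doc := []byte("venusaturn is not a planet, but earth is")

	// Match scans the document once and returns the indexes (into planets)
	// of every pattern that appears.
	for _, hit := range m.Match(doc) {
		fmt.Println(planets[hit])
	}
	// venus and saturn overlap inside "venusaturn", and both are reported.
}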
Trigram indexing
At the other end of the string matching scale, we have actual search engines like bleve and ferret. Both of these might be overkill for a small application that just needs a bit of string matching sped up. This was the situation I found myself in when working on carbonserver. Without getting into much detail, the problem involved matching file-system globs against a large number of files (~700 thousand) in multiple nested directories. A profile showed that almost all the time was spent evaluating the globs, reading directories, and sorting file names. I wrote a quick trigram indexing library, dgryski/go-trigram, and was able to reduce query times from 20ms to less than 1ms.
There are blog posts describing trigram indexing in detail, and I’ll give a quick example of using my trigram library. This time our documents will consist of Go conferences:
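Something like the following (the conference list is just for illustration; it assumes the NewIndex and Query calls from dgryski/go-trigram, with document IDs corresponding to positions in the input slice):

package main

import (
	"fmt"

	trigram "github.com/dgryski/go-trigram"
)

func main() {
	docs := []string{
		"GopherCon",
		"dotGo",
		"GothamGo",
		"GopherCon India",
		"GolangUK",
	}

	// Build a trigram index over all the documents.
	idx := trigram.NewIndex(docs)

	// Query returns the IDs of the documents whose trigrams cover the query.
	ids := idx.Query("Gopher")

	for _, id := range ids {
		// Assumes document IDs map back to positions in the docs slice.
		fmt.Println(id, docs[id])
	}
}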
Note that with trigram matching, it’s important to make sure the resulting documents actually contain the query string. It’s possible to have trigram matches for a query even if the document doesn’t actually contain it. For example, the document GopherGoggles would match the query rGop (trigrams rGo and Gop).
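In practice that means one last verification pass over the candidates; a minimal sketch (the verify helper and the hard-coded candidate list are mine, standing in for whatever the index returns):

package main

import (
	"fmt"
	"strings"
)

// verify drops candidates that don't actually contain the query as a substring.
func verify(docs []string, candidates []int, query string) []string {
	var out []string
	for _, id := range candidates {
		if strings.Contains(docs[id], query) {
			out = append(out, docs[id])
		}
	}
	return out
}

func main() {
	docs := []string{"GopherGoggles"}

	// The trigram index would offer document 0 as a candidate for "rGop",
	// but the substring check rejects it, so nothing survives.
	fmt.Println(verify(docs, []int{0}, "rGop"))
}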
I’ve covered three different algorithms for string matching, but there are many more. The area of string searching is still an active research topic, with applications in a number of different fields.