Probabilistic Data Structures for Go
Imagine you had access logs for a very high traffic website. How would you
determine how many different IP addresses accessed your site? Or how many hits
from a particular IP? Or which ones accessed it the most? Assuming IPv4
addresses, you could use a map[uint32]int
to maintain the counts, but that
could end up using a lot of memory. It’s certainly possible to have a map with
4 billion entries, and a real log server wouldn’t have accesses from every
single valid IP address, but the problem still exists.
Luckily, there’s a class of algorithms that lets you trade memory usage for accuracy. In many cases the reduction in memory can be significant and the drop in accuracy is minimal. The go-probably library by Dustin Sallings implements a number of these basic data structures and algorithms. (It’s almost always better to process things exactly, if you have the memory though. These algorithms can be slower than exact answers for small data sets.)
I also like this package because it was one of the first I contributed to on GitHub, and my first commit-bit on somebody else’s repository.
Let’s look at how the types in this package would handle the above problems.
HyperLogLog: cardinality estimation
The algorithm we’re going to use for cardinality estimation (i.e., counting
distinct items in our set) is
HyperLogLog. I’m not going to
explain the math (there are already good blog
posts
for that), only how to use the implementation in go-probably
.
An abridged look at at the API shows:
|
|
First we construct an instance of a HyperLogLog estimator with
NewHyperLogLog()
. Then for each item, we need to pass a 32-bit hash of the
value to Add()
, and at the end we call Count()
to get the estimate.
For this example I’m using crc32 as our hash function. It works “good enough” and has the advantage of being in the standard library. In production I might use murmur3 or xxhash, both of which are faster but are external dependencies.)
|
|
Count-Min Sketch: approximate frequencies
The next data structure provided by go-probably
is Count-Min
Sketch. A CM-sketch
lets you estimate how many times you’ve seen different elements in your
data set. A count-min sketch is similar to a Bloom
filter in that they’re both
probabilistic, but a Bloom filter returns a boolean “Have I seen this”, rather
than an estimated count.
A count-min sketch is useful for on-line queries, since if you knew which addresses you wanted to track you would do so as you saw them in the input.
The API exposed by the Sketch type is a bit more complex, but for most uses you can focus on the three methods which are similar to those for HyperLogLog:
|
|
Unlike with HyperLogLog, we don’t need to provide a hash function; Add()
uses
its own hash function internally. This demo program reads an input file and
then prompts the user for entries to provide estimated counts for.
|
|
TopK queries
Because a Count-Min sketch contains estimated counts, it’s fairly straightforward to use it along with a heap to estimate the most popular elements in a stream. However, I prefer a much more magical algorithm that includes a tiny sketch as one of its sub-pieces. I’ve implemented it in dgryski/go-topk
Putting it together: Hokusai
A good example of using Count-Min sketches in a real system is the Hokusai paper, by Sergiy Matusevych, Alex Smola, Amr Ahmed. It tracks all its keys and creates aggregate sketches over time, allowing point queries when the full set of keys would be prohibitively large to store. The example given in the paper is tracking the popularity of all the different search queries provided to a search engine in a given week, and being able to plot how often a given search query occurred.
I have a basic implementation this system in dgryski/hokusai.
A number of my recent patches to go-probably
came from my needs while
implementing this paper.
Further Reading
The go-probably source code has links to all the papers describing the algorithms, most of which are quite readable.
The Highly Scalable blog had a good post on Probabilistic Data Structures for Web Analytics and Data Mining that gives a good overview how some of these algorithms actually work.
Google has a modification to HyperLogLog called HyperLogLog++. There’s a good analysis of the changes they’ve made at HyperLogLog++: Google’s Take On Engineering HLL. There is a Go implementation at clarkduvall/hyperloglog.