GopherAcademy
Alexander Neumann
Dec 1, 2018

Splitting Data with Content-Defined Chunking

In this post you’ll learn what Content-Defined Chunking (CDC) is and how you can use it to split large data into smaller blocks in a deterministic way. These blocks can be found again in other data later, even if a block’s location is different from the first time. I wrote a small Go package to do the splitting, which performs really well.

Why is splitting data an interesting problem?

In my spare time, I’m working on a fast backup program called restic, which is written in Go. When a (possibly large) file is read, the program needs to save the data somewhere so that it can be restored later. I think it is a good idea to split files into smaller parts, which are more manageable for a program. It also allows detecting and efficiently handling small changes in big files, like virtual machine images. Once such a part of a file is saved, it can be referenced whenever the same data is contained in other files, so restic de-duplicates the data it reads. The parts are identified by the SHA-256 hash of their contents, so a list of these hashes can describe the content of a file.

There are different strategies for splitting files. The most obvious one is to use static boundaries, e.g. splitting after every megabyte of data. This gives us manageable chunks, but if the data in the file is shifted, for example by inserting a byte at the start of the file, all chunks will be different and need to be saved anew. We can do better than that.
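
To make the problem concrete, here is a sketch of such a naive fixed-size splitter (the splitFixed name and the 1 MiB chunk size are made up for illustration). Inserting a single byte at the front of the input shifts every boundary, so the hash of every block changes:

package main

import (
	"crypto/sha256"
	"fmt"
	"io"
	"os"
)

// splitFixed reads r and prints the SHA-256 hash of every 1 MiB block.
// Because the boundaries depend only on the offset, shifting the data
// by one byte changes every block.
func splitFixed(r io.Reader) error {
	buf := make([]byte, 1024*1024) // static 1 MiB chunk size
	for {
		n, err := io.ReadFull(r, buf)
		if n > 0 {
			fmt.Printf("%x\n", sha256.Sum256(buf[:n]))
		}
		if err == io.EOF || err == io.ErrUnexpectedEOF {
			return nil
		}
		if err != nil {
			return err
		}
	}
}

func main() {
	if err := splitFixed(os.Stdin); err != nil {
		panic(err)
	}
}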

Rabin fingerprints

During the research I did before starting to implement what would later become restic, I discovered a publication titled “Fingerprinting by Random Polynomials” by Michael O. Rabin published in 1981. It describes a way to efficiently compute a “Rabin fingerprint” (or “hash”, in modern terms) for a number of bytes. It has the great property that it can be computed as a rolling hash: the Rabin fingerprint of a window of bytes can be efficiently calculated while the window moves through a very large file. The idea is that when a new byte is to be processed, the oldest byte is first removed from the calculation, and the new byte is factored in.
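
To make the “remove the oldest byte, factor in the new byte” step concrete, here is a minimal sketch of a generic polynomial rolling hash over a fixed window. This is not the Rabin fingerprint arithmetic the chunker package implements; the 64-byte window matches restic, but the multiplier and the implicit arithmetic modulo 2^64 are assumptions of this sketch:

const (
	windowSize = 64
	base       = 257 // arbitrary multiplier, chosen only for this sketch
)

type rollingHash struct {
	hash   uint64
	window [windowSize]byte
	pos    int
	pow    uint64 // base^(windowSize-1), used to drop the oldest byte
}

func newRollingHash() *rollingHash {
	r := &rollingHash{pow: 1}
	for i := 0; i < windowSize-1; i++ {
		r.pow *= base
	}
	return r
}

// Roll removes the oldest byte from the hash, adds b and returns the
// updated hash value. The window starts out filled with zero bytes.
func (r *rollingHash) Roll(b byte) uint64 {
	old := r.window[r.pos]
	r.window[r.pos] = b
	r.pos = (r.pos + 1) % windowSize

	r.hash -= uint64(old) * r.pow    // remove the byte leaving the window
	r.hash = r.hash*base + uint64(b) // factor in the new byte
	return r.hash
}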

How can we use that to find chunk boundaries? Restic uses a window of 64 bytes, so while reading a file, it computes the fingerprint of all such windows: byte 0 to byte 63, byte 1 to byte 64, and so on. It then checks, for every fingerprint, whether the least significant bits are all zero. If they are, we have found a chunk boundary! For random input bytes, the calculated fingerprints are roughly random as well, so by varying the number of bits that are checked we can (roughly) configure how large the resulting chunks will be. Since the cut point of a chunk only depends on the 64 bytes preceding it, this method of splitting data is called “Content-Defined Chunking” (CDC).
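
In code, the boundary test is just a bitmask check on the fingerprint. A minimal sketch, using 20 checked bits as an example value (which gives an expected average chunk size of about 2^20 bytes for random input):

// isBoundary reports whether a fingerprint marks a chunk boundary,
// i.e. whether its lowest `bits` bits are all zero.
const bits = 20 // example value; more bits mean larger average chunks

func isBoundary(fingerprint uint64) bool {
	const mask = (1 << bits) - 1
	return fingerprint&mask == 0
}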

Early on I recognized that the algorithm and implementation have the potential to be very helpful to other projects as well, so I published it as a separate package for other people to import (API on godoc.org), with a very liberal BSD 2-clause license.

Enough theory, let’s dive into the code!

Splitting data

All the hard work of computing the fingerprints over the sliding window is done by the Chunker type. The Rabin fingerprint needs a so-called “irreducible polynomial” to work, which is encoded as a uint64. The chunker package exports the function RandomPolynomial(), which creates a new polynomial for you to use.

Our first program will generate a random polynomial and create a new Chunker which reads data from os.Stdin:

p, err := chunker.RandomPolynomial()
if err != nil {
	panic(err)
}

fmt.Printf("using polynomial %v for splitting data\n", p)

chk := chunker.New(os.Stdin, p)

The Next() method on the Chunker reads data from the reader (os.Stdin in this example) into a byte slice buffer. The method returns the next Chunk and an error. We call it repeatedly until io.EOF is returned, which tells us that all data has been read:

buf := make([]byte, 16*1024*1024) // 16 MiB
for {
	chunk, err := chk.Next(buf)
	if err == io.EOF {
		break
	}

	if err != nil {
		panic(err)
	}

	fmt.Printf("%d\t%d\t%016x\n", chunk.Start, chunk.Length, chunk.Cut)
}
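
For reference, the two snippets above combine into a small, self-contained program. This is a reconstruction and may differ slightly from the linked code:

package main

import (
	"fmt"
	"io"
	"os"

	"github.com/restic/chunker"
)

func main() {
	p, err := chunker.RandomPolynomial()
	if err != nil {
		panic(err)
	}

	fmt.Printf("using polynomial %v for splitting data\n", p)

	chk := chunker.New(os.Stdin, p)

	buf := make([]byte, 16*1024*1024) // 16 MiB buffer passed to Next()
	for {
		chunk, err := chk.Next(buf)
		if err == io.EOF {
			break
		}

		if err != nil {
			panic(err)
		}

		fmt.Printf("%d\t%d\t%016x\n", chunk.Start, chunk.Length, chunk.Cut)
	}
}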

Let’s get some random data to test:

$ dd if=/dev/urandom bs=1M count=100 of=data
100+0 records in
100+0 records out
104857600 bytes (105 MB, 100 MiB) copied, 0,548253 s, 191 MB/s

Now if we feed the data in this file to the program we’ve just written (code), it prints the start offset, length, and fingerprint of each chunk it finds:

$ cat data | go run main.go
using polynomial 0x3dea92648f6e83 for splitting data
0       2908784 000c31839a100000
2908784 2088949 001e8b4104900000
4997733 1404824 000a18d1f0200000
6402557 617070  001bc6ac84300000
7019627 1278326 001734984b000000
8297953 589913  0004cf802b600000
8887866 554526  001eb5a362900000
9442392 1307416 000ef1e549c00000
[...]
102769045       864607  00181b67df300000
103633652       592134  0000a0c8b4200000
104225786       631814  001d5ba20d38998b

On a second run, it’ll generate a new polynomial and the chunks will be different, so if you depend on your program computing the same boundaries, you’ll need to pass in the same polynomial.

Let’s fix the polynomial (0x3dea92648f6e83) for the next runs and change the program like this (code):

chk := chunker.New(os.Stdin, 0x3dea92648f6e83)

After replacing the randomly generated polynomial with a constant, every new run will give us the same chunks:

$ cat data | go run main.go
0       2908784 000c31839a100000
2908784 2088949 001e8b4104900000
4997733 1404824 000a18d1f0200000
[...]
102769045       864607  00181b67df300000
103633652       592134  0000a0c8b4200000
104225786       631814  001d5ba20d38998b

Now we can experiment with the data. For example, what happens when we insert bytes at the beginning? Let’s test that by inserting the bytes foo using the shell:

$ (echo -n foo; cat data) | go run main.go
0       2908787 000c31839a100000
2908787 2088949 001e8b4104900000
4997736 1404824 000a18d1f0200000
6402560 617070  001bc6ac84300000
[...]

We can see that the first chunk is different: it’s three bytes longer (2908787 versus 2908784), but the fingerprint at the end of the chunk is the same. All other chunks following the first one are also the same!

Let’s add a hash to identify the contents of the individual chunks (code):

for {
	chunk, err := chk.Next(buf)
	if err == io.EOF {
		break
	}

	if err != nil {
		panic(err)
	}

	hash := sha256.Sum256(chunk.Data)
	fmt.Printf("%d\t%d\t%016x\t%032x\n", chunk.Start, chunk.Length, chunk.Cut, hash)
}

These are the chunks for the raw, unmodified file:

$ cat data | go run main.go
0       2908784 000c31839a100000        8d17f5f7326a2fd7[...]
2908784 2088949 001e8b4104900000        ba7c71226609c0de[...]
4997733 1404824 000a18d1f0200000        9c4db21626a27d4b[...]
6402557 617070  001bc6ac84300000        4c249c3b88500c23[...]
7019627 1278326 001734984b000000        5bec562b7ad37b8b[...]
[...]

Now we can see that adding bytes at the start only changes the first chunk:

$ (echo -n foo; cat data) | go run main.go
0       2908787 000c31839a100000        f41084d25bc273f2[...]
2908787 2088949 001e8b4104900000        ba7c71226609c0de[...]
4997736 1404824 000a18d1f0200000        9c4db21626a27d4b[...]
6402560 617070  001bc6ac84300000        4c249c3b88500c23[...]
7019630 1278326 001734984b000000        5bec562b7ad37b8b[...]

The SHA-256 hash for the first chunk changed from 8d17f5f7326a2fd7[...] to f41084d25bc273f2[...], but the rest are still the same. That’s pretty cool.

Let’s see how many different chunks we have in our data file by counting the different hashes:

$ cat data | go run main.go | cut -f4 | sort | uniq | wc -l
69

What happens when we feed the program our data twice? How many chunks does it detect? Let’s find out:

$ (cat data; cat data) | go run main.go | cut -f4 | sort | uniq | wc -l
70

Huh, so we only got a single additional chunk. The reason is that in the first pass, the chunker finds the same chunks as before (it’s the same data, after all) right up until the last chunk. In the previous run, the end of the last chunk was determined by the end of the file. Now there’s additional data, so the chunk continues. The next cut point falls within the second copy of the data and is the same as the very first cut point, so that chunk has some additional data at its beginning and its SHA-256 hash is different. Afterwards, the same chunks follow in the same order.

Changing a few consecutive bytes of data somewhere in the file also, in most cases, affects only a single chunk. Let’s write the output to a file, change the data a bit using sed, and write another file:

$ cat data | go run main.go > orig.log
$ cat data | sed 's/EZE8HX/xxxxxx/' | go run main.go > mod.log

The string EZE8HX was present in my randomly generated file data only once, and the sed command above changed it to the string xxxxxx. When we now compare the two log files using diff, we can see that the SHA-256 hash has changed for exactly one chunk, but the other chunks and the chunk boundaries stayed the same:

$ diff -au orig.log mod.log
--- orig.log    2018-11-24 13:14:45.265329525 +0100
+++ mod.log     2018-11-24 13:19:32.344681407 +0100
@@ -55,7 +55,7 @@
 83545247       547876  00198b8839f00000        2cbeea803ba79d54[...]
 84093123       889631  0007ddacabc00000        a2b6e8651ae1ab69[...]
 84982754       2561935 000c53712f500000        bca589959b019a80[...]
-87544689       2744485 0000a49118b00000        02b125104f06ed85[...]
+87544689       2744485 0000a49118b00000        ce2e43496ae6e7c3[...]
 90289174       1167308 00034d6cce700000        ad25a331993d9db7[...]
 91456482       1719951 001d2fea8ae00000        ba5153845c5b5228[...]
 93176433       1362655 0003009fc1600000        8e7a35e340c10d61[...]

This technique can be used for many other things besides a backup program. For example, the program rsync uses a (different) rolling hash to efficiently transfer files by detecting which parts of a file are already present on the receiving side. The Low Bandwidth Network File System (LBFS) uses CDC with Rabin fingerprints to transfer only the chunks that are needed over the network. Another application of a rolling hash is finding strings in text, for example in the Rabin-Karp algorithm.
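
As a tiny illustration of that last use case, a Rabin-Karp style search can be sketched like this (the multiplier and the implicit arithmetic modulo 2^64 are assumptions of the sketch, and matches found by the rolling hash are verified byte by byte):

// rabinKarp returns the index of the first occurrence of pattern in
// text, or -1 if it does not occur.
func rabinKarp(text, pattern string) int {
	n, m := len(text), len(pattern)
	if m == 0 || m > n {
		return -1
	}

	const base = 257 // arbitrary multiplier for this sketch
	var patHash, winHash, pow uint64 = 0, 0, 1
	for i := 0; i < m; i++ {
		patHash = patHash*base + uint64(pattern[i])
		winHash = winHash*base + uint64(text[i])
		if i < m-1 {
			pow *= base // ends up as base^(m-1)
		}
	}

	for i := 0; ; i++ {
		if winHash == patHash && text[i:i+m] == pattern {
			return i
		}
		if i+m >= n {
			return -1
		}
		// roll the window: drop text[i], append text[i+m]
		winHash = (winHash-uint64(text[i])*pow)*base + uint64(text[i+m])
	}
}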

If you’re interested in how the chunker package is used in restic, there’s a post over at the restic blog which explains this in a bit more detail.

Performance

Performance is crucial for backup programs. If creating backups is too slow, people tend to stop using the program.

The chunker package is written in plain Go. It doesn’t have any fancy low-level assembler, but the code is already pretty fast, thanks to some optimized calculations. The package has some benchmarks:

$ go test -run xxx -bench 'BenchmarkChunker$'
goos: linux
goarch: amd64
pkg: github.com/restic/chunker
BenchmarkChunker-4            20          70342669 ns/op         477.01 MB/s
--- BENCH: BenchmarkChunker-4
    chunker_test.go:325: 23 chunks, average chunk size: 1458888 bytes
    chunker_test.go:325: 23 chunks, average chunk size: 1458888 bytes
PASS
ok      github.com/restic/chunker       1.655s

Running on an older machine with an Intel i5 CPU, the chunker processes data at about 477 MB per second on a single core.
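
If you want a rough number for your own machine and data, you can also just time the chunking loop from earlier. A minimal sketch, reusing the fixed polynomial and the data file from above:

package main

import (
	"fmt"
	"io"
	"os"
	"time"

	"github.com/restic/chunker"
)

func main() {
	f, err := os.Open("data") // the test file created with dd earlier
	if err != nil {
		panic(err)
	}
	defer f.Close()

	chk := chunker.New(f, 0x3dea92648f6e83)
	buf := make([]byte, 16*1024*1024)

	start := time.Now()
	var total uint64
	for {
		chunk, err := chk.Next(buf)
		if err == io.EOF {
			break
		}
		if err != nil {
			panic(err)
		}
		total += uint64(chunk.Length)
	}

	d := time.Since(start)
	fmt.Printf("chunked %d bytes in %v (%.1f MB/s)\n",
		total, d, float64(total)/d.Seconds()/1e6)
}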

Conclusion

Content-Defined Chunking can be used to split data into smaller chunks in a deterministic way so that the chunks can be rediscovered if the data has slightly changed, even when new data has been inserted or data was removed. The chunker package provides a fast implementation in pure Go with a liberal license and a simple API. If you use the chunker package for something cool, please let us know by opening an issue in the repo over at GitHub!