Lessons in Go Learned while Implementing SHIELD
Quick background: What is SHIELD?
SHIELD is a backup solution for Cloud Foundry and BOSH deployed services such as Redis, PostgreSQL, and Docker. (For the interested, here is a quick summary of the basics of BOSH and Cloud Foundry.) The original design was inspired by a client’s need to have both broad and granular backups of their private Cloud Foundry and its ecosystem. Specifically, in addition to being able to recover from a meteor strike they also wanted to be able to create more granular backups so they could restore specific containers, credentials, databases, and so on. Since there was not an existing backup solution of this type available for Cloud Foundry/etc., we designed a new solution named SHIELD.
Functions as Fields in Structs
Something we employed both in the CLI and in the server-side code was the ability to validate the received data to help keep out both non-sense and malicious values. Since different fields were expected to have different inputs, we ended up with several validation functions checking everything from whether input is appropriately structured and valid JSON to checking the values and types. There were so many different functions floating around and we needed a way to consistently work them into the input validation process.
The example code below is a scaled down version of the process. There are
Field structs, but there is also a function type declared
FieldValidator type allows any function to be
stored in the
Validator field as long as it (1) takes in a
string and an
int and (2) outputs an
error. You can see from the example that this
holds true not just for the named functions
InputIsNotSmallEnough but anonymous functions as well. The value in
Value is evaluated when the form is shown and instead of some complicated
legwork to match the appropriate validation functions to their field outside
the struct, I can just call the validation function stored in the struct as
Click here to test on the Go Playground.
Using Interfaces with JSON
Some of you may be familiar with the concept of “function overloading”. For
those who aren’t, function overloading is when a language permits you to
create multiple functions with the same name. A trivial example might be to
add functions: one for adding ints and another for adding
floats. This allows you to avoid cluttering your code with function names
BUT: Go explicitly doesn’t support function overloading (fourth to last bullet point). They provide a quick explanation about why on their FAQ page here. And that’s ok - we don’t really need function overloading to accomplish our goals. Why? Because we have interfaces!
In the SHIELD project we were mostly concerned with this for error handling in JSON: we wanted to use a single function for writing JSON errors to the client calling the API endpoint. Tying this into what we did above with validation functions in structs, interfaces provided a consistent way to pass JSON error payloads from different sources. Simplified example:
Click here to test on the Go Playground.
A Race Condition in the Pipes
First, what is a race condition?
“A race condition occurs when one goroutine modifies a variable and another reads it or modifies it without any synchronization.” –Source
The code that caused our race condition was a result of the way we had originally tried to implement the backup and restore processes. SHIELD has both target and store plugins so a user can select what type of data is being backed up or restored (e.g. PostgreSQL) and what type of storage the data is being backed up to or restored from (e.g. S3). To run the backups and restores, we created stdin, stdout, stderr pipes for both the target and store. In the case of a backup, the store read what was being sent by the target (i.e. the target’s standard output was piped into the store’s standard input). Likewise, during a restore the target read what was being sent by the store (i.e. the store’s standard output was piped into the target’s standard input).
While this conceptually makes sense and should work, we ran into issues during testing when we started to see non-reproducible, inconsistent, and seemingly random failures when we deployed the dev release to our testing environment. Things like leaking pipes where not all the data from stdout was making it to the corresponding stdin and corrupted archives. We upped the ante on our testing adding in various payload sizes, random sleeps, etc. but no matter what we did to our tests they still all passed. After our attempts to expose the issues we were experiencing with tests failed, we began to suspect we had created non-deterministic code. But how? To help show what happened, here’s a trimmed down example:
Due to limitations in Go Playground this code must be run locally to reproduce, please see example code on Github.
In this example,
sort are taking over the roles of
store from SHIELD. Skimming over the code you can see that it is going to
drain stdout and stderr into their respective pipes to
sort the output of
sort.Start is run first so that
sort is ready to sort
the output of
ls, similar to the way that store would wait to read the
output of target (or vice versa). At this point,
sort is running in the
background and will continue to do so until its stdin is complete. Since
sort.Stdin is defined as
ls.StdoutPipe, that means waiting for the
ls.StdoutPipe to complete. Then Go hits the
Wait command. Taking a look
Wait and then looking back at the code, we can see that
the descriptors that
Wait needs to close before exiting are the read and
stdout pipes. This read is the same read that
Drain is trying to read
rd) from. Since
Wait is trying to close what
Drain is trying to read,
the program fails with a data race condition like so:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91
$ go build -race $ ./shield-race ================== WARNING: DATA RACE Write by main goroutine: os.(*file).close() /private/var/folders/q8/bf_4b1ts2zj0l7b0p1dv36lr0000gp/T/workdir/go/src/os/file_unix.go:129 +0x1be os.(*File).Close() /private/var/folders/q8/bf_4b1ts2zj0l7b0p1dv36lr0000gp/T/workdir/go/src/os/file_unix.go:118 +0x88 os/exec.(*Cmd).closeDescriptors() /private/var/folders/q8/bf_4b1ts2zj0l7b0p1dv36lr0000gp/T/workdir/go/src/os/exec/exec.go:241 +0x9f os/exec.(*Cmd).Wait() /private/var/folders/q8/bf_4b1ts2zj0l7b0p1dv36lr0000gp/T/workdir/go/src/os/exec/exec.go:390 +0x4c4 main.main() /Users/quinn/dev/go/src/github.com/starkandwayne/shield-race/main.go:41 +0x3b7 Previous read by goroutine 6: os.(*File).read() /private/var/folders/q8/bf_4b1ts2zj0l7b0p1dv36lr0000gp/T/workdir/go/src/os/file_unix.go:211 +0x84 os.(*File).Read() /private/var/folders/q8/bf_4b1ts2zj0l7b0p1dv36lr0000gp/T/workdir/go/src/os/file.go:95 +0xbc bufio.(*Scanner).Scan() /private/var/folders/q8/bf_4b1ts2zj0l7b0p1dv36lr0000gp/T/workdir/go/src/bufio/scan.go:180 +0x78a main.Drain() /Users/quinn/dev/go/src/github.com/starkandwayne/shield-race/main.go:12 +0x1a8 Goroutine 6 (running) created at: main.main() /Users/quinn/dev/go/src/github.com/starkandwayne/shield-race/main.go:32 +0x27b ================== output> ---------- 1 root admin 0B Sep 15 23:00 .file output> -rw-r--r--@ 1 root wheel 313B Aug 22 22:35 installer.failurerequests output> -rw-rw-r-- 1 root admin 0B Aug 22 17:35 .DS_Store output> d--x--x--x 9 root wheel 306B Dec 9 20:52 .DocumentRevisions-V100 output> d-wx-wx-wt 2 root wheel 68B Sep 1 18:30 .Trashes output> dr-xr-xr-x 2 root wheel 1B Dec 9 20:52 home output> dr-xr-xr-x 2 root wheel 1B Dec 9 20:52 net output> dr-xr-xr-x 3 root wheel 4.1K Dec 9 20:52 dev output> drwx------ 5 root wheel 170B Apr 11 2015 .Spotlight-V100 output> drwx------ 1927 root wheel 64K Dec 21 16:00 .fseventsd output> drwxr-xr-x 2 root admin 68B Jan 8 2015 .PKInstallSandboxManager output> drwxr-xr-x 3 root wheel 102B Dec 21 15:23 src ================== WARNING: DATA RACE output> drwxr-xr-x 6 root admin 204B Oct 13 14:18 Users Write by main goroutine: os.(*file).close() /private/var/folders/q8/bf_4b1ts2zj0l7b0p1dv36lr0000gp/T/workdir/go/src/os/file_unix.go:129 +0x1be os.(*File).Close() /private/var/folders/q8/bf_4b1ts2zj0l7b0p1dv36lr0000gp/T/workdir/go/src/os/file_unix.go:118 +0x88 os/exec.(*Cmd).closeDescriptors() /private/var/folders/q8/bf_4b1ts2zj0l7b0p1dv36lr0000gp/T/workdir/go/src/os/exec/exec.go:241 +0x9f output> drwxr-xr-x 33 root wheel 1.2K Dec 21 15:23 . os/exec.(*Cmd).Wait() /private/var/folders/q8/bf_4b1ts2zj0l7b0p1dv36lr0000gp/T/workdir/go/src/os/exec/exec.go:390 +0x4c4 main.main() /Users/quinn/dev/go/src/github.com/starkandwayne/shield-race/main.go:42 +0x3dc Previous read by goroutine 7: os.(*File).read() /private/var/folders/q8/bf_4b1ts2zj0l7b0p1dv36lr0000gp/T/workdir/go/src/os/file_unix.go:211 +0x84 os.(*File).Read() /private/var/folders/q8/bf_4b1ts2zj0l7b0p1dv36lr0000gp/T/workdir/go/src/os/file.go:95 +0xbc bufio.(*Scanner).Scan() /private/var/folders/q8/bf_4b1ts2zj0l7b0p1dv36lr0000gp/T/workdir/go/src/bufio/scan.go:180 +0x78a output> drwxr-xr-x 33 root wheel 1.2K Dec 21 15:23 .. main.Drain() /Users/quinn/dev/go/src/github.com/starkandwayne/shield-race/main.go:12 +0x1a8 Goroutine 7 (finished) created at: main.main() /Users/quinn/dev/go/src/github.com/starkandwayne/shield-race/main.go:33 +0x2ed ================== output> drwxr-xr-x+ 3 root wheel 102B Nov 10 01:46 .MobileBackups output> drwxr-xr-x+ 67 root wheel 2.2K Nov 24 19:23 Library output> drwxr-xr-x@ 2 root wheel 68B Oct 13 14:18 .vol output> drwxr-xr-x@ 2 root wheel 68B Oct 13 14:18 Network output> drwxr-xr-x@ 4 root wheel 136B Dec 9 20:50 System output> drwxr-xr-x@ 6 root wheel 204B Oct 13 14:18 private output> drwxr-xr-x@ 13 root wheel 442B Nov 24 19:23 usr output> drwxr-xr-x@ 39 root wheel 1.3K Dec 9 20:50 bin output> drwxr-xr-x@ 59 root wheel 2.0K Dec 9 20:50 sbin output> drwxrwxr-t@ 2 root admin 68B Oct 13 14:18 cores output> drwxrwxr-x+ 75 root admin 2.5K Dec 18 21:56 Applications output> drwxrwxr-x@ 3 root wheel 102B Sep 2 15:11 opt output> drwxrwxrwt@ 5 root admin 170B Dec 21 17:03 Volumes output> lrwxr-xr-x@ 1 root wheel 11B Oct 13 14:17 etc -> private/etc output> lrwxr-xr-x@ 1 root wheel 11B Oct 13 14:17 tmp -> private/tmp output> lrwxr-xr-x@ 1 root wheel 11B Oct 13 14:18 var -> private/var output> total 45 Found 2 data race(s)
Note that two data races are found - one for each of the goroutines draining the stderr pipes.
Since this example is a lot simpler than the one we ran into in SHIELD it
still works in these sense that when you build the binary without the data
go build -race, the code will compile and run
as expected. This is because you are essentially just running
ls | sort
which, unless you have a truly massive set of subdirectories, should resolve
itself relatively quickly. The same cannot be said of of the similar data
race we created and found in SHIELD.
To see exactly how the race condition appeared in SHIELD at that time, take
a look at task.go in commit 6020baa. When we researched our
data race, we discovered that it is a known problem with exec’ing certain
commands/pipes (see here, here, and here). As
a result, we ended up resolving the issue by implementing a BASH
script to handle the pipes. The first commit with this fix is
3548036. Since then, the relevant go code has been
Open Invitation: Want to test and contribute to SHIELD?
To setup a local testing environment for SHIELD, please use the
script in our testbed repository after setting up a local
Cloud Foundry on BOSH Lite. Our testbed deploys Cloud Foundry
and docker-postgres with SHIELD and sets up some initial dummy values in
The CLI is pretty straightforward - for example
shield create target and
shield list stores. There are also some aliased commands, e.g.
shield list. For a full list of commands, please take a look at the
CLI documentation on the project README.
We welcome feedback and pull requests!
I hope everyone is enjoying their winter holidays and Merry Christmas to those who celebrate! :)