Go at Sourcegraph - Serving Terabytes of Git Data, Tracing App Performance, and Caching HTTP Resources
At a high level, Sourcegraph has 2 parts. The first is Sourcegraph.com, the application that users see, whose architecture and code patterns we presented at Google I/O 2014. The second is srclib, our multi-language source code analysis engine, which is completely open source. Both of these parts are built using a number of libraries and systems that we’ve also released as open source, but we’re just going to cover those that Sourcegraph.com uses, since those are the most broadly useful.
Sourcegraph.com, our main application, manages fetching and updating several terabytes of VCS (git/hg) data, scheduling builds of projects, integrating with external APIs, storing user data, and serving the web app. Here’s what we’ve built with Go to make this possible.
To access and fetch git and hg repositories in Go, we wrote go-vcs. It provides a common Repository interface that has 4 implementations: git using git2go (Go bindings for libgit2), git by shelling out to the “git” command, hg using hgo (a native Go hg library), and hg by shelling out to the “hg” command. An extensive test suite (using Go’s testing package) checks that every implementation behaves identically.
In addition to providing VCS-specific methods such as GetCommit, ResolveBranch, Diff, and Commits (to get a list), the vcs.Repository interface can return a virtual FileSystem that can access files and directories as of a given revision. This FileSystem has the standard Open, Stat, ReadDir, etc., methods, which means it works with other libraries that expect this standard interface. It also lets us use mapfs to test it.
Here’s an example of using it to show a file at a specific revision:
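The sketch below illustrates the idea rather than go-vcs’s actual API. The CommitID type, the one-method Repository interface, fakeRepo, and ShowFile are all hypothetical names, and the standard library’s io/fs and testing/fstest packages stand in for the FileSystem interface and mapfs mentioned above.

```go
package main

import (
	"fmt"
	"io/fs"
	"testing/fstest"
)

// CommitID and Repository are hypothetical, simplified stand-ins for the
// go-vcs types; the real method set and signatures may differ.
type CommitID string

type Repository interface {
	// FileSystem returns a read-only view of the tree as of the given commit.
	FileSystem(at CommitID) (fs.FS, error)
}

// fakeRepo maps commit IDs to in-memory trees, the way a test double would.
type fakeRepo map[CommitID]fstest.MapFS

func (r fakeRepo) FileSystem(at CommitID) (fs.FS, error) {
	tree, ok := r[at]
	if !ok {
		return nil, fmt.Errorf("unknown commit %q", at)
	}
	return tree, nil
}

// ShowFile reads one file as it existed at the given commit.
func ShowFile(repo Repository, at CommitID, name string) (string, error) {
	fsys, err := repo.FileSystem(at)
	if err != nil {
		return "", err
	}
	data, err := fs.ReadFile(fsys, name)
	if err != nil {
		return "", err
	}
	return string(data), nil
}

func main() {
	repo := fakeRepo{
		"c1": {"README": {Data: []byte("first draft\n")}},
		"c2": {"README": {Data: []byte("second draft\n")}},
	}
	for _, id := range []CommitID{"c1", "c2"} {
		s, err := ShowFile(repo, id, "README")
		if err != nil {
			fmt.Println("error:", err)
			return
		}
		fmt.Printf("%s: %s", id, s)
	}
}
```

Because ShowFile only depends on the interface, the same function would work unchanged against a repository backed by a real VCS implementation.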
To scale go-vcs to work on hundreds of thousands of repositories and
terabytes of data, we built vcsstore. It has an HTTP server, which
provides HTTP handlers and an HTTP API to access data from any stored
repository (at URL paths like
/git/https/github.com/user/repo/.branches/mybranch), and an API
client, which implements the same vcs.Repository interface with methods
that issue HTTP requests to the vcsstore server. This means our code can
access remote repositories over HTTP as though they were local
repositories. By setting HTTP cache headers on the server and using a
caching HTTP transport, we get nearly automatic caching of VCS data,
which makes our app fast.
Web application integration testing using Selenium via go-selenium
While Go’s explicit error return values are nice in most cases, they can
lead to verbose test code (if you check every error, even ones unrelated
to the test at hand) or flaky test code (if you ignore error returns).
To solve this problem, go-selenium provides wrapper types WebDriverT
and WebElementT intended for use by test code, which combine a web driver
or element and a *testing.T and call t.Fatal if a method from the
underlying driver or element returns an error. This lets test authors
omit error checks but still report granular test errors.
Here’s what a test case looks like:
Test helpers in Go present another problem, though: if a helper function
calls t.Log (or anything that calls it, such as t.Fatal), the message is
associated with the file and line in the helper function, not in the
test case that called the helper. We made a quick hack to show the test
case’s file and line, which helps us identify the source of test
failures better.
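One way to recover the calling test’s location is runtime.Caller, sketched below. This is an illustration of the general technique and not necessarily how the hack described above works. callerLocation and assertEqual are hypothetical names. Since Go 1.9, calling t.Helper() at the top of a helper gives the same effect for messages logged through *testing.T.

```go
package main

import (
	"fmt"
	"path/filepath"
	"runtime"
)

// callerLocation reports the file and line of the function that called the
// helper invoking it. Skip 0 is callerLocation, 1 is the helper, 2 is the
// helper's caller.
func callerLocation() string {
	_, file, line, ok := runtime.Caller(2)
	if !ok {
		return "unknown:0"
	}
	return fmt.Sprintf("%s:%d", filepath.Base(file), line)
}

// assertEqual is a hypothetical helper. It returns the failure message
// instead of calling t.Errorf so the sketch runs as a plain program.
func assertEqual(got, want int) string {
	if got == want {
		return ""
	}
	return fmt.Sprintf("%s: got %d, want %d", callerLocation(), got, want)
}

func main() {
	// The reported location is this call site, not a line inside assertEqual.
	fmt.Println(assertEqual(1+1, 3))
}
```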
Fast HTTP caching with httpcache, multicache, and s3cache
Our front-end app hits our HTTP API to fetch all the data it needs, so we can use standard HTTP caching techniques to cache data. (We’ve found this to be far simpler than if we had an application-specific cache, in Redis for example, and had to reinvent caching and eviction semantics and behavior.) At first our app just used Greg Jones’ httpcache, which provides a caching HTTP transport that writes to memory and a local disk. The beauty of net/http and Go’s trust in interfaces shines here; it was super easy to drop in this caching transport, and the rest of our application logic didn’t need to change.
But as we grew to multiple app servers and performed frequent redeploys, the hit rate of our servers’ memory and disk caches declined because each server’s cache was separate and was purged on each deploy.
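Thankfully, it was easy to write another implementation of the httpcache.Cache interface, s3cache, which stores cache entries in S3 so that all of our app servers share one cache that survives deploys.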
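As a rough sketch of why a shared backend helps, the example below redeclares the three-method Cache interface from httpcache and implements it over a hypothetical in-memory Bucket that stands in for a remote object store. Bucket and BucketCache are invented names; no S3 calls are made and this is not s3cache’s code.

```go
package main

import (
	"fmt"
	"sync"
)

// Cache has the same three methods as httpcache.Cache, redeclared here so
// the sketch compiles without the library.
type Cache interface {
	Get(key string) (responseBytes []byte, ok bool)
	Set(key string, responseBytes []byte)
	Delete(key string)
}

// Bucket is a hypothetical in-memory stand-in for a remote object store.
type Bucket struct {
	mu      sync.Mutex
	objects map[string][]byte
}

func NewBucket() *Bucket { return &Bucket{objects: make(map[string][]byte)} }

// BucketCache implements Cache by reading and writing objects in a Bucket.
type BucketCache struct{ bucket *Bucket }

func (c BucketCache) Get(key string) ([]byte, bool) {
	c.bucket.mu.Lock()
	defer c.bucket.mu.Unlock()
	b, ok := c.bucket.objects[key]
	return b, ok
}

func (c BucketCache) Set(key string, responseBytes []byte) {
	c.bucket.mu.Lock()
	defer c.bucket.mu.Unlock()
	// Copy so later changes to the caller's slice cannot alter the entry.
	c.bucket.objects[key] = append([]byte(nil), responseBytes...)
}

func (c BucketCache) Delete(key string) {
	c.bucket.mu.Lock()
	defer c.bucket.mu.Unlock()
	delete(c.bucket.objects, key)
}

func main() {
	bucket := NewBucket()

	// Two app servers, each with its own Cache value over the same bucket.
	var serverA, serverB Cache = BucketCache{bucket}, BucketCache{bucket}
	serverA.Set("GET /repos/x", []byte("cached response"))

	b, ok := serverB.Get("GET /repos/x")
	fmt.Println(string(b), ok) // serverB sees what serverA stored

	// A "redeployed" server builds a fresh Cache and still gets a hit.
	redeployed := BucketCache{bucket}
	_, ok = redeployed.Get("GET /repos/x")
	fmt.Println(ok)
}
```

Any type with these three methods can be handed to the caching transport, which is what makes swapping the storage backend a local change.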
However, the HTTPS latency to/from S3 was a significant overhead, so we built multicache, which let us specify cache policies such as “when reading, try the in-memory cache first, then disk, then S3” and “when writing, return after the item has been written to memory, but continue writing to disk and S3 asynchronously.” This gives us near-RAM speeds for most frequently used cache entries but with the high hit rates of using a remote persisted cache.
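The two policies quoted above can be sketched as a composite cache. This is a simplified illustration and not multicache’s actual API: Tiered, MapCache, and Wait are hypothetical names, every tier is an in-memory map, and the sketch does not copy entries back into faster tiers after a hit in a slower one.

```go
package main

import (
	"fmt"
	"sync"
)

// Cache matches the method set of httpcache.Cache.
type Cache interface {
	Get(key string) (responseBytes []byte, ok bool)
	Set(key string, responseBytes []byte)
	Delete(key string)
}

// MapCache is a mutex-guarded map, used here for every tier.
type MapCache struct {
	mu sync.Mutex
	m  map[string][]byte
}

func NewMapCache() *MapCache { return &MapCache{m: make(map[string][]byte)} }

func (c *MapCache) Get(key string) ([]byte, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	b, ok := c.m[key]
	return b, ok
}

func (c *MapCache) Set(key string, b []byte) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.m[key] = b
}

func (c *MapCache) Delete(key string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	delete(c.m, key)
}

// Tiered is a hypothetical composite cache; tiers are ordered fastest first.
type Tiered struct {
	tiers []Cache
	wg    sync.WaitGroup
}

func NewTiered(tiers ...Cache) *Tiered { return &Tiered{tiers: tiers} }

// Get tries each tier in order and returns the first hit.
func (t *Tiered) Get(key string) ([]byte, bool) {
	for _, c := range t.tiers {
		if b, ok := c.Get(key); ok {
			return b, true
		}
	}
	return nil, false
}

// Set writes to the first tier before returning and to the remaining tiers
// in background goroutines.
func (t *Tiered) Set(key string, b []byte) {
	b = append([]byte(nil), b...) // private copy shared by all tiers
	for i, c := range t.tiers {
		if i == 0 {
			c.Set(key, b)
			continue
		}
		t.wg.Add(1)
		go func(c Cache) {
			defer t.wg.Done()
			c.Set(key, b)
		}(c)
	}
}

// Delete removes the key from every tier synchronously.
func (t *Tiered) Delete(key string) {
	for _, c := range t.tiers {
		c.Delete(key)
	}
}

// Wait blocks until background writes finish (useful in tests and shutdown).
func (t *Tiered) Wait() { t.wg.Wait() }

func main() {
	mem, disk, remote := NewMapCache(), NewMapCache(), NewMapCache()
	cache := NewTiered(mem, disk, remote)
	cache.Set("k", []byte("v"))
	cache.Wait() // only so the demo can inspect the slower tiers

	// After a "deploy", memory and disk start empty but the remote tier hits.
	fresh := NewTiered(NewMapCache(), NewMapCache(), remote)
	b, ok := fresh.Get("k")
	fmt.Println(string(b), ok)
}
```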
Distributed application tracing with apptrace
To improve performance in a web app that hits multiple services to serve each request, it’s useful to see the timings for each action, no matter which host it occurred on or how deep in the call stack it is. Just looking at the top-level page generation time isn’t enough. It’s also important to see metadata like HTTP cache headers to see what’s actually occurring and why things are slow.
We took ideas from Google’s Dapper paper, implementation tips from Twitter’s Zipkin, and code from Coda Hale’s lunk to create a distributed application tracing system in Go called apptrace. In keeping with the principles of simple distributed tracing in the Dapper paper, we get near-total visibility into the performance of our distributed services by instrumenting two external call points: external HTTP API calls by using an HTTP transport that records to apptrace, and SQL queries by wrapping modl.SqlExecutor.
We’ve been able to release these projects as open source in part because of Go’s easy composition. Interfaces make it easy to improve our app by providing better, faster implementations of an interface, without affecting the interface’s contract. We often develop these improved implementations in separate repositories so they can’t introduce complex interdependencies into our app. This is what occurred with our VCS data storage and our HTTP cache: we started out with a simple concrete implementation in our app’s main codebase and then developed improved implementations of the same interface in external repositories. Once done, open-sourcing these repositories is a no-brainer because they’re already standalone projects.
Also, Go makes it far easier to create separate projects than any other language we’re familiar with. All it takes is a .go file in a directory. Other languages require package description files, complex directory structures, install/setup scripts, etc.
We think all of this means that Go’s open-source ecosystem is far more mature than those of languages of comparable age and popularity. That, combined with a beautifully designed and implemented standard library, makes Go a joy to work with.
From all of us at Sourcegraph, we wish Go a very happy birthday!