Or, taming connection state and thundering herds.
There comes a time in the life of many programs when you need to maintain a persistent connection to a server. When that server goes down (as it will), you’ll need to reconnect.
A reconnect loop needs to do two things: to increase the wait time between attempts after repeated failures (i.e. backoff); and to randomize wait times to avoid the thundering herd problem.
The stateful start.
There are several libraries for Go that implement backoff with delay, and their APIs are similar to this:
I did this at first too, and I found the approach to be limiting:
It turns control of your function over to the backoff package, making it more difficult to reason about program flow since you don’t know for sure when it might be called.
It doesn’t allow you to restart a backoff process mid-try if you get a signal that you should retry from zero. You can pass in a retry channel to support this, but it complicates the API for simpler uses and offers little flexibility in backoff logic when a retry needs to occur.
It doesn’t allow you to easily change behavior between fatal and
retryable errors in
f(). You could add a backoff package error
wrapper for these, but then
f() needs to be aware of them. That
muddies the boundaries between your code and the backoff package.
Any outside state must be updated in the body of
f(), another blow
to reasoning about program flow.
A stateless alternative.
I’ve used the following as an alternative in a couple of projects:
Typical use looks like this:
You can find real world usage in this pumpWrites function, which maintains a persistent connection to a Scribe server for writing call trace data.
This has several advantages:
The calling function has complete control over program flow. It can
time.Sleep() to backoff for simplicity, or it can use
select if other signals are in play.
e.g. you have a
done channel to signaling that retries should stop:
It allows you to reset the backoff counter in the middle of a retry if you have an out-of-band reset signal (e.g. ZooKeeper notifying you the server is back up):
The backoff policy itself is data only and holds no backoff state, allowing safe concurrent use and no need for allocations when many processes are backing off at the same time.
All state, including the connection itself and the attempt counter, is
held entirely in the
retryConn function. You can reason about the
state of the connection without looking anywhere else. The attempt
counter is simply an integer, not a property hidden behind an opaque
struct. It has a zero value that makes sense.
If a fatal error occurs when dealing with
conn, you can easily
return it from
retryConn without disrupting another process. You can
decide whether an error is fatal without involving the backoff policy.
Since it’s about 30 lines of code, it can be easily copied into any project without introducing an external dependency.
Most important is that the function looks like what it does. This is a thing I love about programming in Go. There’s no hidden state or magic behavior. The normal function flow lives at a reasonable two levels of indentation.
I’ve always found this array-based backoff sufficient, but if you want
to calculate the delays, you can elevate
Duration() into an
interface and implement that however you’d like:
And always remember to seed your random number generator!