Reactive planning and reconciliation in Go
I wrote a quick introduction about why I think reactive planning is a cloud native pattern and I published an article about control theory, but I have obviously just scratched the surface of this topic. I have 470 pages to read from the book Designing Distributed Control Systems: A Pattern Language Approach. It will take me forever.
Introduction
It is easier to explain how powerful reactive planning is by looking at an example. I wrote one in Go, and in this article I explain the most important parts.
Just to summarize, I think resiliency in modern applications is crucial and very hard to achieve in practice, mainly because we need to learn and implement a set of patterns and rules. When I think about a solid application inside a microservices environment, or in a highly distributed ecosystem, my mind drives me to a different industry. I think about tractors, boilers, and whatever does not depend on a static state stored inside a database but on a dynamic source of truth.
When I think about an orchestrator it is clear to me that there is no way to trust a cache layer in order to understand how many resources (VMs, containers, pods) are running. We need to check them live, because you never know what is happening to your workload. Those kinds of applications are sensitive to latency, and they require a fast feedback loop.
That’s one of the reasons why, when you read about Kubernetes internals, you read about reconciliation loops and informers.
Our use case
I wrote a small PoC, an application I called cucumber. It is available on GitHub and you can run it if you like.
It is a CLI tool that provisions a set of resources on AWS. The provisioned architecture is very simple: you define a number of EC2 instances, they get created and assigned to a Route53 record, and when the record does not exist the application creates it.
I had to learn how to think about problems like this. At the beginning of my career the approach was simple: “I know what to do, I need to write a program that reads the request and does what needs to be done”. So you start configuring the AWS client, parsing the request and making a few API requests.
Everything runs perfectly and you succeed at creating 100 clusters. Then things start to get more complicated: you have more resources to provision, like load balancers, subnets and security groups, and more business logic related to who can do what. Requests grow to more than 5 per execution and the logic sometimes does not run as linearly as it did before. At this point you have a lot of conditions, and figuring out where the procedure failed and how to fix the issue becomes very hard.
This is why my current approach is different: when I recognize this kind of pattern I always start from the current state of the system.
You can argue that at the first execution it is obvious that nothing is there, so you can just create whatever needs to be created. And I agree with that, but assuming that you do not know your starting point drives you to implement the workflow in a way that is idempotent. When you achieve this goal you can re-run the same workflow over and over again: if there is nothing to do the program won’t do anything, otherwise it is smart enough to figure out what needs to be done. In this way you can create something called a reconciliation loop.
Reconciliation loop
The idea of re-running the procedure over and over, assuming you do not know where you left off, is very powerful. Following our example, if the creation flow does not finish because AWS returned a 500, you won’t be stuck in a situation where you do not know how to complete the procedure; you will just wait for the next re-execution of the flow, and it will be able to figure out what is already created.
In my example this pattern works great when provisioning the Route53 DNS record, because DNS propagation can take a lot of time. In order to figure out whether the DNS record already exists, or whether the right amount of IPs is attached to it, I use net.LookupIP; it is a perfect example of a procedure that can take an unknown amount of time to converge.
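To make that concrete, here is a minimal sketch of such a check; the helper name and the expected-count parameter are illustrative, not cucumber’s actual code:

```go
import "net"

// recordConverged is an illustrative helper, not cucumber's code: it looks up
// the record and reports whether the expected number of IPs is attached yet.
func recordConverged(record string, want int) bool {
	ips, err := net.LookupIP(record)
	if err != nil {
		// The record may simply not have propagated yet; the next run
		// of the reconciliation loop will check again.
		return false
	}
	return len(ips) == want
}
```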
Reactive planning
At the very least, a reconciliation loop can be explained as a simple loop that executes a procedure forever. But how do you write a workflow that is able to understand the state of the system and autonomously make a plan to fix the gap between current and desired state? This is what reactive planning does, and that’s where control theory comes in!
Let’s start with a bit of Go. Procedure and Plan are the fundamental interfaces to get familiar with:
- A Plan is a collection of Procedures. Its Create function figures out the state of the system, adding procedures dynamically.
- A Procedure is the unit of work, and it needs to be as small as possible. The cool part is that procedures can return other procedures (which can return other procedures as well), and in this way they build resilience. If a procedure returns an error the Plan is marked as failed.
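To make the shape of those interfaces concrete, here is a rough sketch based on the description above; the exact definitions live in the github.com/gianarb/planner module, so treat the method names and signatures here as an approximation:

```go
import "context"

// Procedure is the unit of work: Do performs it and can return more
// procedures to execute next (sketch; see the planner module for the
// real definitions).
type Procedure interface {
	Name() string
	Do(ctx context.Context) ([]Procedure, error)
}

// Plan looks at the current state of the system and returns the
// procedures needed to reach the desired state.
type Plan interface {
	Name() string
	Create(ctx context.Context) ([]Procedure, error)
}
```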
```go
// Scheduler takes a plan and it executes it.
type Scheduler struct {
	// stepCounter keeps track of the number of steps executed by the scheduler.
	// It is used for debugging and logged out at the end of every execution.
	stepCounter int
	// logger is an instance of the zap.Logger
	logger *zap.Logger
}
```
Plan and Procedure are crucial, but we need a way to execute a plan: that is what the scheduler is for. The Scheduler has an Execute function that accepts a Plan and executes it until there is nothing left to do. Procedures can return other procedures, which means the scheduler needs to execute all of them recursively.
The way the scheduler understands when the plan is done is via the number of steps returned by the Plan.Create function. The scheduler executes every plan at least twice: if the second time there are no steps left, it means that the first execution succeeded.
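The Execute function itself is in the planner module; based on the behaviour just described, it boils down to something like this sketch (not the module’s actual code):

```go
// Execute runs the plan until Create reports that there is nothing left to
// do. A condensed sketch of the behaviour described above.
func (s *Scheduler) Execute(ctx context.Context, plan Plan) error {
	for {
		steps, err := plan.Create(ctx)
		if err != nil {
			return err
		}
		// No steps returned: the previous pass converged and we are done.
		if len(steps) == 0 {
			return nil
		}
		if err := s.react(ctx, steps); err != nil {
			return err
		}
	}
}
```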
The react function implements the recursion and, as you can see, it is the place where the procedures get executed via step.Do.
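Again as a sketch, under the same assumptions as above (the real implementation lives in the planner module):

```go
// react recursively executes every procedure; a step can return more
// procedures, which get executed in turn. Sketch, not the module's code.
func (s *Scheduler) react(ctx context.Context, steps []Procedure) error {
	for _, step := range steps {
		s.stepCounter++
		s.logger.Info("executing step", zap.String("step", step.Name()))
		next, err := step.Do(ctx)
		if err != nil {
			// A failing procedure marks the whole plan as failed.
			return err
		}
		if len(next) > 0 {
			if err := s.react(ctx, next); err != nil {
				return err
			}
		}
	}
	return nil
}
```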
All the primitives described in this section live in their own Go module, github.com/gianarb/planner, that you can use. Other than what is shown here, the scheduler supports context cancellation and deadlines, so you can set a timeout for every execution.
One of the next big features I will develop is a reusable reconciliation loop for plans. In cucumber it is very raw: just a goroutine and a WaitGroup to keep the main process up:
```go
go func() {
	logger := logger.With(zap.String("from", "reconciliation"))
	scheduler.WithLogger(logger)
	for {
		logger.Info("reconciliation loop started")
		if err := scheduler.Execute(ctx, &p); err != nil {
			logger.With(zap.Error(err)).Warn("cucumber reconciliation failed.")
		}
		time.Sleep(10 * time.Second)
		logger.Info("reconciliation loop ended")
	}
}()
```
But this is too simple, and it does not work in a distributed environment where only one process should run the reconciliation, not every replica.
I wrote this code to help myself internalize and explain what reactive planning means, and also because I think the Go community has a lot of great tools that make use of this concept, like Terraform and Kubernetes, but there are not many low-level or simple-to-understand pieces of code. The next chapter describes how to write your own control plane using reactive planning.
Theory applied to cucumber…
Let’s start looking at the main
function:
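The full main is in the repository; compressed, and with helper names invented for this sketch (newCreationPlanFromYAML, newScheduler and startReconciliation are not real functions from cucumber), the flow looks roughly like this:

```go
// A compressed sketch of the flow in main, not the repository's code.
func main() {
	ctx := context.Background()
	logger, _ := zap.NewProduction()

	// Build the only plan cucumber has, the CreationPlan, from the YAML
	// spec describing the desired cluster (hypothetical helper).
	p := newCreationPlanFromYAML("./cluster.yaml")

	// Hypothetical constructor for the scheduler described earlier.
	scheduler := newScheduler(logger)

	// First execution: on error we log a warning, we do not exit and we
	// do not panic; the reconciliation loop will pick it up again.
	if err := scheduler.Execute(ctx, &p); err != nil {
		logger.With(zap.Error(err)).Warn("first execution failed")
	}

	// startReconciliation (made up for the sketch) spawns the goroutine
	// shown earlier; the WaitGroup keeps the main process up.
	var wg sync.WaitGroup
	wg.Add(1)
	startReconciliation(ctx, scheduler, &p, logger)
	wg.Wait()
}
```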
In cucumber there is only one Plan, the CreationPlan. We create it based on the YAML file that describes the requested cluster: essentially how many EC2 instances we want and the DNS record they should be attached to.
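The exact spec format is defined in the repository; purely as an illustration, you can think of it as a small document that decodes into a struct along these lines (the field names are invented for this sketch):

```go
// ClusterSpec is an illustrative stand-in for the real spec format defined
// in the cucumber repository.
type ClusterSpec struct {
	Name      string `yaml:"name"`      // name of the cluster
	DNSRecord string `yaml:"dnsRecord"` // Route53 record the nodes are attached to
	Instances int    `yaml:"instances"` // desired number of EC2 instances
}
```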
And it gets executed by the scheduler. As you can see, if the scheduler returns an error we do not exit, we do not kill the process, we do not panic! Because we know that things can break, and our code is designed to break just a little, and in a way it can recover from.
After the first execution the process spins up a goroutine, the one I copied above to explain a raw and simple control loop.
It stays in that loop until we kill the process.
In order to test the reconciliation you can try to remove one or more EC2 instances or the DNS record; watching the logs you will see how, inside the loop, the scheduler executes the plan and reconciles the state of the system in AWS with the one you described in the YAML.
You can find the exact command I use to start the process in the cucumber repository.
The steps I wrote in cucumber are not many, and you can find them inside ./cucumber/step:
- create_dns_record
- reconcile_nodes
- run_instance
- update_dns_record
run_instance, for example, is very small: it interacts with AWS via the go-sdk and it creates an EC2 instance:
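The actual step is in the repository; as a sketch of its shape, reusing the Procedure interface sketched earlier and the aws-sdk-go EC2 client (the struct fields here are illustrative):

```go
import (
	"context"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/ec2"
)

// RunInstance is a sketch of the step described above: it asks AWS for one
// EC2 instance and returns an error only if the API call fails.
type RunInstance struct {
	EC2          *ec2.EC2 // aws-sdk-go EC2 client
	AMI          string
	InstanceType string
}

func (r *RunInstance) Name() string { return "run_instance" }

func (r *RunInstance) Do(ctx context.Context) ([]Procedure, error) {
	_, err := r.EC2.RunInstancesWithContext(ctx, &ec2.RunInstancesInput{
		ImageId:      aws.String(r.AMI),
		InstanceType: aws.String(r.InstanceType),
		MinCount:     aws.Int64(1),
		MaxCount:     aws.Int64(1),
	})
	if err != nil {
		// The only place where this step fails the plan.
		return nil, err
	}
	return nil, nil
}
```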
As you can see, the only situation where I return an error is when the ec2.RunInstance call fails, but that is only because this is a simple implementation. Moving forward you can replace the return of that error with other steps: for example you can terminate the cluster and clean up, so you won’t leave broken clusters around, or you can try other steps to recover from that error, leaving it to the next executions (from the reconciliation loop) to finish the workflow.
In my experience reactive planning makes refactoring and development very modular: you do not need to make the whole flow rock solid from day one, which would be very time-consuming, but you always have a clear entry point for future work. Every place where you return or log an error can be replaced at some point with steps, making your flow rock solid based on the observations you make from previous runs.
The reconcile_nodes step is another interesting one, because run_instance only calls AWS and creates one node, but as you can imagine we need to create or terminate a variable amount of them depending on the current state of the system:
- if you required 3 EC2 instances but there are zero of them, you need to run 3 new nodes
- if there are 2 nodes but you required 3, we need 1 more
- if there are 56 nodes but you required 3 of them, we need to terminate 53 EC2s
The reconcile_nodes procedure makes that calculation and returns the right steps:
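A sketch of that calculation, reusing the illustrative types from above (in the repository only the scale-up path exists; see the note about the TODO below):

```go
// ReconcileNodes is a sketch of the step described above: it compares the
// number of running instances with the desired count and returns the steps
// needed to close the gap.
type ReconcileNodes struct {
	Desired int
	Running int
}

func (r *ReconcileNodes) Name() string { return "reconcile_nodes" }

func (r *ReconcileNodes) Do(ctx context.Context) ([]Procedure, error) {
	steps := []Procedure{}
	// One RunInstance step per missing node.
	for i := r.Running; i < r.Desired; i++ {
		steps = append(steps, &RunInstance{})
	}
	if r.Running > r.Desired {
		// TODO: scale down is not implemented yet.
	}
	return steps, nil
}
```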
As you can see I have only implemented the RunInstance step, and there is a TODO left in the code, which means that scale down does not work for now.
It returns the steps required to match the desired state: if there are 2 nodes but we required 3 of them, this step will return only one RunInstance, which will be executed by the scheduler.
The last interesting part of the code is the CreationPlan.Create function. This is where the magic happens. As we saw, the scheduler calls the Create function at least twice for every execution, and its responsibility is to measure the current state and, according to it, calculate the steps required to achieve the state we desire. It is a long function that you have in the repo, but this is the idea:
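The gist, with the same illustrative field and helper names as the previous sketches (describeRunningInstances is a made-up helper that would wrap ec2.DescribeInstances), is roughly:

```go
// Create measures the current state and decides which steps are needed.
// A condensed sketch of the idea, not the repository's code.
func (p *CreationPlan) Create(ctx context.Context) ([]Procedure, error) {
	steps := []Procedure{}

	// Measure the current state: how many instances belonging to this
	// cluster are actually running right now?
	running, err := describeRunningInstances(ctx, p.EC2, p.ClusterName)
	if err != nil {
		return nil, err
	}

	// If current and desired state diverge, add the procedure that closes
	// the gap; if they match, no steps are returned and the scheduler
	// considers the plan done.
	if len(running) != p.DesiredInstances {
		steps = append(steps, &ReconcileNodes{
			Desired: p.DesiredInstances,
			Running: len(running),
		})
	}

	return steps, nil
}
```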
The code checks if the number of running instances is equal to the desired one; if they are different it calls the ReconcileNodes procedure.
Conclusion
This is it! It is a long article, but there is code and a repository you can run! I am enthusiastic about this pattern and the work shown here, because I think it makes the idea clear, and I tried to keep the context as small as possible to stay focused on the workflow and the design.
Let me know if you end up using it! Or if you already do, let me know how it is going @gianarb.