So you’re a shepherd to a herd of microservices. Cool. You get that additional flexibility and clear separation of concerns in your system, whose parts are nicely decoupled. But… do you sleep well at night? Are you sure you won’t be woken up by your alerting?
Sure, you’ve tested all the corner cases of your application, and Sonar shows you have great coverage. But are you really (no, I mean REALLY!) ready for what might happen in production? When that one particular packet is lost between live versions of your services? Or when your service’s machine dies in the middle of a disk write? In large datacenters, multiple physical machines crash every day. Not to mention temporary network partitions, disconnections, delays, packet reordering and who knows what else. You cannot just hope it won’t happen – it will. And the best thing to do is to be prepared for it.
What we use Docker for
Your application needs to be tested for all that. And what better way to test the behavior of a system in doomsday conditions than actually putting it in them? That’s fault injection testing. There are two requirements here. Firstly, we want conditions that are as production-like as possible. Using a Zookeeper cluster? We want the cluster, not a mock or an echo server. Does the system consist of many replicated services with a load balancer in front? We want that too, with services running in separate contexts. That’s what we use Docker for. By spinning up elements of the infrastructure in containers, we can mimic our production environment quite well.
The second requirement is to have a repeatable, automated test suite. Every type of fault must be injected at a precisely specified moment. Packet reordering, packet loss, network partitions, delays, dead services, cluster quorum loss – all of this should be part of a replayable scenario that tests the behavior of the system under given conditions. It’s easy to build this atop Docker with Pumba and BATS.
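To make "replayable scenario" a bit more concrete, here is a minimal sketch in Python. It is not the Elasticproxy test kit: it just models a scenario as a list of faults, each with a fixed offset from the start, and replays them in order. The container names are hypothetical, and the Pumba argument lists (`netem --duration ... delay --time`, `netem --duration ... loss --percent`, `kill --signal`) are assumptions based on Pumba's command-line interface, so check them against `pumba --help` for your installed version.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class FaultStep:
    """One fault, injected a fixed number of seconds after the scenario starts."""
    at_seconds: float
    description: str
    command: Sequence[str]


def netem_delay(container: str, delay_ms: int, duration: str) -> List[str]:
    # Assumed Pumba syntax: add a fixed egress delay to a container for `duration`.
    return ["pumba", "netem", "--duration", duration,
            "delay", "--time", str(delay_ms), container]


def netem_loss(container: str, percent: int, duration: str) -> List[str]:
    # Assumed Pumba syntax: drop a percentage of a container's packets for `duration`.
    return ["pumba", "netem", "--duration", duration,
            "loss", "--percent", str(percent), container]


def kill(container: str, signal: str = "SIGKILL") -> List[str]:
    # Assumed Pumba syntax: send a signal to a container (a "dead service").
    return ["pumba", "kill", "--signal", signal, container]


def run_scenario(
    steps: Sequence[FaultStep],
    execute: Callable[[List[str]], object],
    sleep: Callable[[float], object],
) -> None:
    """Replay the steps in time order.

    `execute` and `sleep` are injected so the same scenario can be dry-run
    (print the commands, skip the waiting) or run for real. Time spent inside
    `execute` is not accounted for, so a real executor should start the
    command in the background (for example with subprocess.Popen).
    """
    elapsed = 0.0
    for step in sorted(steps, key=lambda s: s.at_seconds):
        wait = step.at_seconds - elapsed
        if wait > 0:
            sleep(wait)
        elapsed = step.at_seconds
        execute(list(step.command))


if __name__ == "__main__":
    # Hypothetical container names, for illustration only.
    scenario = [
        FaultStep(2.0, "slow link to proxy", netem_delay("elasticproxy-1", 300, "10s")),
        FaultStep(5.0, "lossy link to proxy", netem_loss("elasticproxy-1", 20, "10s")),
        FaultStep(8.0, "one cluster node dies", kill("zookeeper-2")),
    ]
    # Dry run: print each command instead of running it, and do not wait.
    run_scenario(scenario,
                 execute=lambda cmd: print(" ".join(cmd)),
                 sleep=lambda seconds: None)
```

For a real run you would pass something like `subprocess.Popen` as `execute` and `time.sleep` as `sleep`. The point of keeping the scenario as plain data is that the exact same sequence of faults can be replayed on every test run.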
Fault injection tests
We took that approach while designing, implementing and testing Elasticproxy – a dynamic router and traffic balancer, meant to provide transparent scalability and configurability of Elasticsearch clusters. The proposed solution is distributed, fault-tolerant and highly available by nature. It consists of multiple replicated applications communicating over the network, with many moving parts and potential failure points. A fault injection test kit seems like the right thing to use as an additional layer of testing.
If you want to make sure your distributed cloud application will survive not only mild autumn storms, but hurricanes, earthquakes and the apocalypse as well – this video is for you: