Testing for Resiliency

Understanding Resiliency

Many test suites stop at unit tests and business use-case workflows, covering both positive and negative scenarios. In the world of microservices, where a web service depends on many other microservices to accomplish a complex task, resiliency is an implicit customer expectation that we need to build in, just like any feature.

First, let’s take a look at a few definitions for resiliency.

Resilience (from Wikipedia): Resilience is the ability to provide and maintain an acceptable level of service in the face of faults and challenges to normal operation.
A great book to read on resiliency is Release It! by Michael T. Nygard, in which he writes: A resilient system keeps processing transactions, even when there are transient impulses, persistent stresses, or component failures disrupting normal processing. This is what most people mean when they just say stability. It’s not just that your individual servers or applications stay up and running but rather that the user can still get work done.

Understanding resiliency is the first step. What comes next? Here are some questions that I would ask:

How do I find out the services that threaten my service stability?
How do I simulate instability of the services that I depend on?
How can I ensure that the stability of my service does not regress/drift over time?

There are many tools that tackle each question (note that no single tool does all three well). In the next sections, I’ll list the tools that you might want to look into.
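To make the third question more concrete, here is a minimal Python sketch of a check that could run on every build: it calls a dependency through a timeout-and-fallback wrapper and reports whether the caller stayed within a latency budget. The names call_with_fallback, check_stays_responsive, slow_dependency and broken_dependency, as well as the budget values, are hypothetical and chosen only for illustration. The wrapper is similar in spirit to what a fault tolerance library provides, but it is not the API of any particular library.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

# Shared worker pool; the size is an arbitrary choice for this sketch.
_pool = ThreadPoolExecutor(max_workers=4)


def call_with_fallback(dependency, timeout_seconds, fallback):
    """Run dependency() but give up after timeout_seconds and return fallback."""
    future = _pool.submit(dependency)
    try:
        return future.result(timeout=timeout_seconds)
    except FutureTimeout:
        return fallback  # dependency was too slow
    except Exception:
        return fallback  # dependency failed outright


def check_stays_responsive(dependency, budget_seconds):
    """Return (result, within_budget) for one call to a possibly bad dependency."""
    start = time.monotonic()
    result = call_with_fallback(dependency, budget_seconds / 2, fallback="cached")
    elapsed = time.monotonic() - start
    return result, elapsed <= budget_seconds


def slow_dependency():
    time.sleep(0.6)  # simulated latency, well above the budget used in a test
    return "fresh"


def broken_dependency():
    raise ConnectionError("simulated outage")
```

A test could then assert that check_stays_responsive(slow_dependency, 0.4) returns the fallback value within budget. Note that the worker thread keeps running after the timeout; a real service would also need to bound or cancel that work.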

Discovery of dependencies

Discovering dependencies should be the first step in resiliency testing. Here is the list of popular tools:

We went with the Hystrix Network Auditor Agent because our application is built with the Hystrix fault tolerance library. In my next blog post, I will describe an extension I wrote to the auditor agent so that you do not have to modify application code, along with other productivity improvements.
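The general idea behind dependency discovery, observing outbound network calls without touching application code, can be illustrated with a small Python sketch. This is only an analogy and not how the Hystrix Network Auditor Agent is implemented: it temporarily wraps socket.socket.connect and counts the (host, port) targets the process connects to. The name record_outbound_connections is hypothetical.

```python
import contextlib
import socket
from collections import Counter


@contextlib.contextmanager
def record_outbound_connections():
    """Count the (host, port) target of every connect() made inside the block."""
    seen = Counter()
    original_connect = socket.socket.connect

    def recording_connect(sock, address):
        # Only count IP-style (host, port, ...) addresses; skip Unix sockets.
        if isinstance(address, tuple) and len(address) >= 2:
            seen[(address[0], address[1])] += 1
        return original_connect(sock, address)

    socket.socket.connect = recording_connect
    try:
        yield seen
    finally:
        # Always restore the original method, even if the block raised.
        socket.socket.connect = original_connect
```

Running a representative workload inside `with record_outbound_connections() as seen:` leaves a tally of targets in `seen`. The sketch only sees calls that go through connect(), so code that uses connect_ex() or native libraries would be missed.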

Simulating instability conditions

Depending on the application or service you are testing, you may need a proxy for HTTP traffic, non-HTTP traffic, or both. If your services are hosted in AWS, Chaos Monkey from Netflix is hands-down the most aggressive resiliency testing tool.

HTTP proxies

For HTTP dependencies, here are some criteria to consider when evaluating the tools:

Supports whitelisting, ignoring, and excluding hosts
Supports proxying of both HTTP and HTTPS traffic
Forwards proxied traffic through another proxy
Applies latency delay by target host and port
Applies latency delay by HTTP operation
Records requests and responses for stubbing
Configurable delay (short/long/before timeout/after timeout etc.)
Customizable response (body, multipart, header, response code etc.)
Supports matching rules for hostname, port, URI (full/partial), body, multiplicity
Easy to automate
No modification of application code
Supports various kinds of fault injection
Detailed logging of traffic and payload
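A few of these criteria (matching by HTTP operation and URI, configurable delay, customizable response code and body) can be sketched with nothing but the Python standard library. The example below is an illustrative stand-in for a dependency, not a replacement for the tools in this post; FaultRule, make_stub_server and call_dependency are hypothetical names.

```python
import http.client
import socket
import threading
import time
from dataclasses import dataclass
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


@dataclass
class FaultRule:
    method: str                 # HTTP operation to match, e.g. "GET"
    path_prefix: str            # partial URI match
    delay_seconds: float = 0.0  # latency injected before responding
    status: int = 200           # response code to return
    body: bytes = b"ok"         # response body to return


def make_stub_server(rules):
    """Start a stub dependency on a free local port and return the server."""
    class Handler(BaseHTTPRequestHandler):
        def _respond(self):
            # Drain any request body so the client is not cut off mid-send.
            self.rfile.read(int(self.headers.get("Content-Length", 0)))
            rule = next((r for r in rules
                         if r.method == self.command
                         and self.path.startswith(r.path_prefix)), None)
            if rule is None:
                self.send_error(404, "no matching rule")
                return
            time.sleep(rule.delay_seconds)
            try:
                self.send_response(rule.status)
                self.send_header("Content-Length", str(len(rule.body)))
                self.end_headers()
                self.wfile.write(rule.body)
            except (BrokenPipeError, ConnectionResetError):
                pass  # the client already gave up, which is the point

        do_GET = _respond
        do_POST = _respond

        def log_message(self, format, *args):
            pass  # keep test output quiet

    server = ThreadingHTTPServer(("127.0.0.1", 0), Handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def call_dependency(port, path, timeout):
    """Return the HTTP status, or "timeout" if the stub is too slow."""
    conn = http.client.HTTPConnection("127.0.0.1", port, timeout=timeout)
    try:
        conn.request("GET", path)
        return conn.getresponse().status
    except socket.timeout:
        return "timeout"
    finally:
        conn.close()
```

With rules such as FaultRule("GET", "/slow", delay_seconds=0.5) and FaultRule("GET", "/down", status=503), a test can point the service under test at server.server_address and assert how it behaves when the dependency is slow or failing.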

Some popular open source tools are

WireMock and Node http proxy are both promising tools. Note that WireMock only supports one HTTP dependency per WireMock instance.

Non-http proxies

For non-HTTP dependencies (e.g. UDP/TCP, SOCKS5), in addition to the criteria above, here are some additional considerations that may apply to you:

Reverse DNS lookup of IP addresses
Supports closing of sockets
Able to listen to websockets

Available open source tools are

Other resiliency testing tools

There are a few other categories worth noting and they may apply to your testing. They are

JDBC proxy
- SSL-SQL-Proxy-Server
- P6Spy
- HA-JDBC
- Proxool
- Virtual JDBC
- LDBC
Platform dependent tools
- Mac e.g. ipfw/pfctl
- Linux e.g. tc, iptables
Monitoring

Conclusion

Once you have a good understanding of what you need to test for resiliency, I hope that shortlisting these tools by category and criteria helps you narrow down and pick the right tools for your resiliency testing toolbox. Having a set of principles when evaluating tools is also highly recommended. For example, our principles are:

Prioritize mature OST (Open Source Tools) over COTS (Commercial Off-The-Shelf), and COTS over homegrown DIY
Test development using these tools should be simple and easy, and the resulting code should be readable
Tools should be easy to extend and customize, and/or allow a thin abstraction layer to be added to serve application-specific needs
Tools must be a cost-effective solution considering purchase, training, development and maintenance costs
Tools should scale to enable and increase engineers' productivity across all scrum teams

I welcome any questions, insights or suggestions you have. Please feel free to email me or leave comments on this blog. Thank you for reading!