Why SREs Need to Embrace Chaos Engineering
Reliability and chaos might seem like opposite ideas. But, as Netflix learned in 2010, introducing a bit of chaos—and carefully measuring the results of that chaos—can be a great recipe for reliability.
Although most software is built in a tightly controlled environment and carefully tested before release, the production environment is harsher and far less controlled. Hardware can fail, software can break, and a vast, pitching ocean of services and competing standards churns beneath the modern app. And, of course, end users have an uncanny tendency to accidentally find every possible case where your app misbehaves.
The understanding that there’s no way to account for everything that might go wrong in production gave rise to chaos engineering. During Netflix’s 2010 migration to the cloud, the company designed its systems to tolerate the simultaneous failure of multiple components.
Good design, however, doesn’t guarantee fault tolerance in practice. How a system behaves in a sterile testing environment says little about how it will operate in production. To see how fault-tolerant the system really was, engineer Greg Orzell created a tool that would induce real-world failure conditions by terminating real instances, both at scheduled times and at random.
This project would come to be known as Chaos Monkey. By simulating realistic scenarios for hardware and software failures, the engineers at Netflix could find weaknesses in their design and find opportunities to improve their system’s reliability.
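Conceptually, the core of a Chaos Monkey-style tool is small. The sketch below is a hypothetical illustration, not Netflix’s actual code: it picks a random running instance that has opted in to experiments and terminates it. The boto3 calls, the `chaos-opt-in` tag, and the `dry_run` flag are all assumptions made for the example.

```python
import random

import boto3  # assumes an AWS environment; Chaos Monkey targeted Netflix's own cloud fleet


def terminate_random_instance(tag_key="chaos-opt-in", tag_value="true", dry_run=True):
    """Pick one running, opted-in instance at random and (optionally) terminate it."""
    ec2 = boto3.client("ec2")
    reservations = ec2.describe_instances(
        Filters=[
            {"Name": f"tag:{tag_key}", "Values": [tag_value]},
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )["Reservations"]
    instances = [i["InstanceId"] for r in reservations for i in r["Instances"]]
    if not instances:
        return None  # nothing has opted in to the experiment
    victim = random.choice(instances)
    if dry_run:
        print(f"[dry run] would terminate {victim}")  # rehearse without killing anything
    else:
        ec2.terminate_instances(InstanceIds=[victim])
    return victim
```

The opt-in tag and dry-run mode reflect the spirit of the original tool: failures are injected deliberately and only where teams have agreed to participate.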
Eventually, Netflix would expand Chaos Monkey into an entire Simian Army, including tools like Latency Monkey, Security Monkey, and Conformity Monkey, all designed to simulate failures or identify abnormalities that could indicate opportunities for improvement. Although Netflix later ended support for the Simian Army, the company spun off its most popular tools into standalone projects or integrated them with Spinnaker.
The Intersection of Chaos Engineering and Site Reliability Engineering
Site Reliability Engineering (SRE) is a discipline pioneered by Google that treats operations “as if it’s a software problem.” Examining the details of software systems, SREs continuously test and measure any aspect of the system that can impact reliability. The goal of SRE is to minimize downtime and ensure systems perform as expected.
One of the primary goals of SRE teams is to ensure that systems meet their Service-Level Agreements (SLAs), the uptime promised to customers, and their Service-Level Objectives (SLOs), the specific internal targets SRE teams use to meet the business’s SLAs.
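To make an SLO concrete, teams often translate an availability target into an error budget: the amount of downtime the system can absorb in a window before the objective is breached. The short sketch below is an illustrative calculation rather than part of any particular SRE toolchain; the 99.9 percent target and 30-day window are assumptions.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of downtime allowed in the window before the SLO is breached."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1.0 - slo)


def budget_remaining(slo: float, downtime_minutes: float, window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative means the SLO is blown)."""
    budget = error_budget_minutes(slo, window_days)
    return (budget - downtime_minutes) / budget


# A 99.9% availability SLO over 30 days allows roughly 43.2 minutes of downtime.
print(error_budget_minutes(0.999))                    # ~43.2
print(budget_remaining(0.999, downtime_minutes=10))   # ~0.77 of the budget left
```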
Chaos Monkey’s legacy is an entire sub-industry of tools well suited to the needs of SRE teams looking to preemptively address reliability concerns before they cause serious problems.
Although chaos engineering is a relatively new practice, 60 percent of respondents to a survey for the 2021 State of Chaos Engineering report had performed at least one chaos experiment, citing improved availability and reduced Mean Time To Resolution (MTTR) as benefits. Teams that run chaos experiments more frequently report over 99.9 percent availability for their services.
By inducing real failures within the guardrails of a chaos engineering experiment, SRE teams gain system reliability insights they’d be unlikely to reach through theoretical modeling alone. Rather than leaving hidden defects in your infrastructure to cause incidents, or worse, stumbling over them during incident response, you can apply your post-incident process to real system outputs and get a more accurate view of your system’s state and capabilities.
Creating Chaos Experiments
There are four fundamental steps to defining any chaos engineering experiment:
- Define the steady state of the application. What does it look like when everything is performing as expected? What do its metrics look like during normal operation?
- Ensure you have the proper observability tools to monitor and log the system’s behavior. This means tracking essential metrics such as resource usage, latency, and uptime, along with anything else relevant to system reliability.
- Create a hypothesis. Take a best guess at how the system might handle real-world problems. What happens if the application runs out of memory or disk space? What happens if a hardware failure disables a processor? What happens during a power outage?
- Run the experiment. Use chaos engineering tools to simulate the incident described in your hypothesis and observe what happens. Did the system behave as your hypothesis predicted? If not, what was different? The results can vary widely, from discovering unexpected bugs when a particular microservice is unavailable to identifying emergent behaviors when a service is starved of computing resources. (A minimal sketch of these steps follows this list.)
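As a rough illustration of how these steps fit together, the sketch below captures a steady-state latency metric, states a hypothesis as a simple threshold, injects a failure, and compares the results. The service URL, the 250 ms threshold, and the `inject_failure` placeholder are all hypothetical; a real experiment would lean on your observability stack and chaos tooling.

```python
import statistics
import time

import requests  # used only to probe a health endpoint for latency

SERVICE_URL = "https://example.internal/healthz"  # hypothetical endpoint


def measure_latency_ms(samples: int = 20) -> float:
    """Steps 1-2: capture a steady-state metric (median request latency)."""
    timings = []
    for _ in range(samples):
        start = time.perf_counter()
        requests.get(SERVICE_URL, timeout=5)
        timings.append((time.perf_counter() - start) * 1000)
    return statistics.median(timings)


def inject_failure():
    """Step 4: trigger the fault described in the hypothesis.
    Placeholder: in practice this calls your chaos tool of choice
    (kill a pod, add network latency, exhaust disk, and so on)."""
    raise NotImplementedError("wire this to your chaos tooling")


def run_experiment(max_acceptable_ms: float = 250.0):
    baseline = measure_latency_ms()   # steady state
    # Step 3: hypothesis - latency stays under max_acceptable_ms during the fault.
    inject_failure()
    degraded = measure_latency_ms()   # observe the system under failure
    print(f"baseline={baseline:.1f}ms degraded={degraded:.1f}ms")
    if degraded <= max_acceptable_ms:
        print("Hypothesis held: the system absorbed the failure.")
    else:
        print("Hypothesis failed: investigate the weak point before it causes an outage.")
```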
After running an experiment, your team returns to its hypothesis to evaluate the results.
Both passes and failures count as successful chaos experiments. If the system behaved as you predicted, you have a solid understanding of the system and how it responds to stress. If the results don’t align with your hypothesis, your team has a clearer sense of where the weak points in your system lie.
Discovering these issues before they cause outages lets you proactively address them and adhere to your SLAs. Even if you don’t induce a failure, it’s a great opportunity to understand your system better and develop a more effective hypothesis for your next experiment.
Stress Testing Incident Response
Accurately gauging your organization’s response capabilities means testing your teams as well as your infrastructure. Chaos experiments are often run in production environments with little or no warning. Members of the SRE team may have implemented the experiment, but to the rest of your incident response team, it serves as an unplanned drill.
Using chaos experiments is an excellent way to stress test your entire incident response, from the initial detection to post-incident analysis. Even without implementing any solutions, chaos testing confirms that your incident responders can manually recover the system within SLAs by testing their response times and communication quality. After, you can evaluate the efficacy of your runbooks and refine your processes with the insights you collect.
A well-designed chaos test is a controlled experiment. Even though it runs in a production environment, it should include guardrails that limit its effects. You don’t want your readiness testing to devolve into an actual incident, so you should be able to end it immediately by terminating the test if necessary.
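One common way to build in that guardrail is an automatic abort condition: a watchdog that monitors a key health metric while the experiment runs and halts the fault injection the moment the metric crosses a threshold. The sketch below is a hypothetical illustration; the 5 percent error-rate threshold and the `current_error_rate` and `stop_experiment` placeholders stand in for your monitoring system and chaos tooling.

```python
import time

ERROR_RATE_ABORT_THRESHOLD = 0.05  # assumption: abort if more than 5% of requests fail
CHECK_INTERVAL_SECONDS = 15


def current_error_rate() -> float:
    """Placeholder: query your monitoring system for the live error rate."""
    raise NotImplementedError


def stop_experiment():
    """Placeholder: tell your chaos tooling to halt injection and roll back."""
    raise NotImplementedError


def guardrail_loop(duration_seconds: int = 600):
    """Watch a key health metric while the experiment runs; abort on breach."""
    deadline = time.time() + duration_seconds
    while time.time() < deadline:
        if current_error_rate() > ERROR_RATE_ABORT_THRESHOLD:
            stop_experiment()
            print("Guardrail tripped: experiment aborted before it became an incident.")
            return
        time.sleep(CHECK_INTERVAL_SECONDS)
    print("Experiment completed within guardrails.")
```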
See What Chaos Can Teach You
From its simple origins at Netflix, chaos engineering has developed into a specialization of its own. It offers an entire ecosystem of techniques and theories and a broad set of tools, including Kubernetes-native offerings, open-source tools from Alibaba, and version-controllable, cross-platform tools like Cthulhu.
Well-designed chaos tests are invaluable to any SRE or incident response team. The tools and techniques of chaos engineering are rapidly becoming an industry standard for improving software reliability and evaluating incident response effectiveness.
Interested in improving the quality of your incident response? Take a look at how xMatters can help enhance your incident management practices.