Our lives have become deeply dependent on technology in little more than a decade. Modern systems are built at scale and operate in a decentralized manner. The Internet has become the main platform for many industries, yet it is still a young technology, and emerging and developed economies alike are still building the infrastructure and ecosystem these companies need to operate online. With scale comes complexity, and with complexity come many ways for these large-scale distributed systems to fail.
These outages often occur in complex, distributed systems where many things fail simultaneously, exacerbating the problem. Depending on the system architecture, finding and fixing the fault can take anywhere from a few minutes to over an hour. That delay costs the company not only revenue but also customer trust. It also highlights the difference between mean time to failure (MTTF) and mean time to recovery (MTTR).
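To make the distinction concrete, here is a minimal sketch using made-up incident data: MTTF averages the operating time between failures, while MTTR averages the time spent restoring service after each failure.

```python
from datetime import timedelta

# Hypothetical incident log: (uptime before the failure, time to repair)
incidents = [
    (timedelta(hours=400), timedelta(minutes=12)),
    (timedelta(hours=250), timedelta(minutes=45)),
    (timedelta(hours=610), timedelta(minutes=8)),
]

# MTTF: average operating time between failures
mttf = sum((up for up, _ in incidents), timedelta()) / len(incidents)

# MTTR: average time spent restoring service after a failure
mttr = sum((down for _, down in incidents), timedelta()) / len(incidents)

print(f"MTTF: {mttf}")  # average uptime between failures
print(f"MTTR: {mttr}")  # average time to restore service
```

Chaos engineering accepts that MTTF can never be infinite and instead works to drive MTTR down.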
Systems built on modern cloud-based technologies and microservices architectures have many dependencies, and errors usually occur unexpectedly. We cannot prevent every failure in a distributed system. We can, however, contain a failure's blast radius and optimize the system's recovery time. We achieve this by reproducing as many failure modes as possible in a test environment, building confidence in the system's reliability.
In addition, we should pay even closer attention to system data collected before, during, and after a failure. This data drives self-healing, and it also provides the raw material for later analysis and improvement.
Netflix’s way of dealing with these failures taught us a better approach and spawned a new discipline called “Chaos Engineering.”
The Need for Chaos Engineering and Chaos Testing
Chaos engineering is the discipline of experimenting on a software system in production to build confidence in the system’s ability to withstand turbulent and unexpected conditions.
Chaos testing means deliberately breaking a production system. It lets us measure mean time to recovery (MTTR) and work to minimize the time needed to recover from a disruption. By performing chaos testing, organizations can improve their applications’ resilience.
Cloud infrastructure can fail for many reasons. Below are the most common failure modes for distributed systems and microservices architectures deployed in the cloud.
- Outage of an entire region
- Power failures
- Data service failures
- Unexpected spikes in user traffic
- Natural disasters
- Cyber attacks such as DDoS
- Code injection
- Exhausted resources
- Hardware failures
However, we can contain the blast radius of these failures and optimize recovery time by reproducing as many of them as possible in controlled experiments, gaining confidence in the system’s reliability.
Principles of Chaos Engineering and Testing
Define the normal behavior of the system: The steady state is some measurable output, such as overall performance, error rate, or system latency, that indicates normal behavior. This normal state of the system serves as the baseline every experiment is measured against.
Build a hypothesis around the steady state: The hypothesis of each experiment must match the goal of the chaos test, and the system’s measurable output must be the prime focus. Measurements of this output over a short period serve as a proxy for the system’s steady state. By focusing on observable behavior during testing, chaos testing verifies that the system works, rather than how it works.
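A steady-state hypothesis can be expressed directly in code. The sketch below is hypothetical (the metric names and thresholds are invented for illustration): the system is considered healthy as long as each measurable output stays within its agreed bound.

```python
# Hypothetical steady-state definition: the system is "normal" when
# each measurable output satisfies its bound.
STEADY_STATE = {
    "error_rate":     lambda v: v < 0.01,  # fewer than 1% of requests fail
    "p99_latency_ms": lambda v: v < 250,   # 99th-percentile latency bound
    "rps":            lambda v: v > 100,   # sustained request throughput
}

def in_steady_state(metrics: dict) -> bool:
    """Return True if every observed metric satisfies its bound."""
    return all(check(metrics[name]) for name, check in STEADY_STATE.items())

# Example observation window (in practice this comes from monitoring):
observed = {"error_rate": 0.004, "p99_latency_ms": 180, "rps": 420}
print(in_steady_state(observed))  # True: the hypothesis holds
```

Encoding the hypothesis this way makes it executable: the same check runs before, during, and after fault injection.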
Run Experiments: Identify all possible failure scenarios for the infrastructure, conduct controlled failure tests, and ensure a backup plan for each failure test. If the recovery plan is unknown, determine the system recovery path and record the actions taken during the recovery.
Analyze the test results: Check whether the hypothesis held or whether the expected steady-state behavior of the system changed. It is the chaos engineer’s duty to keep the blast radius of each experiment contained. Finally, analyze whether the failure affected service continuity, user experience, and resilience to other failures.
Chaos Testing Process
Specifically, to deal with the uncertainty of distributed systems at scale, Chaos Engineering facilitates experiments designed to uncover systemic weaknesses.
These experiments follow four steps:
- Begin by defining “steady state” as some measurable output of the system that indicates normal behavior.
- Hypothesize that this steady state will continue in both the control group and the experimental group.
- Introduce variables that reflect real-world events such as server crashes, failed drives, and severed network connections.
- Try to disprove the hypothesis by looking for a difference in steady state between the control group and the experimental group.
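The four steps above can be sketched as a minimal experiment loop. Everything here is hypothetical scaffolding: in a real setup, fault injection and metric collection come from your own chaos tooling and monitoring, not from the stand-in function shown.

```python
import random

def measure_error_rate(inject_fault: bool) -> float:
    """Stand-in for real monitoring: returns the observed error rate for a
    control group (no fault) or an experimental group (fault injected).
    A resilient system absorbs the fault with only a small perturbation."""
    base = 0.005
    return base + (0.001 if inject_fault else 0.0) + random.uniform(0, 0.001)

def run_experiment(threshold: float = 0.01, trials: int = 20) -> bool:
    """Try to disprove the steady-state hypothesis: the experimental
    group should remain as healthy as the control group."""
    control = [measure_error_rate(inject_fault=False) for _ in range(trials)]
    experiment = [measure_error_rate(inject_fault=True) for _ in range(trials)]
    # The hypothesis survives if both groups stay under the threshold.
    return max(control) < threshold and max(experiment) < threshold

print("hypothesis holds:", run_experiment())
```

If `run_experiment` returns False, the difference between the two groups is the weakness to fix before it manifests in production.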
The harder it is to disrupt the steady state, the more confidence we have in the system’s reliability. If a weakness is revealed, we now have a target to fix before that behavior becomes system-wide.
Tools for Chaos Engineering and Chaos Testing
Chaos Monkey is the earliest chaos engineering tool. It tests the resilience of cloud systems by intentionally creating failures and observing how those systems react. Netflix built it to test the durability and stability of its AWS infrastructure. The name comes from the image of a wild, weapon-wielding monkey set loose in a data center to trigger failures. Chaos Monkey and the related Simian Army fault injectors focus on terminating virtual machine instances and reproducing unforeseen production events.
Chaos Mesh is an open-source tool that fits the development workflow and integrates easily into Kubernetes infrastructure without changing the deployment logic. It is a chaos engineering solution that injects faults into every layer of a Kubernetes system, including pods, networks, file system I/O, and the kernel. Chaos Mesh can interfere with pod-to-pod communication and simulate read/write errors. Experiments are configured using YAML files, and a dashboard is included for tracking them. The API makes it easy to version, manage, and automate chaos experiments, and the tool works with all major cloud platforms. The Cloud Native Computing Foundation recently accepted Chaos Mesh into its sandbox program, and the tool has helped teams build more resilient applications.
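As a sketch of that YAML-based configuration, a Chaos Mesh PodChaos experiment that kills a single pod matching a label might look like the following (the experiment name, namespaces, and labels are hypothetical; check the Chaos Mesh documentation for the exact schema of your version):

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pod-kill-example        # hypothetical experiment name
  namespace: chaos-testing
spec:
  action: pod-kill              # terminate the selected pod(s)
  mode: one                     # affect a single matching pod
  selector:
    namespaces:
      - default
    labelSelectors:
      app: my-service           # hypothetical target label
```

Because the experiment is just a Kubernetes custom resource, it can be applied with `kubectl apply -f`, versioned in Git, and cleaned up like any other manifest.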
Gremlin provides a platform for conducting chaos experiments safely, simply, and securely. It is offered as software-as-a-service and is used to test system resilience under various attack modes. Gremlin can automatically detect infrastructure components and recommend tests that surface common failure modes, and it can automatically halt experiments if systems become unstable. Many failure scenarios are available, such as CPU attacks or system stress. Gremlin integrates with Kubernetes, AWS, Azure, Google Cloud, and even bare-metal infrastructure.
Litmus is an open-source platform that follows the principles of cloud-native chaos engineering. It detects system weaknesses through chaos tests and controlled experiments, and it aims to provide a complete framework for discovering vulnerabilities in Kubernetes systems and the applications running on them. Its chaos operator and CRDs (CustomResourceDefinitions) enable plug-and-play functionality: you package your chaos logic into a Docker image, drop it into the framework, and drive it with CRDs.
Chaos Toolkit is an open-source, straightforward tool for automating chaos engineering experiments. It ships with a set of standard chaos experiments, good introductory documentation, and support for the major cloud providers. Chaos Toolkit exposes a declarative API that makes it easy to keep chaos experiments in a version control system in a form a CI/CD system can automate. It includes drivers for Kubernetes, Prometheus, AWS, Google, Slack, Azure, and other chaos engineering tools such as Gremlin.
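A declarative Chaos Toolkit experiment pairs a steady-state hypothesis with a fault-injecting method. The sketch below is illustrative only (the health URL, pod label, and titles are hypothetical; consult the Chaos Toolkit documentation for the full experiment schema):

```json
{
  "version": "1.0.0",
  "title": "Service survives losing a pod",
  "description": "Hypothetical experiment: kill one pod, expect health to hold.",
  "steady-state-hypothesis": {
    "title": "Application responds",
    "probes": [
      {
        "type": "probe",
        "name": "health-endpoint-responds",
        "tolerance": 200,
        "provider": {
          "type": "http",
          "url": "http://localhost:8080/health"
        }
      }
    ]
  },
  "method": [
    {
      "type": "action",
      "name": "kill-one-app-pod",
      "provider": {
        "type": "process",
        "path": "kubectl",
        "arguments": ["delete", "pod", "-l", "app=my-service", "--wait=false"]
      }
    }
  ]
}
```

Running `chaos run experiment.json` checks the hypothesis before and after the method executes, so the experiment file doubles as living documentation of what the system is expected to survive.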
Benefits of Chaos Testing
Prepare for the unexpected: Chaos engineering lets you test your system for potential failures and use the test data to harden it against those failures.
Reveal the unknowns: Chaos testing helps you understand system behavior in the event of failure and shows the path to recovering affected subsystems.
Reduced system downtime: A main advantage of chaos testing is that you can quickly investigate common and recurring failures by injecting faults deliberately. This helps harden the system against known bugs.
Better customer satisfaction: Chaos engineering prevents service interruptions by detecting weaknesses at an early stage, which in turn improves the user experience.
Chaos Engineering is a mature and powerful practice already transforming software design and engineering in some of the world’s largest operations. Chaos testing deliberately, but randomly, introduces failures into your production system. The aim is to test the resilience of the system and its environment and to determine the mean time to recovery (MTTR). Implementing chaos testing will help improve your MTTR and increase organizational confidence in the strength of your production environment.