Considering Best Practices for Chaos Engineering on Google Cloud - Tool Selection
This is Ohara from the Service Reliability Group (SRG) of the Media Division.
The Service Reliability Group (SRG) primarily provides comprehensive support for the infrastructure surrounding our media services, focusing on improving existing services, launching new ones, and contributing to open-source software (OSS).
This article introduces several tools for performing chaos engineering on Google Cloud.
- Introduction
- Anticipated failures and considerations for selecting chaos engineering tools
- Chaos Toolkit
- Chaos Mesh
- LitmusChaos
- Which tool should I choose?
- Summary
- In conclusion
Introduction
Currently, SRG is working on a mission to develop and implement chaos engineering solutions for its internal products.
The title is limited to Google Cloud because, while AWS offers a managed chaos engineering service (Fault Injection Service) and Azure offers Chaos Studio, Google Cloud currently has no equivalent managed service.
This article summarizes the tools that can be used on Google Cloud and what each of them offers.
Anticipated failures and considerations for selecting chaos engineering tools
Before we delve into comparing specific tools, we first need to clarify what kinds of problems we anticipate in our product and which components we will focus on in our experiments.
For example, the following scenarios are possible:
- LB failure
- Pod failures within VMs and GKE clusters
  - Network latency and loss
  - Pod abnormal termination
  - CPU and memory resource exhaustion
  - Disk abnormality
- Specific region/zone outage
- Managed service outage
- DB failure
  - Slow queries
  - Disk abnormality
  - Network anomalies (such as replication delays or errors)
  - Failover
These are, roughly, the single-failure scenarios we anticipate. Scenarios that combine multiple failures are also possible, but to start small we will not consider complex ones for now.
Next, we will organize the requirements for the tool.
Since the system configuration and other factors differ depending on the product you're responsible for, consider your environment accordingly.
- Supported components
  - Does it support the target environment, such as GKE, GCE, LB, DB, etc.?
- How to manage fault injection scenarios
  - Can experiments be defined in a way that is easy for the team to manage, such as using YAML/JSON files or a web UI?
- Operability
  - How easy is it to implement, deploy, and manage, and what is the learning curve?
Based on these considerations, we narrowed the candidates down to three tools.
- Chaos Toolkit
- Chaos Mesh
- LitmusChaos
Each of these will be explained below.
Chaos Toolkit
Chaos Toolkit is run from the command line interface (CLI), so no additional resources need to be deployed in the environment under test.
The experiment details are defined declaratively in a JSON file.
It offers extensions for various cloud services, enabling a wide range of fault injection. Predefined actions come with a rollback function, so you won't forget to revert settings. It covers components that Kubernetes-native chaos engineering tools do not, such as fault injection into load balancers, Cloud SQL, and GCS.
You can define your own Python and shell scripts, as well as other executable code, within the experiment file, allowing for flexible management of the experiment. For example, you can define and execute processes to dynamically retrieve values within the experiment file, and manage them like templates by using JSON variables.
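As a rough illustration of the points above, the following is a minimal sketch that builds a Chaos Toolkit experiment as a Python dict and prints it as JSON. The overall layout (configuration, steady-state-hypothesis, method, rollbacks) and the `${...}` variable substitution follow the Chaos Toolkit experiment format, but the module `my_team_chaos.faults`, its function `inject_fault`, the script `./scripts/revert_fault.sh`, the `/healthz` path, and the `GCP_PROJECT_ID` variable are all hypothetical names chosen for this example, not something taken from our actual experiments.

```python
import json


def build_experiment(service_url: str) -> dict:
    """Return a Chaos Toolkit experiment as a plain dict (serializable to JSON)."""
    return {
        "title": "Service stays healthy while a fault is injected",
        "description": "Sketch only; module, function and script names are hypothetical.",
        # Values defined here can be referenced elsewhere in the file as ${name}.
        "configuration": {
            "service_url": service_url,
            # Resolved from an environment variable when the experiment runs.
            "gcp_project_id": {"type": "env", "key": "GCP_PROJECT_ID"},
        },
        "steady-state-hypothesis": {
            "title": "Health endpoint returns 200",
            "probes": [
                {
                    "type": "probe",
                    "name": "health-check",
                    "tolerance": 200,  # expected HTTP status code
                    "provider": {"type": "http", "url": "${service_url}/healthz"},
                }
            ],
        },
        "method": [
            {
                "type": "action",
                "name": "inject-fault",
                # A Python provider calls a function from an importable module.
                "provider": {
                    "type": "python",
                    "module": "my_team_chaos.faults",  # hypothetical module
                    "func": "inject_fault",  # hypothetical function
                    "arguments": {"project_id": "${gcp_project_id}"},
                },
            }
        ],
        "rollbacks": [
            {
                "type": "action",
                "name": "revert-fault",
                # A process provider runs an executable such as a shell script.
                "provider": {
                    "type": "process",
                    "path": "./scripts/revert_fault.sh",  # hypothetical script
                    "arguments": "${gcp_project_id}",
                },
            }
        ],
    }


if __name__ == "__main__":
    print(json.dumps(build_experiment("https://example.internal"), indent=2))
```

Saving the output as experiment.json and running `chaos run experiment.json` would execute it, provided the hypothetical module and script actually exist in your environment.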
Since a web UI is not provided, you will need to devise your own way of managing experiments and visualizing results.
Please note that while there are extensions for Kubernetes fault injection, they require the installation of Chaos Mesh.
Chaos Mesh
Chaos Mesh is a well-known Kubernetes-native chaos engineering management platform.
Designed for use on Kubernetes, it can easily simulate container environment failures such as Pod shutdowns, network delays and disconnections, and I/O delays.
The Operator can be installed easily with Helm, and each experiment type is defined and applied as a custom resource. To manage experiment YAML files as templates, consider using Helm or Kustomize.
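To give a feel for what these custom resources look like, here is a minimal sketch that generates two Chaos Mesh manifests (a NetworkChaos that adds latency and a PodChaos that kills one Pod) from Python functions. The resource kinds and fields follow the `chaos-mesh.org/v1alpha1` API as I understand it; the `app` label key, the resource names, and the default values are assumptions for illustration. The output is JSON, which kubectl accepts in place of YAML.

```python
import json


def network_delay_chaos(name: str, namespace: str, app_label: str,
                        latency: str = "200ms", duration: str = "60s") -> dict:
    """NetworkChaos that delays traffic of all Pods labeled app=<app_label>."""
    return {
        "apiVersion": "chaos-mesh.org/v1alpha1",
        "kind": "NetworkChaos",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "action": "delay",
            "mode": "all",  # apply to every Pod matched by the selector
            "selector": {
                "namespaces": [namespace],
                "labelSelectors": {"app": app_label},  # label key is an assumption
            },
            "delay": {"latency": latency},
            "duration": duration,
        },
    }


def pod_kill_chaos(name: str, namespace: str, app_label: str) -> dict:
    """PodChaos that kills one randomly chosen Pod labeled app=<app_label>."""
    return {
        "apiVersion": "chaos-mesh.org/v1alpha1",
        "kind": "PodChaos",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "action": "pod-kill",
            "mode": "one",  # pick a single matching Pod
            "selector": {
                "namespaces": [namespace],
                "labelSelectors": {"app": app_label},
            },
        },
    }


if __name__ == "__main__":
    # Hypothetical names; pipe the output to `kubectl apply -f -` to run it.
    print(json.dumps(network_delay_chaos("web-delay", "staging", "web"), indent=2))
```

Generating manifests from a function is only one way to parameterize them; Helm or Kustomize achieve the same goal while keeping the files in YAML.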
The Chaos Dashboard allows you to create, run, monitor, and manage chaos experiments with intuitive controls.
It is a strong choice if you want a simple yet powerful way to get started with chaos engineering in a GKE environment.
It can also be driven from the Chaos Toolkit described above, which is one option for managing fault injection into Kubernetes.
LitmusChaos
LitmusChaos is similarly a Kubernetes-native chaos engineering management platform.
Basic fault injection functionality for Kubernetes is provided, equivalent to that of Chaos Mesh. In addition, a wide variety of chaos experiment templates created by the community are registered on the public repository called ChaosHub. You can use these to start experiments immediately.
LitmusChaos can also be installed with Helm, but it has more management components and is more complex than Chaos Mesh.
It consists of two kinds of components: a Control Plane and an Execution Plane. A single Control Plane can manage multiple clusters under test. The Control Plane provides a Web UI for managing workflow definitions (scenarios combining multiple experiment types), as well as features such as history and target cluster management. The Execution Plane is installed on the cluster where the chaos experiment actually takes place and is responsible for fault injection processing.
Experiments can be defined as Argo Workflows. Experiments can be executed sequentially or in parallel, and parameters can be specified externally using argo submit for template management. Furthermore, deployment of a Control Plane is not required; workflow execution is possible with only an Execution Plane, allowing you to start small.
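The sequential and parallel execution described above can be sketched as follows. This is a simplified illustration, not the workflow LitmusChaos itself generates: it builds an Argo Workflow whose steps create ChaosEngine resources through Argo `resource` templates. In Argo `steps`, the outer list runs in order and the entries of each inner list run in parallel. The parameter names `appNamespace` and `appLabel`, the engine names, and the `litmus-admin` service account are assumptions. Note also that a plain `resource` step finishes as soon as the ChaosEngine is created, so a real workflow would need an additional mechanism to wait for each experiment to complete.

```python
import json


def chaos_engine_manifest(name: str, experiment: str) -> str:
    """Return a ChaosEngine manifest (JSON is valid YAML) with Argo parameters."""
    engine = {
        "apiVersion": "litmuschaos.io/v1alpha1",
        "kind": "ChaosEngine",
        "metadata": {
            "name": name,
            "namespace": "{{workflow.parameters.appNamespace}}",
        },
        "spec": {
            "engineState": "active",
            "appinfo": {
                "appns": "{{workflow.parameters.appNamespace}}",
                "applabel": "{{workflow.parameters.appLabel}}",
                "appkind": "deployment",
            },
            "chaosServiceAccount": "litmus-admin",  # assumed service account
            "experiments": [{"name": experiment}],
        },
    }
    return json.dumps(engine, indent=2)


def build_workflow() -> dict:
    experiments = ["pod-delete", "pod-cpu-hog", "pod-memory-hog"]
    # One resource template per experiment; each creates a ChaosEngine.
    engine_templates = [
        {
            "name": exp,
            "resource": {
                "action": "create",
                "manifest": chaos_engine_manifest(f"{exp}-engine", exp),
            },
        }
        for exp in experiments
    ]
    entry = {
        "name": "chaos-scenario",
        "steps": [
            # Outer list: groups run one after another.
            [{"name": "kill-pod", "template": "pod-delete"}],
            # Inner list: steps in the same group run in parallel.
            [
                {"name": "cpu", "template": "pod-cpu-hog"},
                {"name": "memory", "template": "pod-memory-hog"},
            ],
        ],
    }
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Workflow",
        "metadata": {"generateName": "chaos-scenario-"},
        "spec": {
            "entrypoint": "chaos-scenario",
            "arguments": {
                "parameters": [
                    {"name": "appNamespace", "value": "default"},
                    {"name": "appLabel", "value": "app=web"},
                ]
            },
            "templates": [entry, *engine_templates],
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_workflow(), indent=2))
```

With the output saved as workflow.json, something like `argo submit workflow.json -p appNamespace=staging -p appLabel=app=web` would override the default parameters at submission time.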
It also supports GitOps, so I think it will be a good fit for teams that adopt a Kubernetes-native development style.
Which tool should I choose?
Of the three options, I decided to use Chaos Toolkit and LitmusChaos.
Here's a summary of my impressions of each tool:
| Tool name | Adopted | Reason | Suitable environments and use cases |
|---|---|---|---|
| Chaos Toolkit | 🆗 | Fault injection into the load balancer is possible. | When you want to include environments other than VMs and Kubernetes. |
| Chaos Mesh | 🆖 | Cost increases when scaling because it is deployed as a DaemonSet. | When you want to conduct a variety of experiments in a GKE environment with simple and intuitive operation. |
| LitmusChaos | 🆗 | It can be started on a small scale and managed on a large scale, and it is feature-rich. | When you primarily use a GKE environment and need to manage multiple clusters or conduct complex experiments. |
Furthermore, we found it difficult to use these tools to inject failures directly into the managed services and databases listed among the failure scenarios. We therefore decided to substitute experiments that inject network connectivity failures and increased latency from the application Pods on Kubernetes.
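One way such a substitute experiment could look is sketched below: a LitmusChaos ChaosEngine that targets the application Pods and restricts the network fault to the database's IP addresses. The experiment names (`pod-network-latency`, `pod-network-loss`) and the environment variables are ones LitmusChaos provides as far as I know; the namespace, label, IP address, default values, and the `litmus-admin` service account are assumptions for illustration, not our actual configuration.

```python
import json


def db_network_fault_engine(app_namespace: str, app_label: str, db_ips: list,
                            fault: str = "latency", latency_ms: int = 2000,
                            duration_s: int = 60) -> dict:
    """ChaosEngine that degrades traffic from app Pods to the given DB IPs."""
    env = [
        {"name": "TOTAL_CHAOS_DURATION", "value": str(duration_s)},
        # Limit the fault to traffic destined for the database.
        {"name": "DESTINATION_IPS", "value": ",".join(db_ips)},
    ]
    if fault == "latency":
        experiment = "pod-network-latency"
        env.append({"name": "NETWORK_LATENCY", "value": str(latency_ms)})
    elif fault == "loss":
        # 100% packet loss approximates a connectivity failure.
        experiment = "pod-network-loss"
        env.append({"name": "NETWORK_PACKET_LOSS_PERCENTAGE", "value": "100"})
    else:
        raise ValueError(f"unsupported fault: {fault}")

    return {
        "apiVersion": "litmuschaos.io/v1alpha1",
        "kind": "ChaosEngine",
        "metadata": {"name": f"db-{fault}", "namespace": app_namespace},
        "spec": {
            "engineState": "active",
            "appinfo": {
                "appns": app_namespace,
                "applabel": app_label,
                "appkind": "deployment",
            },
            "chaosServiceAccount": "litmus-admin",  # assumed service account
            "experiments": [
                {"name": experiment, "spec": {"components": {"env": env}}}
            ],
        },
    }


if __name__ == "__main__":
    # Hypothetical namespace, label and database IP address.
    engine = db_network_fault_engine("staging", "app=web", ["10.0.0.5"])
    print(json.dumps(engine, indent=2))
```

Because the fault is applied on the application side, this approximates a slow or unreachable database as the application sees it, but it does not reproduce database-internal behavior such as failover or replication delay.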
Summary
The best tool depends on the architecture of the target system and what you want to achieve through chaos engineering.
It's best to start with small-scale experiments to find the tools and operating methods that are best suited to your product.
In conclusion
This article introduced tools for implementing chaos engineering on Google Cloud.
Next time, I'll write about how to actually perform chaos testing using tools, and the problems I've encountered with those tools.
If you are interested in SRG, please contact us here.
