Considering Chaos Engineering Best Practices on Google Cloud - Tool Selection

My name is Ohara, and I work in the Service Reliability Group (SRG) of the Media Headquarters.
SRG (Service Reliability Group) mainly provides cross-functional infrastructure support for our media services, improves existing services, helps launch new ones, and contributes to OSS.
This article introduces some tools for implementing chaos engineering on Google Cloud.
 

Introduction


Currently, SRG is working on a mission to develop and implement chaos engineering solutions into its internal products.
The reason the title limits the scope to Google Cloud is that while AWS offers a managed chaos engineering service (Fault Injection Service) and Azure offers Chaos Studio, Google Cloud currently has no equivalent.
This article summarizes which tools are available on Google Cloud and what each one offers.
 

Anticipated failures and considerations for selecting chaos engineering tools


Before comparing specific tools, we first need to clarify what kind of problems we anticipate in our product and which components we will test.
For example, consider the following scenario:
  • Load balancer (LB) failure
  • VM or Pod failure in a GKE cluster
    • Network delays and packet loss
    • Abnormal Pod termination
    • CPU and memory exhaustion
    • Disk failures
  • Failure of a specific region or zone
  • Managed service outage
  • Database failure
    • Slow queries
    • Disk failures
    • Network issues (replication lag, errors, etc.)
    • Failover
This is a rough list of individual failures. Scenarios that combine multiple failures also exist, but we will start small and leave complex cases aside for now.
 
Next, we will organize the requirements for the tool.
System configurations differ depending on the product you are responsible for, so evaluate these requirements against your own environment.
  • Supported components
    • Does it support the target environments, such as GKE, GCE, LB, or DB?
  • How to manage fault injection scenarios
    • Can you define experiments in a way that is easy for your team to operate, such as YAML/JSON files or a web UI?
  • Operability
    • How easy is it to implement, deploy, and manage, and what is the learning curve?
 
Based on these considerations, we have selected three tools as candidates.
  • Chaos Toolkit
  • Chaos Mesh
  • LitmusChaos
Each is explained below.
 

Chaos Toolkit

Chaos Toolkit runs via a CLI and does not require any additional resources in the environment you are experimenting with.
Experiment content is declaratively defined in a JSON format file.
Extensions for various cloud services enable a wide range of fault injections, and predefined actions come with a rollback function, so you won't forget to revert settings. It also covers components that Kubernetes-native chaos engineering tools cannot reach, such as fault injection into LBs, Cloud SQL, and GCS.
You can embed your own Python scripts, shell scripts, and other executable code in the experiment file, giving you flexible control over what an experiment does. For example, you can define steps that fetch values dynamically at run time, and you can extract them into JSON variables to manage experiments like templates.
Since no web UI is provided, some ingenuity is required to manage experiments and visualize results.
Note that while there is an extension for Kubernetes fault injection, it requires Chaos Mesh to be installed.
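As a sketch, a minimal experiment file might combine a steady-state probe with a fault-injecting action. The endpoint, instance name, and zone below are hypothetical placeholders:

```json
{
  "version": "1.0.0",
  "title": "Backend stays healthy while one VM restarts",
  "description": "Hypothetical example; the URL and instance names are placeholders.",
  "steady-state-hypothesis": {
    "title": "Service responds with 200",
    "probes": [
      {
        "type": "probe",
        "name": "service-is-up",
        "tolerance": 200,
        "provider": {
          "type": "http",
          "url": "https://example.com/healthz"
        }
      }
    ]
  },
  "method": [
    {
      "type": "action",
      "name": "restart-one-instance",
      "provider": {
        "type": "process",
        "path": "gcloud",
        "arguments": "compute instances reset my-instance --zone=asia-northeast1-a"
      }
    }
  ]
}
```

Running `chaos run experiment.json` verifies the steady-state hypothesis, executes the method, then verifies the hypothesis again to judge whether the system tolerated the fault.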
 

Chaos Mesh

Chaos Mesh is a well-known Kubernetes-native chaos engineering management platform.
It is designed for use on Kubernetes and can easily simulate failures in a container environment, such as pod outages, network delays/disconnections, and I/O delays.
It's easy to deploy: install the Operator with Helm, then define and apply a custom resource for each experiment type. Consider managing experiment YAML templates with Helm or Kustomize.
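For instance, a pod-kill experiment can be expressed as a single custom resource. The namespace and label below are placeholders:

```yaml
# Hypothetical PodChaos resource: kills one Pod matching the label
# selector each time the experiment is applied.
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pod-kill-example
  namespace: chaos-mesh
spec:
  action: pod-kill
  mode: one            # target a single random Pod from the selection
  selector:
    namespaces:
      - my-app         # placeholder namespace
    labelSelectors:
      app: my-service  # placeholder label
```

Applying this with `kubectl apply -f` starts the experiment, and deleting the resource stops it.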
The Chaos Dashboard provides an intuitive way to create, run, monitor, and manage chaos experiments.
It's the perfect choice if you want an easy yet powerful start to chaos engineering in a GKE environment. Simple is best.
It is also the component the aforementioned Chaos Toolkit relies on for Kubernetes fault injection.
 

LitmusChaos

Similarly, LitmusChaos is a Kubernetes-native chaos engineering management platform.
The basic Kubernetes fault injection functionality is equivalent to that of Chaos Mesh. In addition, a wide variety of chaos experiment templates created by the community are registered in a public repository called ChaosHub. You can use these to start experiments right away.
LitmusChaos can also be installed via Helm, but it has more management components and is more complex than Chaos Mesh.
There are two types of planes: a Control Plane and an Execution Plane. One Control Plane can manage multiple clusters to be tested. The Control Plane provides a Web UI for managing workflow definitions (scenarios that combine multiple experiment types), as well as features such as history and target cluster management. The Execution Plane is installed on the cluster where the chaos experiment actually takes place, and is responsible for fault injection processing.
Experiments can be defined as Argo Workflows and run sequentially or in parallel, and you can manage them as templates by passing parameters externally with `argo submit`. Furthermore, deploying the Control Plane is optional; workflows can run with only the Execution Plane, so you can start small.
It also supports GitOps, so I think it will be easy to use for teams that adopt a Kubernetes-native development style.
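As a sketch, the basic unit of fault injection is a ChaosEngine custom resource that references an experiment such as the community pod-delete template from ChaosHub. The namespace, label, and service account below are placeholders:

```yaml
# Hypothetical ChaosEngine: runs the pod-delete experiment against
# Pods of a target Deployment for 30 seconds.
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: pod-delete-example
  namespace: my-app            # placeholder namespace
spec:
  appinfo:
    appns: my-app
    applabel: app=my-service   # placeholder label
    appkind: deployment
  chaosServiceAccount: litmus-admin
  engineState: active
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "30"      # seconds of chaos
            - name: CHAOS_INTERVAL
              value: "10"      # seconds between deletions
```

A workflow then chains one or more such engines together, optionally with probes that validate the steady state between steps.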
 

Which tool to choose?


Of the three options, we decided to use Chaos Toolkit and LitmusChaos.
Here's a summary of my thoughts on each tool:
| Tool | Adopted | Reason | Suitable environments and use cases |
| --- | --- | --- | --- |
| Chaos Toolkit | 🆗 | Can inject faults into LBs | When you want to cover environments beyond VMs and Kubernetes |
| Chaos Mesh | 🆖 | DaemonSet deployment increases cost at scale | Primarily GKE, when you want varied experiments with simple, intuitive operation |
| LitmusChaos | 🆗 | Lets you start small and manage at scale; feature-rich | Primarily GKE, when you want to manage multiple clusters or run complex experiments |
We also found it difficult to inject faults directly into the managed services and databases listed in our failure scenarios. We therefore decided to approximate these with experiments such as network connectivity loss and latency degradation from application Pods on Kubernetes.
 

Summary


The best tool to use depends on the architecture of your system and what you want to achieve through chaos engineering.
It's a good idea to start with small experiments and find the tools and practices that work best for your product.
 

Conclusion


We introduced tools for implementing chaos engineering on Google Cloud.
Next time, I'll write about how to actually run chaos experiments with these tools and the problems I ran into along the way.
 
If you are interested in SRG, please contact us here.