Considering Best Practices for Chaos Engineering on Google Cloud - Tool Selection

2025/8/6 19:46 | 2025/8/6 19:48

This is Ohara from the Service Reliability Group (SRG) of the Media Division.

#SRG (the Service Reliability Group) provides comprehensive support for the infrastructure behind our media services: improving existing services, launching new ones, and contributing to open-source software (OSS).

This article introduces several tools for performing chaos engineering on Google Cloud.

Introduction / Anticipated failures and considerations for selecting chaos engineering tools / Chaos Toolkit / Chaos Mesh / LitmusChaos / Which tool should I choose? / Summary / In conclusion

Introduction

Currently, SRG is working on a mission to develop and implement chaos engineering solutions for its internal products.

The title is limited to Google Cloud because, while AWS offers a managed chaos engineering service (Fault Injection Service) and Azure offers Chaos Studio, Google Cloud does not currently offer an equivalent.

This article summarizes the tools that can be used on Google Cloud and what each of them offers.

Anticipated failures and considerations for selecting chaos engineering tools

Before we delve into comparing specific tools, we first need to clarify what kinds of problems we anticipate in our product and which components we will focus on in our experiments.

For example, the following scenarios are possible:

LB failure

Pod failures within VMs and GKE clusters

  Network latency and loss
  Pod abnormal termination
  CPU and memory resource exhaustion
  Disk abnormality

Specific region/zone outage

Managed service outage

DB failure

  Slow queries
  Disk abnormality
  Network anomalies (such as replication delays or errors)
  Failover

These are, roughly, the single-failure scenarios. Combined failures are also conceivable, but to start small we are setting the complex ones aside for now.

Next, we will organize the requirements for the tool.

Since the system configuration and other factors differ depending on the product you're responsible for, consider your environment accordingly.

Supported components

Does it support the target environment, such as GKE, GCE, LB, DB, etc.?

How to manage fault injection scenarios

Can experiments be defined in a way that is easy for the team to manage, such as using YAML/JSON files or a web UI?

Operability

How easy is it to implement, deploy, and manage, and what is the learning curve?

Based on these considerations, we narrowed the candidates down to three tools.

Chaos Toolkit

Chaos Mesh

LitmusChaos

Each of these will be explained below.

Chaos Toolkit

Chaos Toolkit runs from the command line (CLI), so nothing extra has to be deployed into the environment under test.

The experiment details are defined declaratively in a JSON file.

It offers extensions for various cloud services, which allows a wide range of fault injection. Its predefined actions come with a rollback function, which helps you avoid forgetting to revert settings. It also covers components that Kubernetes-native chaos engineering tools cannot reach, such as load balancers, Cloud SQL, and GCS.

You can define your own Python and shell scripts, as well as other executable code, within the experiment file, allowing for flexible management of the experiment. For example, you can define and execute processes to dynamically retrieve values within the experiment file, and manage them like templates by using JSON variables.
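As a rough illustration of the declarative format and of template-style variables, the sketch below builds an experiment in Python and serializes it to JSON. The probe and action names, the `SERVICE_URL` environment variable, the health URL, and the `./inject_fault.sh` and `./revert_fault.sh` scripts are all hypothetical; only the top-level layout (`configuration`, `steady-state-hypothesis`, `method`, `rollbacks`) and the `${...}` substitution are meant to follow the Chaos Toolkit experiment format.

```python
import json


def build_experiment(default_url: str) -> dict:
    """Return a Chaos Toolkit style experiment as a plain dict (illustrative only)."""
    return {
        "title": "Service stays healthy while a fault is injected",
        "description": "Hypothetical experiment used to show the file layout.",
        "configuration": {
            # Read from the SERVICE_URL environment variable at run time,
            # falling back to the default passed in here.
            "service_url": {
                "type": "env",
                "key": "SERVICE_URL",
                "default": default_url,
            }
        },
        "steady-state-hypothesis": {
            "title": "Health endpoint answers with HTTP 200",
            "probes": [
                {
                    "type": "probe",
                    "name": "health-endpoint-responds",
                    "tolerance": 200,
                    "provider": {
                        "type": "http",
                        "url": "${service_url}",  # substituted from configuration
                        "timeout": 3,
                    },
                }
            ],
        },
        "method": [
            {
                "type": "action",
                "name": "inject-fault",
                # Hypothetical user-supplied shell script.
                "provider": {"type": "process", "path": "./inject_fault.sh"},
            }
        ],
        "rollbacks": [
            {
                "type": "action",
                "name": "revert-fault",
                # Hypothetical user-supplied shell script.
                "provider": {"type": "process", "path": "./revert_fault.sh"},
            }
        ],
    }


if __name__ == "__main__":
    print(json.dumps(build_experiment("http://localhost:8080/health"), indent=2))
```

Saving the output as `experiment.json` and running `chaos run experiment.json` would execute it, assuming the two scripts exist and behave as their names suggest.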

Since no web UI is provided, you need to work out your own way of managing experiments and visualizing results.

Please note that while there are extensions for Kubernetes fault injection, they require the installation of Chaos Mesh.

Concepts - Chaos Toolkit - The chaos engineering toolkit for developers

https://chaostoolkit.org/reference/concepts/

Chaos Mesh

Chaos Mesh is a well-known Kubernetes-native chaos engineering platform.

Designed for use on Kubernetes, it can easily simulate container environment failures such as Pod shutdowns, network delays and disconnections, and I/O delays.

The Operator is easy to install with Helm, and each experiment type is defined and applied as a custom resource. To manage experiment YAML files as templates, consider using Helm or Kustomize.
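To give a feel for what such custom resources look like, here is a minimal sketch that generates a `PodChaos` and a `NetworkChaos` manifest from parameters. The Python functions merely stand in for the Helm or Kustomize templating mentioned above, and the resource names, namespace, and `app` label are hypothetical. The manifests are emitted as JSON, which `kubectl apply -f` accepts alongside YAML.

```python
import json


def _selector(namespace: str, app_label: str) -> dict:
    """Select Pods by namespace and by a (hypothetical) `app` label."""
    return {"namespaces": [namespace], "labelSelectors": {"app": app_label}}


def pod_kill(name: str, namespace: str, app_label: str) -> dict:
    """PodChaos manifest that kills one randomly chosen matching Pod."""
    return {
        "apiVersion": "chaos-mesh.org/v1alpha1",
        "kind": "PodChaos",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "action": "pod-kill",
            "mode": "one",
            "selector": _selector(namespace, app_label),
        },
    }


def network_delay(name: str, namespace: str, app_label: str,
                  latency: str = "100ms", duration: str = "60s") -> dict:
    """NetworkChaos manifest that delays traffic of all matching Pods."""
    return {
        "apiVersion": "chaos-mesh.org/v1alpha1",
        "kind": "NetworkChaos",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "action": "delay",
            "mode": "all",
            "selector": _selector(namespace, app_label),
            "delay": {"latency": latency},
            "duration": duration,
        },
    }


if __name__ == "__main__":
    # One JSON document per experiment; each can be saved to its own file.
    print(json.dumps(pod_kill("kill-one-demo-pod", "demo", "demo"), indent=2))
    print(json.dumps(network_delay("delay-demo", "demo", "demo"), indent=2))
```

Field names follow the `chaos-mesh.org/v1alpha1` API as far as I know it; check them against the CRDs of the Chaos Mesh version you install.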

The Chaos Dashboard allows you to create, run, monitor, and manage chaos experiments with intuitive controls.

It is a strong choice if you want a simple yet powerful way to get started with chaos engineering in a GKE environment. Simplicity is its main appeal.

It can also be driven from the Chaos Toolkit described above, which gives another way to manage fault injection into Kubernetes.

Chaos Mesh Overview | Chaos Mesh

https://chaos-mesh.org/docs/#architecture-overview

LitmusChaos

LitmusChaos is likewise a Kubernetes-native chaos engineering platform.

It provides basic Kubernetes fault injection on par with Chaos Mesh. In addition, a public repository called ChaosHub hosts a wide variety of community-created chaos experiment templates, which you can use to start experimenting immediately.

LitmusChaos can also be installed with Helm, but it has more management components and is more complex than Chaos Mesh.

Architecture summary | Litmus Docs

https://docs.litmuschaos.io/docs/architecture/architecture-summary

The architecture has two parts: the Control Plane and the Execution Plane. A single Control Plane can manage multiple clusters under test. The Control Plane provides a Web UI for managing workflow definitions (scenarios combining multiple experiment types), as well as features such as history and target cluster management. The Execution Plane is installed on the cluster where the chaos experiment actually takes place and is responsible for fault injection processing.

Experiments can be defined as Argo Workflows. Experiments can be executed sequentially or in parallel, and parameters can be specified externally using argo submit for template management. Furthermore, deployment of a Control Plane is not required; workflow execution is possible with only an Execution Plane, allowing you to start small.
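The sequential and parallel execution described above maps onto the `steps` field of an Argo Workflow: the outer list runs in order and each inner list runs in parallel. The sketch below only shows that structure and a workflow-level parameter. The `run-experiment` template is a placeholder that echoes its inputs instead of applying a real chaos resource, and the `appNamespace` parameter and `chaos-` name prefix are assumptions.

```python
import json


def chaos_workflow(stages: list[list[str]]) -> dict:
    """Build an Argo Workflow whose stages run in order and whose
    experiments inside one stage run in parallel."""
    steps = [
        [
            {
                "name": experiment,
                "template": "run-experiment",
                "arguments": {
                    "parameters": [{"name": "experiment", "value": experiment}]
                },
            }
            for experiment in stage
        ]
        for stage in stages
    ]
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Workflow",
        "metadata": {"generateName": "chaos-"},
        "spec": {
            "entrypoint": "main",
            # Default value; can be overridden when the workflow is submitted.
            "arguments": {
                "parameters": [{"name": "appNamespace", "value": "default"}]
            },
            "templates": [
                {"name": "main", "steps": steps},
                {
                    # Placeholder: a real template would apply a chaos resource.
                    "name": "run-experiment",
                    "inputs": {"parameters": [{"name": "experiment"}]},
                    "container": {
                        "image": "alpine:3.20",
                        "command": ["sh", "-c"],
                        "args": [
                            "echo run {{inputs.parameters.experiment}} "
                            "in {{workflow.parameters.appNamespace}}"
                        ],
                    },
                },
            ],
        },
    }


if __name__ == "__main__":
    workflow = chaos_workflow([["pod-delete"], ["pod-cpu-hog", "pod-network-latency"]])
    print(json.dumps(workflow, indent=2))
```

With the output saved to a file, something like `argo submit workflow.json -p appNamespace=staging` would override the parameter at submission time, which is the kind of external parameterization the paragraph above refers to.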

It also supports GitOps, so I think it will be a good fit for teams that adopt a Kubernetes-native development style.

Which tool should I choose?

Of the three options, I decided to use Chaos Toolkit and LitmusChaos.

Here's a summary of my impressions of each tool:

Tool name	Adopted	Reason	Suitable environments and use cases
Chaos Toolkit	🆗	Fault injection into the load balancer is possible.	When you need to cover targets beyond VMs and Kubernetes.
Chaos Mesh	🆖	Cost increases at scale because it is deployed as a DaemonSet.	When you primarily use a GKE environment and want to run a variety of experiments with simple, intuitive operation.
LitmusChaos	🆗	Can be started small and managed at large scale. Feature-rich.	When you primarily use a GKE environment and need to manage multiple clusters or run complex experiments.

We also found it difficult to use these tools to inject failures directly into managed services and databases, both of which were on our list of failure scenarios. We therefore decided to substitute experiments that inject network disconnection and increased latency from the application Pods on Kubernetes.
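One way to picture this substitution is a LitmusChaos `ChaosEngine` that runs the `pod-network-latency` experiment against the application Pods and restricts the delay to the database address. This is a sketch under assumptions, not the configuration we actually use: the engine name, namespace, label, service account `litmus-admin`, and the IP `10.0.0.5` are hypothetical, and the field and environment variable names should be verified against the Litmus version in use.

```python
import json


def db_latency_engine(namespace: str, app_label: str, db_ip: str,
                      latency_ms: int = 2000, duration_s: int = 60) -> dict:
    """ChaosEngine that delays traffic from the application Pods to one IP,
    as a stand-in for a slow or unreachable managed database."""
    return {
        "apiVersion": "litmuschaos.io/v1alpha1",
        "kind": "ChaosEngine",
        "metadata": {"name": "db-latency", "namespace": namespace},
        "spec": {
            "engineState": "active",
            "appinfo": {
                "appns": namespace,
                "applabel": f"app={app_label}",
                "appkind": "deployment",
            },
            # Assumed service account; use whichever one your install provides.
            "chaosServiceAccount": "litmus-admin",
            "experiments": [
                {
                    "name": "pod-network-latency",
                    "spec": {
                        "components": {
                            # Kubernetes env values must be strings.
                            "env": [
                                {"name": "NETWORK_LATENCY", "value": str(latency_ms)},
                                {"name": "TOTAL_CHAOS_DURATION", "value": str(duration_s)},
                                {"name": "DESTINATION_IPS", "value": db_ip},
                            ]
                        }
                    },
                }
            ],
        },
    }


if __name__ == "__main__":
    print(json.dumps(db_latency_engine("demo", "demo", "10.0.0.5"), indent=2))
```

Limiting the fault to the destination address keeps the rest of the Pod's traffic untouched, so the experiment approximates a degraded dependency rather than a degraded Pod.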

Summary

The best tool depends on the architecture of the target system and on what you want to achieve through chaos engineering.

It's best to start with small-scale experiments to find the tools and operating methods that are best suited to your product.

In conclusion

This article introduced tools for implementing chaos engineering on Google Cloud.

Next time, I'll write about how to actually run chaos tests with these tools, and the problems I encountered along the way.

If you are interested in SRG, please contact us here.

Recruitment information - CyberAgent SRG #ca_srg

About SRG: SRG (Service Reliability Group) operates under the vision of "improving reliability across media businesses" and promotes the introduction of SRE into media businesses as a cross-functional SRE, working to improve reliability. Our work primarily revolves around the following three areas: Gathering and deploying technical know-how from each business.

https://ca-srg.dev/careers