Introducing Chaos Engineering
My name is Kataoka and I work in the Service Reliability Group (SRG) of the Media Headquarters.
SRG (Service Reliability Group) is a group that mainly provides cross-functional support for the infrastructure of our media services: improving existing services, launching new ones, and contributing to OSS.
In this post, I will cover what needs to be sorted out before introducing chaos engineering and what we can expect from it.
I plan to write a follow-up about the observations and problems I run into once we are actually using it.
What is Chaos Engineering?
Since Netflix developed Chaos Monkey, the idea of designing systems with failure in mind (Design for Failure) has spread.
With the rapid spread of public clouds and the rise of microservice configurations, systems are becoming more complex, and changes are occurring rapidly, increasing the possibility of unpredictable failures.
Chaos engineering involves injecting failures into a system (service) to identify its weaknesses in the face of unknown failures, with the aim of ensuring resilience.
While some organizations already run it in production, many have only taken it as far as staging, or want to adopt it but face opposition from those around them. For a variety of reasons, adoption often does not go smoothly.
The idea of injecting faults can raise concerns and objections, and it is sometimes pictured as causing faults at random. However, the injected faults are controllable, so I think these concerns can be overcome with sufficient testing. On the other hand, if introducing chaos engineering becomes the goal in itself, it may cause unnecessary faults in the production environment, and the original meaning and purpose may be lost.
It will be a long road before it works, but once it does, we believe it will clarify the issues we need to solve and lead to improved service reliability.
Before Implementation
In this article, I will not cover the actual rollout to a product. Instead, I will describe the preparation for introducing it to Bucketeer, our company's OSS product (a feature flag management and A/B testing platform). I met with the product owner of Bucketeer to discuss what we wanted to achieve through chaos engineering in order to improve its reliability. Here is a brief summary of what we learned.
- We want to provide specific grounds for SLI/SLO and ensure that they can be guaranteed at all times.
- We want to increase resilience and prove stability to users.
Chaos engineering also has many other benefits, such as incident response training, reducing MTTR, detecting vulnerabilities, verifying SPOFs, and re-examining problems that you intuitively believe you have already solved.
Chaos engineering is guided by the following five principles:
- Minimize Blast Radius
- Build a Hypothesis around Steady State Behavior
- Vary Real-World Events
- Run Experiments in Production
- Automate Experiments to Run Continuously
The book "Introduction to Chaos Engineering" breaks these down into steps. I've created a diagram of the process, which I'll attach below.

First of all, I felt that it was particularly important to align with the product side on the definition of steady state, the design of hypotheses, and the definition of variables.
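To make the idea of a steady state and a hypothesis more concrete, here is a minimal sketch in Python. It is not how Bucketeer defines its steady state. The class name `SteadyState`, the two thresholds (error rate and 95th percentile latency), and the numbers in the usage note are all assumptions chosen for illustration. The point is only that the hypothesis "the system stays in its steady state while a fault is injected" can be written as a check that either passes or fails.

```python
from dataclasses import dataclass
from statistics import quantiles


@dataclass
class SteadyState:
    """Thresholds agreed with the product side (hypothetical fields)."""
    max_error_rate: float      # e.g. 0.01 means at most 1% failed requests
    max_p95_latency_ms: float  # ceiling for the 95th percentile latency


def p95(samples_ms):
    """Return the 95th percentile of latency samples (needs >= 2 samples)."""
    # quantiles(..., n=100) returns 99 cut points; index 94 is the 95th.
    return quantiles(samples_ms, n=100)[94]


def hypothesis_holds(steady, latencies_ms, errors, total):
    """True if the observed window stays within the steady-state thresholds."""
    error_rate = errors / total
    return (
        error_rate <= steady.max_error_rate
        and p95(latencies_ms) <= steady.max_p95_latency_ms
    )
```

For example, with `SteadyState(max_error_rate=0.01, max_p95_latency_ms=200.0)`, a window observed during fault injection with 50 errors out of 1000 requests would fail the check, which tells us the hypothesis was wrong and there is something to fix.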
I also believe that the following preparations are particularly important when advancing chaos engineering:
- There is no difference between the production and staging (testing) environments
- Logging, metrics, and tracing
- Load test environment (an environment where requests similar to those in production can be sent)
- Introducing SLI/SLO
Even if you think you are following the principles perfectly, if you neglect these things, all the effort you put into implementation will be wasted.
Conversely, by advancing the introduction of chaos engineering, it will be possible to identify and improve what is lacking in SRE activities.
This time, we decided to use Chaos Mesh because we wanted to implement it on services running on GKE.
I will also list some other chaos engineering tools that caught my eye.
Managed Service
- Gremlin
- AWS Fault Injection Simulator
- Azure Chaos Studio
Self-Hosted
- Chaos Mesh
- Chaos Toolkit
- Litmus Chaos
- PowerfulSeal (Kraken)
This is not limited to chaos engineering, but tools do not solve everything. I think it is important to use chaos engineering activities and operations to broaden our perspective, for example by reviewing our usual rules of thumb and habits, and to engage with the system as a team.
Trying out Chaos Mesh
Let's try a simple failure locally using Chaos Mesh.
Chaos Mesh allows you to inject various failures into Kubernetes and Hosts.
In particular, I would like to try the following:
- Simulate GCP Faults
- Simulate Pod Faults (simulate pod or container failure scenarios)
- Simulate Stress Scenarios (apply simple loads to CPU and memory)
- Simulate HTTP Faults (simulate fault scenarios during HTTP requests and responses)
As a test, let's try using Pod Kill with Pod Fault in an environment where three Nginx Pods are running.

Set the target Namespace to default, specify the target label app: nginx, and inject a failure that will randomly kill one Pod.
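For reference, the same settings can be expressed as a Chaos Mesh `PodChaos` object. The following is a minimal sketch, not the exact configuration used here: it builds the object as a Python dict and prints it as JSON, which `kubectl apply -f` accepts in addition to YAML. The object name `nginx-pod-kill` and the choice to create it in the `default` namespace are assumptions for illustration.

```python
import json


def pod_kill_manifest(name, namespace, labels):
    """Build a PodChaos object that kills one randomly chosen matching Pod."""
    return {
        "apiVersion": "chaos-mesh.org/v1alpha1",
        "kind": "PodChaos",
        # The object name and its namespace are hypothetical choices.
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "action": "pod-kill",  # the Pod Kill fault
            "mode": "one",         # pick one Pod at random from the matches
            "selector": {
                "namespaces": [namespace],
                "labelSelectors": labels,
            },
        },
    }


manifest = pod_kill_manifest("nginx-pod-kill", "default", {"app": "nginx"})
print(json.dumps(manifest, indent=2))
```

Saving the printed JSON to a file and applying it with `kubectl apply -f` should have the same effect as configuring the experiment in the dashboard, assuming Chaos Mesh is installed in the cluster.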

You will immediately see that the pod has been killed.

You can confirm whether the Pod failure was actually caused by Chaos Mesh by looking at the events.
Depending on the type of failure, you can specify a duration, run it continuously, or schedule it. So instead of a single immediate run, you can keep injecting failures periodically over a long period and combine this with load testing, which makes the behavior of the system easier to understand.
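As a rough illustration of scheduled execution, Chaos Mesh provides a `Schedule` object that wraps a fault definition and runs it on a cron expression. The sketch below builds one in Python and prints it as JSON. The object name, the cron expression (every 10 minutes), and the `historyLimit` and `concurrencyPolicy` values are assumptions for illustration, so please check them against the Chaos Mesh version you are running.

```python
import json


def scheduled_pod_kill(name, namespace, labels, cron):
    """Build a Schedule object that repeats a pod-kill fault on a cron schedule."""
    return {
        "apiVersion": "chaos-mesh.org/v1alpha1",
        "kind": "Schedule",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "schedule": cron,               # standard cron expression
            "concurrencyPolicy": "Forbid",  # do not start a run while one is active
            "historyLimit": 5,              # how many finished runs to keep
            "type": "PodChaos",
            "podChaos": {
                "action": "pod-kill",
                "mode": "one",
                "selector": {
                    "namespaces": [namespace],
                    "labelSelectors": labels,
                },
            },
        },
    }


# Hypothetical values: kill one nginx Pod every 10 minutes.
schedule = scheduled_pod_kill(
    "nginx-pod-kill-every-10m", "default", {"app": "nginx"}, "*/10 * * * *"
)
print(json.dumps(schedule, indent=2))
```

Running something like this while a load test is sending traffic is one way to observe how the system behaves under repeated failures rather than a single one.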
This was a very simple example, but you can deepen your understanding of the system by watching how it behaves: for example, what happens to requests under constant traffic and what happens to latency.
For this reason, the advance preparation and hypothesis design mentioned earlier are extremely important.
Conclusion
The following articles and books were extremely helpful when I started Chaos Engineering. If you're interested, please give them a read. I plan to continue learning and thinking so I don't leave it half-baked. From now on, I'd like to write articles about the things I've tested in practice and the things I've come across through operation.
*If you are interested in introducing chaos engineering or are already practicing it, we would love to exchange information with you.
References
SRG is looking for people to work with us.
If you're interested, please contact us here.