Introducing Chaos Engineering

My name is Kataoka, and I work in the Service Reliability Group (SRG) of the Media Headquarters.
SRG (Service Reliability Group) mainly provides cross-functional support for the infrastructure of our media services: improving existing services, launching new ones, contributing to OSS, and so on.
This is the 14th article in the CyberAgent Group SRE Advent Calendar 2024.
In this article, I will cover what needs to be sorted out when introducing chaos engineering and what we can expect from it.
I plan to post a follow-up about what I notice after actually using it and any problems I run into.

What is Chaos Engineering?


Starting with the development of Netflix's Chaos Monkey, the idea of designing systems with failure in mind (Design for Failure) has become widespread.
With the rapid spread of public clouds and microservice configurations, systems are becoming more complex, and changes are occurring rapidly, increasing the possibility of unpredictable failures.
Chaos engineering means deliberately injecting failures yourself to find where a system (service) is weak against unknown failures, with the goal of ensuring resilience.
 
Some organizations already run it in production, but many have only taken it as far as staging, or want to adopt it and face opposition from those around them. Adoption often does not go smoothly, for a variety of reasons.
Injecting faults can raise concerns and objections, and it can give the impression of causing failures at random. However, the faults being injected are themselves controllable, so I think such concerns can be overcome with sufficient testing. On the other hand, if adoption becomes the goal in itself, you may cause unnecessary failures in production and lose sight of the original purpose.
 
It will be a long road before it actually works, but once it does, we believe it will clarify the issues we need to solve and lead to improved service reliability.
 

Getting started


In this article, I will not cover actually rolling it out to a product. Instead, I will summarize a discussion I had with the product owner of Bucketeer, our company's OSS product (a feature flag management and A/B testing platform) and the one I am introducing chaos engineering to, about what we want to achieve through chaos engineering in order to increase Bucketeer's reliability.
  • We want to give our SLIs/SLOs a concrete basis and ensure that they can be met at all times
  • We want to increase resilience and demonstrate stability to users
Chaos engineering has many other benefits as well, such as troubleshooting training, reducing MTTR, detecting vulnerabilities, verifying SPOFs, and testing problems that you intuitively believe you have already solved.
 
First, I think the Principles of Chaos Engineering are a helpful guide for moving forward.
They define five principles:
  1. Minimize Blast Radius
  2. Build a Hypothesis around Steady State Behavior
  3. Vary Real-world Events
  4. Run Experiments in Production
  5. Automate Experiments to Run Continuously
 
The book "Introduction to Chaos Engineering," published by C&R Research Institute, breaks these down into smaller steps. I've created a diagram of how to put it into practice, which I'll attach below.
 
First of all, I felt it was particularly important to align with the product side on the definition of steady state, design of hypotheses, and definition of variables.
In addition, I believe the following preparations are particularly important when moving forward with chaos engineering:
  • Parity between the production and staging (testing) environments
  • Logging, metrics, and tracing in place
  • A load test environment (an environment where requests can be sent in the same way as in production)
  • SLIs/SLOs in place
 
Even if you think you are following the principles perfectly, if you neglect these things, all the effort you put into the implementation will be wasted.
Conversely, the introduction of chaos engineering will help identify what is lacking in SRE activities and lead to their improvement.
 
This time, we adopted Chaos Mesh because we wanted to introduce it to services running on GKE.
Here are some other chaos engineering tools that caught my eye:
Managed services
  • Gremlin
  • AWS Fault Injection Simulator
  • Azure Chaos Studio
Self-hosted tools
  • Chaos Mesh
  • Chaos Toolkit
  • Litmus Chaos
  • PowerfulSeal (Kraken)
 
This isn't limited to chaos engineering, but tools don't solve everything. I think it's important for the whole team to use chaos engineering activities and operations as a chance to broaden their perspective and face the system together, for example by revisiting their usual rules of thumb and habits.
 

Trying out Chaos Mesh


Let's try out some simple failures locally using Chaos Mesh.
Chaos Mesh allows you to inject various failures into Kubernetes and into physical hosts.
In particular, I would like to try the following:
  • Simulate GCP Faults
  • Simulate Pod Faults
  • Simulate Stress Scenarios (apply simple load to CPU and memory)
  • Simulate HTTP Faults
 
As an experiment, let's try the Pod Kill action of Pod Fault in an environment where three Nginx Pods are running.
Set the target Namespace to default, specify the target label app: nginx, and inject a failure that randomly kills one Pod.
You will immediately see that the pod has been killed.
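
The settings described above can also be written down as a PodChaos resource. The following is a minimal sketch in Python that builds such a manifest and prints it as JSON. The resource name nginx-pod-kill and the helper function are hypothetical, and the field names reflect my understanding of the Chaos Mesh PodChaos resource, so please check them against the version you have installed.

```python
import json


def build_pod_kill_manifest(name, namespace, labels):
    """Return a PodChaos manifest (as a dict) that kills one matching Pod."""
    return {
        "apiVersion": "chaos-mesh.org/v1alpha1",
        "kind": "PodChaos",
        # The resource name is hypothetical; any valid Kubernetes name works.
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "action": "pod-kill",
            # "one" selects a single Pod at random from those that match.
            "mode": "one",
            "selector": {
                "namespaces": [namespace],
                "labelSelectors": labels,
            },
        },
    }


# Target: Pods labeled app: nginx in the default Namespace.
manifest = build_pod_kill_manifest("nginx-pod-kill", "default", {"app": "nginx"})
manifest_json = json.dumps(manifest, indent=2)

if __name__ == "__main__":
    print(manifest_json)
```

Since kubectl accepts JSON as well as YAML, the output can be saved to a file and applied with kubectl apply -f, which makes the experiment easy to keep under version control.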
 
You can confirm whether the Pod failure was really caused by Chaos Mesh by looking at the events.
 
Depending on the type of fault, you can specify a duration, run it continuously, or schedule its execution. So rather than only running a fault once on the spot, you can keep injecting faults periodically over a long period and combine this with load testing, which makes the behavior of the system easier to understand.
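
As a rough illustration of scheduled execution, the sketch below wraps a Pod Kill fault in a Schedule resource that fires on a cron expression. The resource name, the 10-minute interval, and the history limit are hypothetical values chosen for the example, and the field names are based on my understanding of the Chaos Mesh Schedule resource, so treat this as a starting point rather than a verified configuration.

```python
import json


def build_scheduled_pod_kill(name, namespace, labels, cron):
    """Return a Schedule manifest (as a dict) that repeats a pod-kill fault."""
    return {
        "apiVersion": "chaos-mesh.org/v1alpha1",
        "kind": "Schedule",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # Standard cron expression that decides when the fault is injected.
            "schedule": cron,
            # Assumed value: how many finished runs to keep for inspection.
            "historyLimit": 5,
            # Do not start a new run while the previous one is still active.
            "concurrencyPolicy": "Forbid",
            # The kind of fault to create, with its spec under the matching key.
            "type": "PodChaos",
            "podChaos": {
                "action": "pod-kill",
                "mode": "one",
                "selector": {
                    "namespaces": [namespace],
                    "labelSelectors": labels,
                },
            },
        },
    }


# Hypothetical example: kill one nginx Pod every 10 minutes.
schedule = build_scheduled_pod_kill(
    "nginx-pod-kill-every-10m", "default", {"app": "nginx"}, "*/10 * * * *"
)
schedule_json = json.dumps(schedule, indent=2)

if __name__ == "__main__":
    print(schedule_json)
```

Running a schedule like this while a load test sends steady traffic is one way to observe the system's behavior over a longer period instead of from a single run.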
The example here is very simple, but you can deepen your understanding of the system by watching how it behaves during the fault: what happens to requests under constant traffic, how latency changes, and so on.
For this reason, the advance preparations and hypothesis design mentioned earlier are extremely important.
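
To make the idea of a steady-state hypothesis a little more concrete, here is a small sketch that checks request samples collected during an experiment against thresholds. Everything in it is an assumption for illustration: the Sample and SteadyState types, the rule that any status below 500 counts as a success, and the threshold values are not taken from Bucketeer or from any real SLO.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    status: int        # HTTP status code of one request
    latency_ms: float  # observed latency of that request


@dataclass
class SteadyState:
    min_success_rate: float    # e.g. 0.99 means 99% of requests must succeed
    max_p95_latency_ms: float  # upper bound for the 95th percentile latency


def percentile(values, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    ordered = sorted(values)
    # Integer ceiling of pct * n / 100, which avoids float rounding issues.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def evaluate(samples, steady_state):
    """Compare the observed samples with the steady-state thresholds."""
    if not samples:
        raise ValueError("no samples collected")
    # Assumption: responses with a status below 500 count as successful.
    successes = sum(1 for s in samples if s.status < 500)
    success_rate = successes / len(samples)
    p95 = percentile([s.latency_ms for s in samples], 95)
    holds = (
        success_rate >= steady_state.min_success_rate
        and p95 <= steady_state.max_p95_latency_ms
    )
    return {"success_rate": success_rate, "p95_latency_ms": p95, "holds": holds}


# Hypothetical data: 19 fast successes and 1 slow failure during the fault.
samples = [Sample(200, 40.0) for _ in range(19)] + [Sample(503, 900.0)]
result = evaluate(samples, SteadyState(min_success_rate=0.99, max_p95_latency_ms=200.0))

if __name__ == "__main__":
    print(result)
```

Writing the thresholds down before injecting the fault keeps the experiment honest: the hypothesis either holds or it does not, and a failed check points to something worth improving.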
 

Conclusion


The following articles and books were very helpful in getting started with chaos engineering. If you are interested, please read them. I intend to keep learning and thinking so that this effort doesn't stop halfway. Going forward, I would like to write articles about what I have verified in practice and the pitfalls I have run into during operation.
*If you are interested in introducing chaos engineering or are already practicing it, we would love to exchange information with you.
 
References
 
SRG is looking for people to work with us. If you are interested, please contact us here.