Introducing Chaos Engineering

2024/11/25 15:12 / 2024/12/13 16:41

My name is Kataoka and I work in the Service Reliability Group (SRG) of the Media Headquarters.

The SRG (Service Reliability Group) mainly provides cross-functional infrastructure support for our media services: improving existing services, launching new ones, and contributing to OSS.

This is the 14th article in the CyberAgent Group SRE Advent Calendar 2024.

In this article, I would like to write about what needs to be organized when introducing chaos engineering and what we can expect from it.

I'll post a sequel about what I notice after actually using it and any problems I run into.

Contents: What is Chaos Engineering? / Getting started / Trying out Chaos Mesh / Conclusion

What is Chaos Engineering?

Starting with the development of Netflix's Chaos Monkey, the idea of designing systems with failure in mind (Design for Failure) has become widespread.

With the rapid spread of public clouds and microservice configurations, systems are becoming more complex, and changes are occurring rapidly, increasing the possibility of unpredictable failures.

Chaos engineering means deliberately injecting failures yourself in order to identify a system's (service's) weaknesses against unknown failures, with the aim of ensuring resilience.

While some organizations already practice it in production, many have only reached the staging stage, or want to adopt it but face opposition from those around them. For a variety of reasons, adoption often does not progress smoothly.

Injecting faults can raise concerns and objections, and it can give the impression of causing failures at random. However, the faults being injected are themselves controllable, so I think such concerns and objections can be overcome with sufficient testing. On the other hand, if adoption becomes the goal in itself, it may cause unnecessary failures in the production environment and lose its original meaning and purpose.

It will be a long road before it actually works, but once it does, we believe it will clarify the issues we need to solve and lead to improved service reliability.

Getting started

In this article, I will not cover actually introducing chaos engineering into a product. Instead, I will summarize a discussion I had with the product owner of Bucketeer, our company's OSS feature flag management and A/B testing platform where I am working on the introduction, about what we want to achieve through chaos engineering in order to increase Bucketeer's reliability.

We want to give our SLIs/SLOs a concrete basis and ensure that they can be met at all times

We want to increase resilience and demonstrate stability to users

Chaos engineering has many other benefits as well: it serves as troubleshooting training, helps reduce MTTR, detects vulnerabilities, verifies SPOFs, and is a good way to test problems that you only intuitively believe you have solved.

CyberAgent releases OSS for feature flag management and A/B testing platform "Bucketeer"

We post our company's news releases, including information about CyberAgent's new initiatives and services.

https://www.cyberagent.co.jp/news/detail/id=28068

First, I think the Principles of Chaos Engineering are a helpful guide for how to proceed.

It defines five principles:

Minimize Blast Radius

Build a Hypothesis around Steady State Behavior

Vary Real-world Events

Run Experiments in Production

Automate Experiments to Run Continuously

The book "Introduction to Chaos Engineering," published by C&R Research Institute, breaks these down into smaller steps. I've created a diagram of how to put them into practice, which I'll attach below.

First of all, I felt it was particularly important to align with the product side on the definition of steady state, design of hypotheses, and definition of variables.
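To make the idea of agreeing on a steady state and a hypothesis a little more concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the class names, the two chosen metrics (success rate and p99 latency), and the threshold values are hypothetical and are not taken from Bucketeer or from the book. The point is only that a steady state can be written down as explicit, checkable thresholds that both the SRE side and the product side agree on before any fault is injected.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SteadyState:
    """Hypothetical steady-state definition; the thresholds are illustrative."""
    min_success_rate: float    # e.g. 0.999 means 99.9% of requests succeed
    max_p99_latency_ms: float  # upper bound on p99 latency in milliseconds


@dataclass(frozen=True)
class Observation:
    """Metrics observed while a fault is being injected."""
    total_requests: int
    failed_requests: int
    p99_latency_ms: float

    @property
    def success_rate(self) -> float:
        if self.total_requests == 0:
            return 0.0  # with no traffic, the hypothesis cannot be confirmed
        return 1 - self.failed_requests / self.total_requests


def hypothesis_holds(steady: SteadyState, observed: Observation) -> bool:
    """Return True if the observed metrics stay within the steady state."""
    return (
        observed.success_rate >= steady.min_success_rate
        and observed.p99_latency_ms <= steady.max_p99_latency_ms
    )


# Hypothesis: "Even if one Pod is killed, the service stays in its steady state."
steady = SteadyState(min_success_rate=0.999, max_p99_latency_ms=300.0)
during_fault = Observation(total_requests=10_000, failed_requests=4, p99_latency_ms=250.0)
print(hypothesis_holds(steady, during_fault))
```

In practice the observed values would come from your metrics system rather than being hard-coded, and the thresholds would likely be derived from the SLOs.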

In addition, I believe that the following preparations are particularly important when advancing chaos engineering:

No differences between the production and staging (testing) environments

Logging, metrics, and tracing

Load test environment (an environment where requests can be sent in the same way as in production)

Introducing SLI/SLO

Even if you think you are following the principles perfectly, if you neglect these things, all the effort you put into the implementation will be wasted.

Conversely, the introduction of chaos engineering will help identify what is lacking in SRE activities and lead to their improvement.

This time, we adopted Chaos Mesh because we wanted to introduce it to services running on GKE.

Here are some other chaos engineering tools that caught my eye:

Managed Service

Gremlin

AWS Fault Injection Simulator

Azure Chaos Studio

Self-Hosted

Chaos Mesh

Chaos Toolkit

Litmus Chaos

PowerfulSeal (Kraken)

This isn't limited to chaos engineering, but tools don't solve everything. I think it's important for the whole team to broaden its perspective and face the system together through chaos engineering activities and operations, for example by revisiting its usual rules of thumb and habits.

Trying out Chaos Mesh

Let's try out some simple failures locally using Chaos Mesh.

Chaos Mesh

Chaos Mesh brings various types of fault simulation to Kubernetes and has an enormous capability to orchestrate fault scenarios. It helps you conveniently simulate various abnormalities that might occur in reality during the development, testing, and production environments and find potential problems in the system.

https://chaos-mesh.org/

Chaos Mesh allows you to inject various failures into Kubernetes and Hosts.

In particular, I would like to try the following:

Simulate GCP Faults

Simulate Pod Faults

Simulate Stress Scenarios (apply simple load to CPU and memory)

Simulate HTTP Faults

As an experiment, let's try the Pod Kill action of Pod Fault in an environment where three Nginx Pods are running.

Set the target Namespace to default, specify the target label app: nginx, and inject a failure that will randomly kill one Pod.
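For reference, the same settings can also be expressed as a PodChaos resource. The following is a minimal sketch, not necessarily how the experiment in this article was created. It builds the manifest as a Python dictionary and prints it as JSON. The field names follow my understanding of the Chaos Mesh v1alpha1 PodChaos API, so please check them against the documentation for the version you have installed. The resource name nginx-pod-kill and the helper function are hypothetical.

```python
import json


def pod_kill_manifest(name: str, namespace: str, labels: dict[str, str]) -> dict:
    """Build a Chaos Mesh PodChaos manifest that kills one matching Pod."""
    return {
        "apiVersion": "chaos-mesh.org/v1alpha1",
        "kind": "PodChaos",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "action": "pod-kill",
            "mode": "one",  # choose a single Pod at random among the matches
            "selector": {
                "namespaces": [namespace],  # target Namespace
                "labelSelectors": labels,   # target label, e.g. app: nginx
            },
        },
    }


# "nginx-pod-kill" is a hypothetical resource name chosen for this example.
manifest = pod_kill_manifest("nginx-pod-kill", "default", {"app": "nginx"})
print(json.dumps(manifest, indent=2))
```

Since kubectl accepts JSON as well as YAML, the printed output could be piped to kubectl apply -f - in a test cluster where Chaos Mesh is installed.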

You will immediately see that the pod has been killed.

You can confirm whether the Pod failure was really caused by Chaos Mesh by looking at the Kubernetes events.

Depending on the type of fault, you can specify a duration, run it continuously, or schedule its execution. Instead of running a fault only once and immediately, you can keep injecting faults periodically over a long period and combine this with load testing, which makes it easier to understand the behavior of the system.
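As a rough illustration of scheduled execution, the sketch below builds a Chaos Mesh Schedule resource that would kill one Pod labeled app: nginx in the default Namespace every 10 minutes. The field names (schedule, type, historyLimit, concurrencyPolicy, podChaos) follow my understanding of the Chaos Mesh v1alpha1 Schedule API and should be verified against the documentation for your version. The resource name, cron expression, and history limit are hypothetical values chosen for this example.

```python
import json


def scheduled_pod_kill(name: str, namespace: str, cron: str, labels: dict[str, str]) -> dict:
    """Build a Chaos Mesh Schedule manifest that runs a pod-kill periodically."""
    return {
        "apiVersion": "chaos-mesh.org/v1alpha1",
        "kind": "Schedule",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "schedule": cron,               # cron expression for when to inject
            "type": "PodChaos",             # kind of experiment to create
            "historyLimit": 5,              # illustrative: keep the last 5 runs
            "concurrencyPolicy": "Forbid",  # do not start overlapping runs
            "podChaos": {
                "action": "pod-kill",
                "mode": "one",
                "selector": {
                    "namespaces": [namespace],
                    "labelSelectors": labels,
                },
            },
        },
    }


# Hypothetical name and interval: kill one nginx Pod every 10 minutes.
schedule = scheduled_pod_kill(
    "nginx-pod-kill-every-10m", "default", "*/10 * * * *", {"app": "nginx"}
)
print(json.dumps(schedule, indent=2))
```

Running something like this while a load test sends steady traffic is one way to observe how requests and latency behave across repeated failures.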

This time I have given a very simple example, but you can deepen your understanding of the system by paying attention to how it behaves: for example, what happens to requests under constant traffic and how latency changes.

For this reason, the advance preparations and hypothesis design mentioned earlier are extremely important.

Conclusion

The following articles and books were very helpful in getting started with chaos engineering. If you are interested, please read them. I would like to keep learning and thinking so that this effort does not end halfway. Going forward, I plan to write articles about what I have verified in practice and the problems I have run into during operation.

*If you are interested in introducing chaos engineering or are already practicing it, we would love to exchange information with you.

References

Chaos Engineering

As software and all other systems develop, they inevitably become more complex. This book explains the basic theory and principles of chaos engineering and explains how organizations can embrace complexity, discover weaknesses in systems, and develop the ability to deal with failures with confidence. It introduces examples from Slack, Google, Microsoft, LinkedIn, and Capital One, companies where software is the foundation of their business, and describes the implementation of chaos engineering programs centered around game days, the challenges of selecting and automating experiments, the design and implementation of continuous verification, and examples of application to databases and security. In addition to the author, a pioneer who launched a chaos engineering team at Netflix, leaders of various organizations provide a multifaceted explanation of chaos engineering, making this a must-have book not only for engineers but also for anyone involved in any "complex system."

https://www.oreilly.co.jp//books/9784873119885/

Book details | C&R Research Institute Co., Ltd.

C&R Research Institute is a publishing company founded in 1991. We strive every day to publish quality books that meet the needs of our readers, focusing on computer and business books.

https://www.c-r.com/book/detail/1443

Principles of chaos engineering

Chaos engineering is a discipline that involves running experiments on a system to gain confidence in its ability to withstand instability in a production environment.

https://principlesofchaos.org/ja/

Chaos Engineering – Netflix TechBlog