Multi-region load testing using k6
This is Ohara (@No_oLimits) from the Service Reliability Group (SRG) of the Media Headquarters.
#SRG (Service Reliability Group) is a group that mainly provides cross-cutting infrastructure support for our media services, improving existing services, launching new ones, and contributing to OSS.
In this article, we will introduce a multi-region load test using k6 (k6-operator) conducted on a certain overseas service.
- Introduction
- Load Testing Environment Architecture
  - Overview
  - Details - Scenario Execution
  - Details - Monitoring
- Troubleshooting
  - k6 won't start
  - k6 dies from OOM
  - Prometheus load spikes
- What's good about using k6
- Areas for improvement
- Conclusion
Introduction
- k6 is a load testing tool developed by Grafana.
- Scenarios are written in JavaScript/TypeScript (see the minimal example after this list).
- In the service I'm in charge of, it is only used for API testing, but it can also be used for browser testing.
- It supports WebSocket and gRPC protocols in addition to HTTP.
- The built-in functionality is sufficient on its own, and many extensions are available for handling more complex load scenarios.
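As a quick reference, a minimal k6 scenario looks roughly like this; the endpoint, VU count, and duration are illustrative placeholders, not values from this test.

```javascript
import http from 'k6/http';
import { check, sleep } from 'k6';

export const options = {
  vus: 50,          // number of virtual users (illustrative)
  duration: '1m',   // test duration (illustrative)
};

export default function () {
  // Placeholder endpoint; each VU runs this function in a loop.
  const res = http.get('https://example.com/api/health');
  check(res, { 'status is 200': (r) => r.status === 200 });
  sleep(1);
}
```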
Load Testing Environment Architecture
Overview

- k6 is deployed in three regions and applies load to the target system in the corresponding region.
- To run the k6 Operator, each region has a GKE Autopilot cluster.
- Details of the target system under load are omitted here.
Details - Scenario Execution
Installing k6-operator gives you a custom resource called TestRun.
runner
Load scenarios are uploaded to GCS in advance.
When a k6 Pod starts, an initContainer fetches the scenario from GCS and places it in an emptyDir volume, and the k6 container reads it from there.
Pods are granted access to GCS by linking a GSA and a KSA through Workload Identity.
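As a rough sketch, a TestRun that pulls the scenario from GCS through an initContainer might look like the following. The bucket, paths, image, and service account name are placeholders, and the field names follow the k6-operator TestRun spec as I understand it.

```yaml
apiVersion: k6.io/v1alpha1
kind: TestRun
metadata:
  name: load-test                       # placeholder name
spec:
  parallelism: 10                       # number of runner Pods (illustrative)
  script:
    localFile: /scripts/scenario.js     # scenario path inside the shared emptyDir
  runner:
    serviceAccountName: k6-runner       # KSA linked to a GSA via Workload Identity (placeholder)
    initContainers:
      - name: fetch-scenario
        image: gcr.io/google.com/cloudsdktool/cloud-sdk:slim   # any image with gsutil works
        command: ["gsutil", "cp", "gs://<scenario-bucket>/scenario.js", "/scripts/scenario.js"]
        volumeMounts:
          - name: scripts
            mountPath: /scripts
    volumeMounts:
      - name: scripts
        mountPath: /scripts
    volumes:
      - name: scripts
        emptyDir: {}
```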
Details - Monitoring
k6 prints summary stats when a test finishes, but for distributed tests the metrics need to be aggregated somewhere.
This time we built the monitoring environment with Prometheus + Grafana.
The reason is that k6 supports sending metrics via Prometheus remote write out of the box, whereas sending metrics to most other backends requires building k6 with an extension.
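For reference, enabling the built-in remote-write output in a TestRun looks roughly like this; the Prometheus URL is a placeholder for the in-region Prometheus endpoint.

```yaml
spec:
  # Ask k6 to stream metrics to Prometheus via remote write.
  arguments: -o experimental-prometheus-rw
  runner:
    env:
      - name: K6_PROMETHEUS_RW_SERVER_URL
        value: http://prometheus.monitoring.svc.cluster.local:9090/api/v1/write   # placeholder URL
```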

- Prometheus is deployed in each region
- k6 sends metrics to the Prometheus in its region
- The Grafana data source is thanos-query
- Thanos Query reaches each Prometheus API through the Thanos sidecar
- The internal LB is made accessible across clusters (*1)
The key point is that we use Thanos to aggregate Prometheus metrics across regions.
Thanos is a set of components for scaling Prometheus and has many other features.
This allows metrics from multiple regions to be aggregated and monitored in one place.
*1) Making the internal LB accessible across clusters
Thanos Query is configured to send its requests to the LB IP.
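As an illustration of the footnote above, the thanos-query container can be pointed at each region's sidecar through the internal LB addresses; the IPs and version below are placeholders.

```yaml
containers:
  - name: thanos-query
    image: quay.io/thanos/thanos:<version>    # placeholder version
    args:
      - query
      - --store=10.0.1.10:10901   # region A sidecar gRPC endpoint via internal LB (placeholder IP)
      - --store=10.0.2.10:10901   # region B (placeholder IP)
      - --store=10.0.3.10:10901   # region C (placeholder IP)
```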
Troubleshooting
k6 won't start
This is an especially easy trap to fall into the first time you run it.
Since basically nothing starts, there are no Pod logs to look at.
In fact, though, the k6-operator manager is emitting logs.
If nothing starts, check the manager's logs.
The most common reason it fails to start is a mistake in the manifest.
k6 dies from OOM
This time, the test was run with 500,000 virtual users.
The plan was to define several attributes per user and control each user's behavior during the test based on those attributes, but shortly after the test started, the Pods began dying one after another from OOM.
The attribute definitions were stored in a file that was read at startup, before the test ran. It appears the OOM occurred because every VU in the k6 process read the 500,000-user file into its own memory.
SharedArray
SharedArray is a memory space that can be shared between VUs. By loading the file into a SharedArray and reading the attributes from it, memory usage drops significantly and the OOM can be avoided.
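As a sketch of the fix, loading per-user attributes through a SharedArray might look like this; the file name and the way a VU picks its record are illustrative.

```javascript
import { SharedArray } from 'k6/data';
import exec from 'k6/execution';

// The file is parsed once and the resulting array is shared by all VUs,
// instead of every VU holding its own copy of 500,000 records.
const users = new SharedArray('user-attributes', function () {
  return JSON.parse(open('./users.json'));   // illustrative file name
});

export default function () {
  // Each VU picks an attribute record based on its ID (illustrative).
  const user = users[exec.vu.idInTest % users.length];
  // ... drive the scenario according to the user's attributes
}
```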
Prometheus load spikes
There are two main points where the load on Prometheus increases:
while k6 is running, and when Grafana issues queries.
If k6 is sending more metrics than Prometheus can handle, the standard fix is to review Prometheus's CPU and memory resources.
In that state, the k6 logs should also contain a large number of remote-write failure messages, so check them.
It can be difficult to pinpoint the cause of Prometheus not responding when displaying graphs in Grafana, but the most common cause I've seen is insufficient grouping of metrics sent by k6.
As described in the article below, k6 assigns tags to each request (that is, each request sent to the system under load) by default.
This tag information is then sent to Prometheus along with the metrics data.
Prometheus stores these tags as labels and uses them to group and aggregate the data behind Grafana queries, so querying a large number of ungrouped metrics can cause high load.
In this case, the problem was solved by setting the name tag to an appropriately grouped value for each request.
By default, name is set to the request URL, but since the URL contained a user ID, Prometheus treated each user's URL as a distinct series, which is likely why it could not respond to Grafana queries that group by name.
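In k6 terms, the fix is to set the name tag explicitly per request so that user-specific URLs collapse into one series; the URL and tag value below are illustrative.

```javascript
import http from 'k6/http';

export default function () {
  const userId = __VU;   // illustrative: one user ID per VU
  // Without an explicit name tag, every user-specific URL becomes its own
  // time series in Prometheus; grouping them keeps label cardinality low.
  http.get(`https://example.com/api/users/${userId}/profile`, {
    tags: { name: 'GET /api/users/:id/profile' },
  });
}
```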
What's good about using k6
k6-operator has a wide range of features.
At the time, I was wondering how to deliver the scenario to the k6 Pods, when a new version of k6-operator was released that implemented the initContainer spec in TestRun (which was still called the K6 resource back then).
Additionally, we originally had to build k6 with an extension to support Prometheus remote write, but this is now provided as a core feature as well.
It is frequently updated, so more and more features will be added in the future.
Areas for improvement
When performing this multi-region load test, I wanted to synchronize the test start timing across regions, but there seemed to be no built-in synchronization feature, so I had to force the timing by deploying each region's test at the right moment.
It would be perfect if the test start date and time could be set as a runtime option.
Conclusion
We introduced multi-region load testing using k6.
I hope this article is helpful to you.
SRG is looking for people to work with us.
If you're interested, please contact us here.