Multi-region load testing using k6
This is Ohara (@No_oLimits) from the Service Reliability Group (SRG) of the Media Headquarters.
SRG (Service Reliability Group) mainly provides cross-cutting support for the infrastructure of our media services: improving existing services, launching new ones, contributing to OSS, and so on.
In this article, we will introduce a multi-region load test using k6 (k6-operator) conducted on a certain overseas service.
- Introduction
- Load Test Environment Architecture
- Overview
- Details - Scenario Execution
- More Info - Monitoring
- Troubleshooting
- k6 won't start
- k6 died peacefully from OOM
- Prometheus load spikes
- What I liked about using k6
- Areas for improvement
- Conclusion
Introduction
k6 is a load testing tool developed by Grafana.
Scenarios are written in JavaScript/TypeScript.
In the service I'm in charge of, I only use it for API testing, but it can also be used for browser testing.
It supports WebSocket and gRPC in addition to HTTP.
Although it offers sufficient functionality in its basic form, it also has many extensions available to handle more complex load scenarios.
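To give a quick feel for what a scenario looks like, here is a minimal sketch; the URL, VU count, and threshold are placeholders, not values from the actual test:

```javascript
// Minimal k6 scenario sketch; the target URL and numbers are placeholders.
import http from 'k6/http';
import { check, sleep } from 'k6';

export const options = {
  vus: 50,                 // number of virtual users
  duration: '1m',          // total test duration
  thresholds: {
    http_req_duration: ['p(95)<500'], // fail the run if p95 latency exceeds 500ms
  },
};

export default function () {
  const res = http.get('https://test.example.com/api/health');
  check(res, { 'status is 200': (r) => r.status === 200 });
  sleep(1);
}
```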
Load Test Environment Architecture
Overview

We deploy k6 in three regions and apply load to the target systems in the corresponding regions.
The k6-operator runs on GKE Autopilot.
I will leave out the details of the target system to be loaded.
Details - Scenario Execution
Installing k6-operator gives you a custom resource called TestRun.
Runner
Load scenarios are uploaded to GCS in advance.
When a k6 Pod starts, an initContainer fetches the scenario from GCS and places it in an emptyDir volume, from which the k6 container reads it.
The Pods are granted access to GCS through Workload Identity, which binds the Kubernetes Service Account (KSA) to a Google Service Account (GSA).
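A rough sketch of such a TestRun manifest follows, assuming a recent k6-operator version; the bucket, file names, image, and service account name are placeholders:

```yaml
# Sketch of a TestRun that pulls the scenario from GCS via an initContainer.
# Bucket, paths, and the KSA name are placeholders.
apiVersion: k6.io/v1alpha1
kind: TestRun
metadata:
  name: multi-region-loadtest
spec:
  parallelism: 10                      # number of runner Pods
  script:
    localFile: /scripts/scenario.js    # path inside the runner container
  runner:
    serviceAccountName: k6-runner      # KSA bound to a GSA via Workload Identity
    initContainers:
      - name: fetch-scenario
        image: google/cloud-sdk:slim
        command: ["sh", "-c", "gsutil cp gs://my-loadtest-bucket/scenario.js /scripts/"]
        volumeMounts:
          - name: scripts
            mountPath: /scripts
    volumes:
      - name: scripts
        emptyDir: {}
    volumeMounts:
      - name: scripts
        mountPath: /scripts
```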
More Info - Monitoring
k6 prints a summary of its stats when a test finishes, but for distributed tests you need to aggregate the metrics yourself.
This time we built the monitoring environment with Prometheus + Grafana.
The reason is that k6 supports sending metrics via Prometheus remote write out of the box; sending metrics to other backends requires extensions.
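Concretely, enabling the remote write output is just a matter of selecting the output and pointing it at the regional Prometheus. A sketch of the relevant part of the TestRun; the Prometheus URL is a placeholder:

```yaml
# Sketch: enabling k6's built-in Prometheus remote write output in a TestRun.
# The Prometheus service URL is a placeholder.
spec:
  arguments: -o experimental-prometheus-rw
  runner:
    env:
      - name: K6_PROMETHEUS_RW_SERVER_URL
        value: http://prometheus.monitoring.svc:9090/api/v1/write
      - name: K6_PROMETHEUS_RW_TREND_STATS
        value: "p(95),p(99),min,max"
```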

- Deploy Prometheus in each region
- k6 sends metrics to Prometheus in the region
- The Grafana data source is thanos-query
- Thanos Query reaches each region's Prometheus API through the Thanos Sidecar
- The internal LB is made accessible across clusters (*1)
The key point is that we use Thanos to aggregate the cross-region Prometheus metrics.
Thanos is a component for scaling Prometheus and has many other features.
This allows you to aggregate and monitor metrics across multiple regions.
*1) The internal LB is made accessible between clusters.
Thanos Query is configured to send its requests to the LB IPs.
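As a minimal sketch, the Thanos Query container just lists the per-region sidecar endpoints as stores; flag names may differ slightly between Thanos versions, and the LB IPs are placeholders:

```yaml
# Sketch: Thanos Query pointing at each region's Thanos Sidecar through internal LBs.
containers:
  - name: thanos-query
    image: quay.io/thanos/thanos:v0.34.0
    args:
      - query
      - --http-address=0.0.0.0:9090
      - --grpc-address=0.0.0.0:10901
      - --store=10.0.1.10:10901   # region A sidecar via internal LB
      - --store=10.0.2.10:10901   # region B sidecar via internal LB
      - --store=10.0.3.10:10901   # region C sidecar via internal LB
```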
Troubleshooting
k6 won't start
This is especially frustrating the first time you run a test.
Since nothing is running yet, there are no Pod logs to look at.
However, the k6-operator manager does write logs, so if nothing starts, check the manager's log first.
The most common reason a test fails to start is a mistake in the manifest.
k6 died peacefully from OOM
This time, the test was run with 500,000 virtual users.
I had defined several attributes for each user and used those attributes to control each user's behavior during the test; a while after the test started, the Pods began dying one after another from OOM.
The attribute definitions were stored in a file that was loaded at startup, and it appears the OOM happened because every VU in the k6 process loaded the file for all 500,000 users into its own memory.
SharedArray
SharedArray is a memory area that can be shared between VUs. By loading the file into a SharedArray, memory usage can be reduced significantly and the OOM avoided.
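A minimal sketch of the fix, assuming the attributes live in a hypothetical users.json:

```javascript
// Loading per-user attributes once with SharedArray instead of per VU.
// The file name and structure (users.json) are assumptions for illustration.
import { SharedArray } from 'k6/data';
import http from 'k6/http';

// Init code runs in every VU, but SharedArray parses the file only once
// and shares the result read-only across all VUs.
const users = new SharedArray('users', function () {
  return JSON.parse(open('./users.json')); // e.g. [{ id: 1, plan: 'free' }, ...]
});

export default function () {
  // __VU is 1-based; pick this VU's attributes from the shared array.
  const user = users[(__VU - 1) % users.length];
  http.get(`https://test.example.com/api/users/${user.id}`);
}
```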
Prometheus load spikes
There are broadly two moments when the load spikes:
while k6 is running, and when Grafana issues queries.
If k6 is sending more metrics than Prometheus can handle while the test is running, the standard first step is to review Prometheus's CPU and memory resources.
In that state, the k6 log should also contain a large number of messages about failed metric writes, so check those as well.
It is harder to pinpoint why Prometheus stops responding when graphs are displayed in Grafana, but the most common cause I have seen is insufficient grouping of the metrics sent by k6.
As described in the article below, k6 attaches tags to every request (that is, every request sent to the target system under load) by default.
That tag information is sent to Prometheus along with the metric data.
Prometheus stores a separate time series for every unique combination of labels, so Grafana queries over a large number of ungrouped metrics can generate a very heavy load.
In this case, the problem was solved by setting an appropriate value for the name tag on each request.
By default, name is set to the request URL; since our URLs contain a user ID, Prometheus ended up with a distinct series per user, which is presumably why it could no longer answer Grafana queries that group by name.
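A minimal sketch of the grouping, with a hypothetical endpoint containing a user ID:

```javascript
// Grouping requests under one `name` tag so per-user URLs do not explode
// label cardinality in Prometheus. The endpoint and userId are hypothetical.
import http from 'k6/http';

export default function () {
  const userId = __VU; // stand-in for a real user ID

  // Without a name tag, each URL becomes its own series: /api/users/1, /api/users/2, ...
  // Setting tags.name collapses them into a single grouped metric.
  http.get(`https://test.example.com/api/users/${userId}`, {
    tags: { name: 'GET /api/users/:id' },
  });

  // Alternatively, the http.url template helper produces the same grouping:
  // http.get(http.url`https://test.example.com/api/users/${userId}`);
}
```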
What I liked about using k6
k6-operator has many features
At the time, I was wondering how to get the scenario into the k6 Pod, when a new version of k6-operator was released that added the initContainer spec to TestRun (which was still the K6 resource back then).
We also originally had to build k6 with an extension to support Prometheus remote write, but that too is now provided as a core feature.
The project is updated frequently, so I expect more features to be added going forward.
Areas for improvement
When running this multi-region load test, I wanted to synchronize the test start time across regions, but there seemed to be no built-in way to do so, so I had to time the deployments by hand to keep them in sync.
It would be perfect if the test start date and time could be specified as a run-time option.
Conclusion
We introduced multi-region load testing using k6.
I hope this article is helpful to you.
SRG is looking for people to work with us. If you are interested, please contact us here.