Multi-region load testing using k6
This is Ohara (@No_oLimits) from the Service Reliability Group (SRG) of the Media Headquarters.
#SRG (Service Reliability Group) is a group that mainly provides cross-cutting infrastructure support for our media services, improving existing services, launching new ones, and contributing to OSS.
In this article, we will introduce a multi-region load test using k6 (k6-operator) conducted on a certain overseas service.
- Introduction
- Load Testing Environment Architecture
  - Overview
  - Details - Scenario Execution
  - Details - Monitoring
- Troubleshooting
  - k6 won't start
  - k6 dies from OOM
  - Prometheus load spikes
- What's good about using k6
- Areas for improvement
- Conclusion
Introduction
- k6 is a load testing tool developed by Grafana.
- Scenarios are written in JavaScript/TypeScript (see the minimal example after this list).
- In the service I'm in charge of, it is only used for API testing, but it can also be used for browser testing.
- It supports WebSocket and gRPC protocols in addition to HTTP.
- The built-in functionality is sufficient on its own, and many extensions are available for handling more complex load scenarios.
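As a quick reference, a minimal k6 scenario looks roughly like this; the endpoint, VU count, and duration are illustrative placeholders, not values from this test.

```javascript
import http from 'k6/http';
import { check, sleep } from 'k6';

export const options = {
  vus: 50,          // number of virtual users (illustrative)
  duration: '1m',   // test duration (illustrative)
};

export default function () {
  // Placeholder endpoint; each VU runs this function in a loop.
  const res = http.get('https://example.com/api/health');
  check(res, { 'status is 200': (r) => r.status === 200 });
  sleep(1);
}
```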
Load Testing Environment Architecture
Overview

- k6 is deployed in three regions and applies load to the target system in the corresponding region.
- To run the k6 Operator, each region has a GKE Autopilot cluster.
- Details of the target system under load are omitted here.
Details - Scenario Execution
Installing k6-operator gives you a custom resource called TestRun.
runner
Load scenarios are uploaded to GCS in advance.
When a k6 Pod starts, an initContainer fetches the scenario from GCS and places it in an emptyDir volume, and the k6 container reads it from there.
Pods are granted access to GCS by linking a GSA and a KSA through Workload Identity.
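As a rough sketch, a TestRun that pulls the scenario from GCS through an initContainer might look like the following. The bucket, paths, image, and service account name are placeholders, and the field names follow the k6-operator TestRun spec as I understand it.

```yaml
apiVersion: k6.io/v1alpha1
kind: TestRun
metadata:
  name: load-test                       # placeholder name
spec:
  parallelism: 10                       # number of runner Pods (illustrative)
  script:
    localFile: /scripts/scenario.js     # scenario path inside the shared emptyDir
  runner:
    serviceAccountName: k6-runner       # KSA linked to a GSA via Workload Identity (placeholder)
    initContainers:
      - name: fetch-scenario
        image: gcr.io/google.com/cloudsdktool/cloud-sdk:slim   # any image with gsutil works
        command: ["gsutil", "cp", "gs://<scenario-bucket>/scenario.js", "/scripts/scenario.js"]
        volumeMounts:
          - name: scripts
            mountPath: /scripts
    volumeMounts:
      - name: scripts
        mountPath: /scripts
    volumes:
      - name: scripts
        emptyDir: {}
```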
Details - Monitoring
k6 prints summary stats when a test finishes, but for distributed tests the metrics need to be aggregated somewhere.
This time we built the monitoring environment with Prometheus + Grafana.
The reason is that k6 supports sending metrics via Prometheus remote write out of the box, whereas sending metrics to most other backends requires building k6 with an extension.
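For reference, enabling the built-in remote-write output in a TestRun looks roughly like this; the Prometheus URL is a placeholder for the in-region Prometheus endpoint.

```yaml
spec:
  # Ask k6 to stream metrics to Prometheus via remote write.
  arguments: -o experimental-prometheus-rw
  runner:
    env:
      - name: K6_PROMETHEUS_RW_SERVER_URL
        value: http://prometheus.monitoring.svc.cluster.local:9090/api/v1/write   # placeholder URL
```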

- Prometheus is deployed in each region
- k6 sends metrics to the Prometheus in its region
- The Grafana data source is thanos-query
- Thanos Query reaches each Prometheus API through the Thanos sidecar
- The internal LB is made accessible across clusters (*1)
The key point is that we use Thanos to aggregate Prometheus metrics across regions.
Thanos is a set of components for scaling Prometheus and has many other features.
This allows metrics from multiple regions to be aggregated and monitored in one place.
*1) Making the internal LB accessible across clusters
Thanos Query is configured to send its requests to the LB IP.
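As an illustration of the footnote above, the thanos-query container can be pointed at each region's sidecar through the internal LB addresses; the IPs and version below are placeholders.

```yaml
containers:
  - name: thanos-query
    image: quay.io/thanos/thanos:<version>    # placeholder version
    args:
      - query
      - --store=10.0.1.10:10901   # region A sidecar gRPC endpoint via internal LB (placeholder IP)
      - --store=10.0.2.10:10901   # region B (placeholder IP)
      - --store=10.0.3.10:10901   # region C (placeholder IP)
```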
Troubleshooting
k6 won't start
This is an especially easy trap to fall into the first time you run it.
Since basically nothing starts, there are no Pod logs to look at.
In fact, though, the k6-operator manager is emitting logs.
If nothing starts, check the manager's logs.
The most common reason it fails to start is a mistake in the manifest.
k6 dies from OOM
This time, the test was run with 500,000 virtual users.
The plan was to define several attributes per user and control each user's behavior during the test based on those attributes, but shortly after the test started, the Pods began dying one after another from OOM.
The attribute definitions were stored in a file that was read at startup, before the test ran. It appears the OOM occurred because every VU in the k6 process read the 500,000-user file into its own memory.
SharedArray
SharedArray is a memory space that can be shared between VUs. By loading the file into a SharedArray and reading the attributes from it, memory usage drops significantly and the OOM can be avoided.
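As a sketch of the fix, loading per-user attributes through a SharedArray might look like this; the file name and the way a VU picks its record are illustrative.

```javascript
import { SharedArray } from 'k6/data';
import exec from 'k6/execution';

// The file is parsed once and the resulting array is shared by all VUs,
// instead of every VU holding its own copy of 500,000 records.
const users = new SharedArray('user-attributes', function () {
  return JSON.parse(open('./users.json'));   // illustrative file name
});

export default function () {
  // Each VU picks an attribute record based on its ID (illustrative).
  const user = users[exec.vu.idInTest % users.length];
  // ... drive the scenario according to the user's attributes
}
```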
Prometheus load spikes
There are two main points where the load on Prometheus increases:
while k6 is running, and when Grafana issues queries.
If k6 is sending more metrics than Prometheus can handle, the standard fix is to review Prometheus's CPU and memory resources.
In that state, the k6 logs should also contain a large number of remote-write failure messages, so check them.
It can be difficult to pinpoint the cause of Prometheus not responding when displaying graphs in Grafana, but the most common cause I've seen is insufficient grouping of metrics sent by k6.
As described in the article below, k6 assigns tags to each request (that is, each request sent to the system under load) by default.
This tag information is then sent to Prometheus along with the metrics data.
Prometheus stores these tags as labels and uses them to group and aggregate the data behind Grafana queries, so querying a large number of ungrouped metrics can cause high load.
In this case, the problem was solved by setting the name tag to an appropriately grouped value for each request.
By default, name is set to the request URL, but since the URL contained a user ID, Prometheus treated each user's URL as a distinct series, which is likely why it could not respond to Grafana queries that group by name.
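In k6 terms, the fix is to set the name tag explicitly per request so that user-specific URLs collapse into one series; the URL and tag value below are illustrative.

```javascript
import http from 'k6/http';

export default function () {
  const userId = __VU;   // illustrative: one user ID per VU
  // Without an explicit name tag, every user-specific URL becomes its own
  // time series in Prometheus; grouping them keeps label cardinality low.
  http.get(`https://example.com/api/users/${userId}/profile`, {
    tags: { name: 'GET /api/users/:id/profile' },
  });
}
```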
What's good about using k6
k6-operator has a wide range of features.
At the time, I was wondering how to deliver the scenario to the k6 Pods, when a new version of k6-operator was released that implemented the initContainer spec in TestRun (which was still called the K6 resource back then).
Additionally, we originally had to build k6 with an extension to support Prometheus remote write, but this is now provided as a core feature as well.
It is frequently updated, so more and more features will be added in the future.
Areas for improvement
When performing this multi-region load test, I wanted to synchronize the test start timing across regions, but there seemed to be no built-in synchronization feature, so I had to force the timing by deploying each region's test at the right moment.
It would be perfect if the test start date and time could be set as a runtime option.
Conclusion
We introduced multi-region load testing using k6.
I hope this article is helpful to you.
SRG is looking for people to work with us.
If you're interested, please contact us here.