Load testing was performed using the production environment.

My name is Taninari and I work in the Service Reliability Group (SRG) of the Technology Headquarters.
SRG (Service Reliability Group) mainly provides cross-functional support for the infrastructure of our media services: improving existing services, launching new ones, contributing to OSS, and more.
This article describes my experience conducting load testing to understand system capacity during a feature switchover.
 
 

About this article


While revamping a certain system, we were running a project to build new functionality and have users switch over from the old functionality to the new.
Because many services depended on it, we asked each service to switch over in turn.
The first few services switched over without any problems. However, after several more services had completed the switchover, certain services began to see increased latency and error rates during peak usage periods.
We decided to conduct load testing to investigate and resolve the increased error rates and latency.

Project Progress


Okay, let’s create a load test environment.

We decided to start by testing in a readily available development environment.
We used Locust to generate the load.
For the load scenarios, we measured the proportion of traffic that each endpoint received in the real system and prepared scenarios that hit each endpoint in similar proportions.

Load testing in the development environment

As background, the system in question runs on Kubernetes.
When we applied load in the development environment, errors occurred.
The system is written in Java, so to investigate the error we took a thread dump, identified where processing was getting stuck, and fixed the suspicious part.
We then ran the load test again to check whether the problem remained after the fix.
The error no longer occurred, but RPS stopped increasing beyond a certain load.
When we investigated the cause, we found that a system in a separate environment that the system under test communicates with had become the bottleneck, so the system was not receiving sufficient load.
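When reading a thread dump, a quick first pass is to count thread states and look for clusters of BLOCKED threads. A minimal sketch of that kind of triage, assuming a jstack-format dump (the sample text here is fabricated for illustration):

```python
import re
from collections import Counter

def summarize_thread_states(dump_text: str) -> Counter:
    """Count thread states in a jstack-style thread dump."""
    # jstack prints lines like: java.lang.Thread.State: BLOCKED (on object monitor)
    states = re.findall(r"java\.lang\.Thread\.State: (\w+)", dump_text)
    return Counter(states)

# Fabricated sample dump for illustration only
sample = """\
"worker-1" #12 prio=5
   java.lang.Thread.State: BLOCKED (on object monitor)
"worker-2" #13 prio=5
   java.lang.Thread.State: RUNNABLE
"worker-3" #14 prio=5
   java.lang.Thread.State: BLOCKED (on object monitor)
"""

print(summarize_thread_states(sample))  # Counter({'BLOCKED': 2, 'RUNNABLE': 1})
```

A high BLOCKED count points at lock contention; the stack frames under those threads show which monitor they are waiting on.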

After completing load testing in the development environment

To apply end-to-end load in the development environment, we would also have needed to scale up those systems in the separate environments.
There are several such systems, and scaling them up incurs both the work of expansion and the cost of the additional resources.
Moreover, expanding multiple separate environments would have taken considerable time.

Judging whether to use a production environment for testing

The part that had been causing the error was fixed and verified through testing in the development environment, so we expected that requests would now be processed correctly. However, if the system returned errors when it received an equivalent load in production, it would inconvenience the services that still needed to switch over, so we had to confirm that the requests could actually be handled.
As mentioned above, it was difficult to apply sufficient load in the development environment, so we decided to apply load directly to the production environment outside of peak hours.

Preparing for the production test

Of course, when using the production environment, it is unacceptable for the service to go down under excessive load.
So, to understand the state of our services more accurately, we reviewed the metrics we normally collect, created the dashboards we felt were necessary, and added detailed logging.
This made it possible to grasp the service's status more quickly and clearly than before.

Let's load test

From the logs and metrics we normally collect, we knew the number of requests at the time the initial errors occurred. Based on that, we estimated the maximum number of requests the system would ultimately receive, and decided on the load to apply with some headroom on top.
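The sizing itself is simple arithmetic. A sketch with made-up numbers (the observed peak, growth factor, and margin are all hypothetical):

```python
import math

# Hypothetical figures: estimate the target load from the observed peak,
# the growth expected once all remaining services switch over, and a margin.
observed_peak_rps = 1200       # RPS when the initial errors appeared (example value)
remaining_growth_factor = 1.5  # traffic expected after all switchovers (example value)
safety_margin = 1.2            # extra headroom on top of the estimate

target_rps = math.ceil(observed_peak_rps * remaining_growth_factor * safety_margin)
print(target_rps)  # 2160
```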
Once everything was ready, we carried out the actual load test.
The team got together during off-peak hours and applied load from the load test environment we had built.
When we applied the load, we found that the backend service for the relevant function was running out of resources.
We increased those resources (the service in question is an older one that runs on a VM and has no auto-scaling or similar features) and applied load again, and confirmed that even at the expected number of requests, the error rate did not rise and requests were returned normally.

Review after the production load test


The need for this round of load testing arose because the capacity of each system was not understood.
The main reason was that there was no documentation of how many requests could be handled given a certain amount of resources. (With auto-scaling systems in common use these days, there may be fewer occasions to worry about this.)
With that in mind, we first ran load tests in the development environment to verify how many requests the target service could handle per pod, and whether throughput scaled as the service was scaled out. We then documented the results.
Carrying out the load testing also surfaced other issues and improvements. Based on those findings, we met as a team to prioritize the issues and addressed them in order.
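Once per-pod capacity is documented and throughput is known to scale roughly linearly, capacity planning reduces to a one-line calculation. A sketch with hypothetical numbers:

```python
import math

# Hypothetical figures: per-pod capacity measured in the dev environment,
# and a target peak load estimated from production traffic.
per_pod_rps = 300    # requests/s one pod handled in testing (example value)
target_rps = 2000    # estimated peak load (example value)

# Assuming roughly linear scaling, round up to the next whole pod.
pods_needed = math.ceil(target_rps / per_pod_rps)
print(pods_needed)  # 7
```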

Conclusion


In this article, we looked back on how we planned and carried out load testing with the limited time and resources available.
We could not tell how much load the system could withstand, for reasons such as the following:
  • No load testing was done when the system was first built
  • Load testing was performed, but the results were never documented
  • The results were documented, but the documents were lost when we migrated documentation tools
Through this project, by fixing the bottlenecks found in load testing, we were able to produce documentation that clearly states how much performance the system achieves with a given amount of resources.
For now we store these documents in a repository shared by the whole team, but there are still problems: documents may not be handed over when teams change, or may be lost when we switch documentation tools.
This is an issue we will need to keep thinking about.
 
SRG is looking for people to work with us. If you are interested, please contact us here.
 
SRG runs a podcast where we chat about the latest hot topics in IT and books. We hope you will enjoy listening to it while you work.