Load testing using the production environment
My name is Taninari and I work in the Service Reliability Group (SRG) of the Technology Headquarters.
SRG (Service Reliability Group) is a team that mainly provides cross-cutting support for the infrastructure of our media services, improves existing services, launches new ones, and contributes to OSS.
This article documents my experience conducting load testing to understand system capacity during a feature switchover.
- About this article
- Project Progress
- Okay, let's create a load testing environment.
- Load testing in the development environment
- After completing the load test in the development environment
- Deciding to test in the production environment
- Preparing for the production load test
- Let's load test
- Review after the production load test
- Conclusion
About this article
While revamping a certain system, we were running a project to build new functionality and have users migrate from the old functionality to the new one.
Because many services use this system, we asked each service to switch over in turn and proceeded with the migration.
The first few services switched over without any issues, but after a certain number of services had completed the switchover, latency and error rates began to rise when some of them hit their peak traffic.
So we decided to conduct load testing to investigate and resolve the increased error rate and latency.
Project Progress
Okay, let's create a load testing environment.
We decided to start by testing in a readily available development environment.
For the load scenario, we calculated what proportion of requests each endpoint received in the real system and prepared a scenario that hits each endpoint at a similar ratio.
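The article does not name the load-testing tool, but the weighting idea can be sketched in plain Java with the JDK's HttpClient. The endpoints, weights, and base URL below are hypothetical placeholders, not the actual ones; a real scenario would derive the ratios from the production access logs.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

// Minimal weighted load-scenario sketch (hypothetical endpoints and ratios).
public class WeightedLoadScenario {

    record Endpoint(String path, double weight) {}

    static final List<Endpoint> ENDPOINTS = List.of(
            new Endpoint("/api/items", 0.6),        // 60% of observed traffic (made up)
            new Endpoint("/api/items/detail", 0.3), // 30%
            new Endpoint("/api/users/me", 0.1));    // 10%

    static final String BASE_URL = "https://dev.example.internal"; // assumed dev host

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        for (int i = 0; i < 1_000; i++) {
            String path = pickWeighted();
            HttpRequest req = HttpRequest.newBuilder(URI.create(BASE_URL + path)).GET().build();
            HttpResponse<Void> res = client.send(req, HttpResponse.BodyHandlers.discarding());
            System.out.printf("%s -> %d%n", path, res.statusCode());
        }
    }

    // Pick an endpoint with probability proportional to its share of real traffic.
    static String pickWeighted() {
        double r = ThreadLocalRandom.current().nextDouble();
        double acc = 0;
        for (Endpoint e : ENDPOINTS) {
            acc += e.weight;
            if (r < acc) return e.path();
        }
        return ENDPOINTS.get(ENDPOINTS.size() - 1).path();
    }
}
```

In practice a dedicated tool handles concurrency, ramp-up, and reporting; the point here is only how the per-endpoint ratios translate into a scenario.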
Load testing in the development environment
The premise is that the system in question is operated on Kubernetes.
When we applied load in the development environment, errors occurred.
The system in question is written in Java, so to investigate we took a thread dump, looked for where processing was getting stuck, and fixed the suspicious part.
We then ran the load test again to check whether the problem remained after the fix.
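The post does not show exactly how the dump was captured; the usual way is to run `jstack <pid>` or `jcmd <pid> Thread.print` against the Java process inside the pod. As an illustration of what a thread dump contains, here is a minimal in-process sketch using the JDK's ThreadMXBean:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Minimal sketch: dump the current JVM's threads to stdout.
// Passing (true, true) includes locked monitors and synchronizers, which is
// what reveals many threads blocked on the same lock or connection pool.
public class ThreadDumpExample {
    public static void main(String[] args) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        for (ThreadInfo info : threads.dumpAllThreads(true, true)) {
            System.out.print(info); // thread name, state, and (truncated) stack trace
        }
    }
}
```

Many threads stuck in BLOCKED or WAITING at the same stack frame is the kind of pattern that points to where processing is getting clogged.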
After the fix, the error no longer occurred, but the RPS no longer increased beyond a certain load.
When we investigated the cause, we found that a system in a separate environment, which the system under test communicates with, had become a bottleneck, so not enough load was reaching the system we wanted to test.
After completing the load test in the development environment
To apply sufficient end-to-end load in the development environment, we would also have had to scale up the systems in that separate environment.
There is not just one such system but several, and scaling them up would incur both the work of expanding them and the cost of the additional resources.
Furthermore, expanding multiple separate environments would take a considerable amount of time.
Deciding to test in the production environment
Testing in the development environment had fixed the part that was causing the errors, and we believed the system could now handle the expected requests. However, if it returned errors again once it actually received that load, it would cause trouble for the projects still waiting to switch over, so we needed to confirm that the requests really could be handled.
As mentioned above, it was difficult to apply sufficient load in the development environment, so we decided to apply load directly to the production environment outside of peak hours.
Preparing for the production load test
Of course, when using a production environment, the service must not go down due to excessive load.
So, to get a more accurate picture of the service's state, we reviewed the metrics we normally collect, created the dashboards we felt were missing, and added detailed log output.
This made it possible to grasp the state of the service more quickly and concisely than before.
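The article does not describe the exact log format; as one hedged illustration, a single log line per request with the endpoint, outcome, and latency is the kind of detail that lets error rate and latency be read off at a glance during a test. Everything below (field names, the wrapper) is illustrative, not the actual implementation:

```java
import java.util.function.Supplier;

// Minimal sketch of per-request timing and structured logging.
public class RequestLogging {

    static <T> T timed(String endpoint, Supplier<T> handler) {
        long start = System.nanoTime();
        String outcome = "ok";
        try {
            return handler.get();
        } catch (RuntimeException e) {
            outcome = e.getClass().getSimpleName();
            throw e;
        } finally {
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            // One line per request: easy to aggregate into latency / error-rate panels.
            System.out.printf("endpoint=%s outcome=%s latency_ms=%d%n",
                    endpoint, outcome, elapsedMs);
        }
    }

    public static void main(String[] args) {
        timed("/api/items", () -> "dummy response"); // usage example
    }
}
```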
Let's load test
From the logs and metrics we normally collect, we already knew the request volume at the time the errors that triggered this work first occurred. Based on that, we estimated the maximum number of requests the system would eventually have to receive, and then set the load to apply with a little extra headroom on top.
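The actual figures are not given in the article; purely to illustrate the calculation, with made-up numbers:

```java
// Hypothetical numbers for illustration only; the real figures came from the
// service's own logs and metrics.
public class TargetLoadEstimate {
    public static void main(String[] args) {
        double peakRpsWhenErrorsStarted = 800;  // peak RPS when errors first appeared (made up)
        double remainingSwitchoverFactor = 1.5; // growth once all services have switched over (made up)
        double headroom = 1.2;                  // "a little extra" on top

        double targetRps = peakRpsWhenErrorsStarted * remainingSwitchoverFactor * headroom;
        System.out.printf("target load: %.0f requests/sec%n", targetRps); // -> 1440
    }
}
```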
Now that we were ready, we began the actual load test.
The team got together during off-peak hours and applied load from the load-testing environment we had prepared.
When we actually applied load, we found that the backend service behind the feature in question was running low on resources.
We increased those resources manually (the backend is an older service that runs on VMs and has no features such as auto-scaling) and applied load again. This time we confirmed that even at the expected number of requests, the error rate did not rise and requests were returned normally.
Review after the production load test
The need to conduct load testing this time arose because the capacity of each system was not yet known.
The main reason was that there was no documentation of how many requests the systems could handle with a given amount of resources, so there was no way to know. (These days, with auto-scaling in place, there may be fewer occasions to worry about this.)
With this in mind, we also ran load tests in the development environment to measure how many requests the target service can handle per pod, to verify that throughput scales as pods are added, and to leave that information behind as documentation.
The load testing also revealed some things that could be improved. Based on these findings, the team met to prioritize the issues and worked through them in order.
Conclusion
This time, we looked back on how we thought about and carried out load testing despite limited time and resources.
Due to the following factors, we were unable to determine how much load the system could withstand.
- No load testing was done when the system was first built
- Load testing was done, but no documentation was kept
- Documentation was kept, but it was lost when the documentation tool was migrated
Through this project, after fixing the bottlenecks found by load testing, we produced documentation that clearly states how much performance the system achieves with a given amount of resources.
For now, the documents live in a shared repository that all team members can access, but issues remain, such as documents not being carried over when teams change or being lost when the documentation tool changes.
It seems we need to continue thinking about this issue going forward.
SRG is looking for people to work with us.
If you're interested, please contact us here.
SRG runs a podcast where we chat about the latest hot topics in IT technology and books. We hope you'll listen to it while you work.