We performed load testing using the production environment.
This is Taniguchi from the Service Reliability Group (SRG) of the Technology Division.
The Service Reliability Group (SRG) primarily provides comprehensive support for the infrastructure behind our media services, focusing on improving existing services, launching new ones, and contributing to open-source software (OSS).
This article describes my experience running load tests to understand system capacity during a feature switchover.
About this article
During a system overhaul, a project was underway to create new features and facilitate the transition from old features to the new ones.
Since many services depended on the old features, we asked each service to switch over in turn and proceeded with the transition gradually.
The first few services switched over without any problems. After several more services had completed their switchover, however, latency and error rates rose whenever certain services reached their traffic peaks.
Therefore, we decided to conduct load testing to investigate and resolve the increase in error rate and latency.
Project progress
Okay, let's create a load testing environment.
We decided to start by testing in a readily available development environment.
To create the load scenarios, we measured what proportion of real traffic each endpoint received, and prepared scenarios that hit each endpoint at roughly the same ratio.
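The idea above can be sketched roughly as follows. This is a minimal illustration, not the tool we actually used: the endpoint paths and counts are made-up stand-ins for what you would extract from real access logs.

```python
from collections import Counter

# Hypothetical access-log sample; in practice, parse your real logs.
requests = [
    "/search", "/search", "/search", "/items", "/items", "/users",
    "/search", "/items", "/users", "/search",
]

counts = Counter(requests)
total = sum(counts.values())

# Weight each endpoint by its share of real traffic, then convert the
# shares into per-endpoint request rates for a target overall RPS.
target_rps = 100
scenario = {path: round(target_rps * n / total) for path, n in counts.items()}
print(scenario)  # {'/search': 50, '/items': 30, '/users': 20}
```

Keeping the scenario proportional to real traffic matters because a uniform mix can over-stress rarely used endpoints while under-stressing the hot paths that actually cause incidents.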
Load testing in the development environment
As a prerequisite, the system in question is running on Kubernetes.
An error occurred when we applied load in the development environment.
The system is written in Java, so to investigate the error we captured a thread dump, examined where the threads were getting stuck, and fixed the suspicious code.
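One quick way to triage a thread dump is to summarize the thread states before reading it line by line. Here is a small sketch of that step; the sample dump text is fabricated, and in practice you would feed in the output of `jstack <pid>` (or `jcmd <pid> Thread.print`).

```python
import re
from collections import Counter

# Made-up jstack-style excerpt; replace with a real thread dump.
dump = """\
"http-nio-8080-exec-1" #42 daemon prio=5
   java.lang.Thread.State: BLOCKED (on object monitor)
"http-nio-8080-exec-2" #43 daemon prio=5
   java.lang.Thread.State: RUNNABLE
"http-nio-8080-exec-3" #44 daemon prio=5
   java.lang.Thread.State: BLOCKED (on object monitor)
"""

# Count how many threads are in each state; a pile-up of BLOCKED or
# WAITING threads usually points at the contended lock or slow call.
states = Counter(re.findall(r"java\.lang\.Thread\.State: (\w+)", dump))
print(states.most_common())  # [('BLOCKED', 2), ('RUNNABLE', 1)]
```

If most request-handling threads are BLOCKED on the same monitor, the stack frames below that line in the dump typically identify the contended code path.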
We performed another load test to confirm that there were no problems after the system modifications.
After the fix the errors stopped occurring, but RPS plateaued once the load passed a certain level. Investigating the cause, we found that a system in a different environment, with which the target system communicates, was the bottleneck, so the target system could not receive sufficient load.
After completing load testing in the development environment
To apply a consistent load in the development environment, we would also have to scale up systems in other environments.
There is not just one such system but several, and scaling them up would cost both the work of expansion and the additional resources themselves.
Provisioning multiple separate environments would also take a considerable amount of time.
Testing using the production environment
The testing in the development environment had fixed the part that was erroring, so in theory the system should now handle the requests. However, if it returned errors again under a similar real load, it would cause trouble for the services still waiting to switch over, so we needed to confirm that it could actually handle the traffic.
As mentioned earlier, it was difficult to apply sufficient load in the development environment, so we decided to test by applying actual load to the production environment outside of its peak hours.
Preparing for the production test
Of course, when using the production environment, it is essential that the service does not go down due to the load it experiences.
Therefore, in order to get a more accurate understanding of the service status, we reviewed the metrics we normally collect, created additional dashboards that seemed necessary, and started outputting more detailed logs.
This made it quicker and easier than before to grasp the state of our services.
Let's do a load test
From the logs and metrics we normally collect, we determined the request volume at the time of the original errors. Based on that, we estimated the likely peak request rate, added a little extra on top, and used that as the target load.
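The estimate boils down to simple arithmetic. The figures below are hypothetical placeholders, not the article's actual numbers; the point is only the shape of the calculation.

```python
# Hypothetical figures; in practice these come from your logs and metrics.
peak_rps_observed = 1200   # RPS around the time the errors first appeared
safety_margin = 1.3        # "a little extra" on top of the observed peak

target_rps = round(peak_rps_observed * safety_margin)
print(target_rps)  # 1560
```

Basing the target on the observed error-time peak, rather than on average traffic, ensures the test reproduces the conditions that actually broke the system.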
Now that everything is ready, we will proceed with the load test.
The team gathered during off-peak hours and ran the load test we had prepared against the production environment.
When we actually applied the load, we found that the backend service for the feature in question was running close to resource exhaustion.
After increasing those resources (the backend is an older service running on VMs, with no autoscaling), we ran the load test again and confirmed that, even at the expected request rate, the error rate did not rise and requests were returned successfully.
Review after conducting load testing in the actual production environment
The reason we needed to conduct load testing this time was because we didn't know the capacity of each system.
The main reason was that there was no documentation to determine how many requests could be handled with the amount of resources allocated. (Although, with the recent use of auto-scaling systems, this might be less of a concern these days.)
With that in mind, this time we first ran load tests in the development environment to verify how many requests the target service could handle with a single pod, and whether that capacity scaled proportionally as pods were added. We also took the time to document these findings.
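Once per-pod capacity is documented, sizing for a given peak becomes a one-line calculation. The numbers below are assumptions for illustration; measure your own per-pod throughput and scaling behavior.

```python
import math

# Hypothetical numbers; measure per-pod capacity in your own load tests.
per_pod_rps = 250         # sustained RPS one pod handled in the dev test
target_rps = 1560         # expected production peak, including margin
scaling_efficiency = 0.9  # assume throughput scales slightly sub-linearly

pods_needed = math.ceil(target_rps / (per_pod_rps * scaling_efficiency))
print(pods_needed)  # 7
```

The efficiency factor reflects that adding pods rarely yields perfectly linear gains (shared databases, connection pools, and downstream services impose their own limits), which is exactly why the scaling behavior is worth verifying and writing down.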
Furthermore, the load testing revealed some insights and allowed us to identify areas for improvement. Based on these findings, we held a team meeting to prioritize the issues that needed addressing and then addressed them sequentially.
In conclusion
This time, I reflected on how we considered and executed load testing given the limited time and resources available.
We were unable to determine how much load the system could withstand, for the following reasons:
- No load testing was performed when the system was built.
- Load testing was conducted, but no documentation was kept.
- Documentation existed, but was lost when the documentation tool was migrated.
Through this project, after fixing the bottlenecks uncovered by load testing, we produced documentation that clearly states how much performance the system can deliver with a given amount of resources.
We currently store the documents in a shared repository that everyone on the team knows about. Even so, problems remain: documents may not be handed over when team membership changes, and they may be lost when we switch documentation tools again.
It seems we will need to continue thinking about the issues that lie ahead.
SRG is looking for new team members.
If you are interested, please contact us here.
SRG runs a podcast where we chat about the latest hot IT technologies and books. We hope you'll enjoy listening to it while you work.
