How to make the switch to Aurora MySQL Version 3 safer
This is Onkai Yuta (@fat47) from the Service Reliability Group (SRG) of the Media Headquarters.
SRG (Service Reliability Group) mainly provides cross-cutting support for the infrastructure of our media services: improving existing services, launching new ones, contributing to OSS, and so on.
This article explains how to make a more careful switch to Aurora MySQL Version 3.
Is your migration to Aurora MySQL v3 going smoothly?
I hope everyone is making progress in migrating to Aurora MySQL v3. There are now just under five months until the end of standard support.
I thought things were going well, but then I ran into some trouble and things are a bit delayed.
Previous steps to switch to v3
Until now, the switch to v3 has generally been carried out in the following order:
- Create B/G Deployments in the development and staging environments and switch between them to check operation.
- Create B/G Deployments in the production environment
- Verify operation by pointing the production application's read queries at the Green reader endpoint
- Revert to the original reader endpoint
- Execute the B/G switch
Problems occurred after changing the reference destination to v3
After pointing reads at v3, a certain query became roughly 100 times slower.
The service could no longer be kept running, so we switched back.
I will explain the details in another article, but the root cause was that the optimizer on the v3 cluster chose a different index than it did on v2, and the query ended up being executed as a full scan.
Fortunately, this time we had only changed the application's destination, so we were able to switch back quickly.
Switching procedure after the problem
Here is what went well and what went wrong in this transition:
Good points
- We did not perform the B/G switch; we only changed the application's reference to check operation, which made rolling back quick.
What went wrong
- All read queries were directed to the Aurora MySQL Version 3 endpoint at once.
- Although operation was confirmed in the staging environment, we had not compared the execution plans of the v2 and v3 clusters against production data.
Taking this into consideration, for clusters where the user impact is likely to be particularly large, I decided to make the switch in the following way:
- Change long_query_time on the v2 cluster to 0 or 0.1
- Duplicate the v2 cluster with the clone function to create clusters A and B, then upgrade cluster B to v3
- Run pt-upgrade, included in percona-toolkit, against cluster A and cluster B
  - When running it, replay the queries from the slow log collected in (1)
- Create B/G Deployments
- (※) Create records for weighted routing in Route53
  - For example, register the Blue reader endpoint and the Green reader endpoint with a weighting of 10:1
- Point the application's reference at the record created in (5)
  - (※) If weighted records cannot be used due to driver or implementation issues, you can instead point only a portion of the application servers at the Green reader endpoint
- Change the reference back to the original reader endpoint and perform the B/G switch
1) Change long_query_time on the v2 cluster to 0 or 0.1
The goal is to collect as many sample queries as possible from the existing environment.
Since the volume of logged queries becomes very large, we keep this setting only for a fixed period, such as a few hours, before changing it back.
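As a rough sketch, this parameter change can be scripted with boto3; the parameter group name and region below are placeholders, and depending on your setup long_query_time may be managed in the DB cluster parameter group rather than the instance-level group.

```python
import boto3

rds = boto3.client("rds", region_name="ap-northeast-1")

# Hypothetical parameter group name; adjust to your environment.
# long_query_time may instead live in the DB cluster parameter group.
rds.modify_db_parameter_group(
    DBParameterGroupName="v2-cluster-instance-params",
    Parameters=[
        {"ParameterName": "slow_query_log", "ParameterValue": "1",
         "ApplyMethod": "immediate"},
        {"ParameterName": "long_query_time", "ParameterValue": "0",
         "ApplyMethod": "immediate"},
    ],
)
```

Remember to revert the value once the collection window (a few hours in our case) has passed.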
2) Duplicate the v2 cluster using the clone function to create clusters A and B, then upgrade cluster B to v3
This creates the environment for the pt-upgrade verification described later. Cluster B is upgraded in place.
The reason we do not use the B/G Deployments function at this stage is that running pt-upgrade against a production environment is not recommended: if we ran it against an environment created with B/G, the validation queries would be sent to the production Blue cluster.
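A minimal boto3 sketch of the clone and in-place upgrade might look like the following; the cluster identifiers, instance class, and target engine version are placeholders, and waiting for each resource to become available is omitted for brevity.

```python
import boto3

rds = boto3.client("rds", region_name="ap-northeast-1")

# Clone the production v2 cluster twice (copy-on-write), giving clusters A and B.
for clone_id in ("verify-cluster-a", "verify-cluster-b"):
    rds.restore_db_cluster_to_point_in_time(
        DBClusterIdentifier=clone_id,
        SourceDBClusterIdentifier="prod-v2-cluster",
        RestoreType="copy-on-write",
        UseLatestRestorableTime=True,
    )
    # A clone has no instances; add one so the cluster can be queried.
    rds.create_db_instance(
        DBInstanceIdentifier=f"{clone_id}-instance-1",
        DBClusterIdentifier=clone_id,
        DBInstanceClass="db.r6g.large",
        Engine="aurora-mysql",
    )

# In-place major version upgrade of cluster B to Aurora MySQL v3.
# In practice, wait until the clone is available first; the version
# string depends on what is currently offered.
rds.modify_db_cluster(
    DBClusterIdentifier="verify-cluster-b",
    EngineVersion="8.0.mysql_aurora.3.05.2",
    AllowMajorVersionUpgrade=True,
    ApplyImmediately=True,
)
```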
3) Run pt-upgrade included in percona-toolkit on cluster A and cluster B.
For an explanation of pt-upgrade, please refer to the article written by our intern.
Roughly speaking, it replays the same queries on two clusters and compares their results and execution speed.
The queries to replay can be generated from slow logs.
Using this tool, we take the production slow log collected in (1), replay the queries against clusters A and B, and check whether any of them have slowed down significantly.
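A hedged sketch of how the run might be scripted is below; the endpoints, credentials, and log path are placeholders, and the exact pt-upgrade options should be checked against the percona-toolkit documentation.

```python
import subprocess

# DSNs for the two cloned clusters (placeholders).
CLUSTER_A = "h=verify-cluster-a.cluster-xxxx.ap-northeast-1.rds.amazonaws.com,u=verify,p=secret"
CLUSTER_B = "h=verify-cluster-b.cluster-xxxx.ap-northeast-1.rds.amazonaws.com,u=verify,p=secret"

# Replay the slow log collected in (1) against both clusters and report
# queries whose results or execution time differ between v2 and v3.
# The slow log must first be downloaded from the v2 cluster.
subprocess.run(
    ["pt-upgrade", "slow-query.log", CLUSTER_A, CLUSTER_B],
    check=True,
)
```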
4) Create B/G Deployments
If no queries have slowed down, or once the slow ones have been tuned, we use the B/G Deployments function to create a Green cluster.
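Assuming the B/G deployment is created via the API, a boto3 sketch could look like this; the deployment name, source cluster ARN, target version, and parameter group names are placeholders.

```python
import boto3

rds = boto3.client("rds", region_name="ap-northeast-1")

# Create a Blue/Green deployment whose Green side runs Aurora MySQL v3.
rds.create_blue_green_deployment(
    BlueGreenDeploymentName="aurora-v2-to-v3",
    Source="arn:aws:rds:ap-northeast-1:123456789012:cluster:prod-v2-cluster",
    TargetEngineVersion="8.0.mysql_aurora.3.05.2",
    TargetDBClusterParameterGroupName="prod-v3-cluster-params",
    TargetDBParameterGroupName="prod-v3-instance-params",
)
```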
5) ※ Create records for weighted routing in Route53
In the earlier failure, all read queries were directed to the Green cluster at once, which widened the impact.
This time we use Route53's weighted routing to keep the impact on users as small as possible.
We register CNAME records at a 10:1 ratio, with the Blue reader endpoint weighted 10 and the Green reader endpoint weighted 1.
※ This step is marked with ※ because, depending on the driver or implementation, it may not be possible to reach the cluster endpoint via weighted records.
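A boto3 sketch of the weighted records might look like the following; the hosted zone ID, record name, and reader endpoints are placeholders.

```python
import boto3

route53 = boto3.client("route53")

HOSTED_ZONE_ID = "Z0123456789ABCDEFGHIJ"   # private hosted zone (placeholder)
RECORD_NAME = "db-reader.example.internal"  # name the application connects to

def upsert_weighted_reader(set_identifier: str, endpoint: str, weight: int) -> None:
    """Create or update one weighted CNAME pointing at a reader endpoint."""
    route53.change_resource_record_sets(
        HostedZoneId=HOSTED_ZONE_ID,
        ChangeBatch={
            "Changes": [{
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": RECORD_NAME,
                    "Type": "CNAME",
                    "SetIdentifier": set_identifier,
                    "Weight": weight,
                    "TTL": 60,  # short TTL so weight changes take effect quickly
                    "ResourceRecords": [{"Value": endpoint}],
                },
            }],
        },
    )

# Blue (v2) reader gets weight 10, Green (v3) reader gets weight 1.
upsert_weighted_reader("blue", "prod-v2.cluster-ro-xxxx.ap-northeast-1.rds.amazonaws.com", 10)
upsert_weighted_reader("green", "prod-v2-green-xxxx.cluster-ro-yyyy.ap-northeast-1.rds.amazonaws.com", 1)
```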
6) Point the application's reference to the record created in (5)
Change the application's read destination to this record.
If for some reason weighted routing records cannot be used at this point, an alternative such as pointing only some application servers at the Green reader endpoint is also acceptable.
If there are no issues, gradually increase the weighting until eventually all read queries are directed to the Green reader endpoint.
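Increasing the weighting just means re-running the same UPSERT from (5) with new weights (for example 5:5, then 1:10). On the application side, a hedged sketch of connecting through the weighted record follows (pymysql, the hostname, and the credentials are assumptions); note that the weights only apply when a connection re-resolves the name, so long-lived connection pools need to be recycled for the ratio to take effect.

```python
import pymysql

# Connect to the weighted Route53 record instead of a specific reader endpoint.
# Weighted routing only applies when the name is re-resolved, so long-lived
# connection pools should be recycled periodically for the ratio to take effect.
conn = pymysql.connect(
    host="db-reader.example.internal",  # the weighted CNAME created in (5)
    user="app",
    password="secret",
    database="service",
    connect_timeout=3,
)
with conn.cursor() as cur:
    cur.execute("SELECT 1")
    print(cur.fetchone())
conn.close()
```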
7) Change the reference back to the original reader endpoint and perform the B/G switch.
After checking the above and confirming that there are no problems, change the application's reference back to the original endpoint and perform the B/G switch to finish.
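For reference, a boto3 sketch of the final switchover might look like this; the deployment name and timeout are placeholders.

```python
import boto3

rds = boto3.client("rds", region_name="ap-northeast-1")

# Look up the Blue/Green deployment created earlier and switch it over.
deployments = rds.describe_blue_green_deployments(
    Filters=[{"Name": "blue-green-deployment-name", "Values": ["aurora-v2-to-v3"]}]
)
bg_id = deployments["BlueGreenDeployments"][0]["BlueGreenDeploymentIdentifier"]

rds.switchover_blue_green_deployment(
    BlueGreenDeploymentIdentifier=bg_id,
    SwitchoverTimeout=300,  # abort if the switchover does not finish within 300 seconds
)
```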
Conclusion
This issue did not occur in the development and staging environments, which had a small amount of data.
This made me realize once again how important it is to verify with production-scale data.
Although it will be more work, we would like to continue upgrading in a safer manner.
SRG is looking for people to work with us. If you are interested, please contact us here.