How to switch to Aurora MySQL Version 3 more safely

This is Onkai Yuta (@fat47) from the Service Reliability Group (SRG) of the Media Headquarters.
#SRG (Service Reliability Group) is a group that mainly provides cross-sectional support for the infrastructure of our media services, improving existing services, launching new ones, and contributing to OSS.
This article explains how to switch to Aurora MySQL Version 3 more safely.
 

Is your migration to Aurora MySQL v3 going smoothly?


I hope everyone is making progress with their migration to Aurora MySQL v3. There are now just under five months until the end of standard support for v2.
I thought things were going well for us too, but we ran into some trouble and are now slightly behind schedule.

The current process for switching to v3


Up until now, the switch to v3 has generally been carried out in the following order:
  • Create B/G Deployments in the development and staging environments, verify operation, and switch
  • Create a B/G Deployment in the production environment
  • Point the production application's read queries at the Green reader endpoint and verify
  • Revert to the original reader endpoint
  • Execute the B/G switchover

Problems after changing the read destination to v3


After pointing reads at v3, the performance of a certain query became more than 100 times worse.
The service could no longer operate at that level, so we switched back.
 
I will explain the details in another article, but in short, the v3 cluster selected a different index than the v2 cluster did,
and the root cause was that the query ended up being executed as a full scan on v3.
Fortunately, we had only changed the application's read destination, so we were able to revert quickly.
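Incidentally, a regression like this shows up clearly when you compare EXPLAIN output for the suspect query on the two clusters. A minimal sketch, with hypothetical hostnames and a made-up query:

  # Run the same EXPLAIN against the v2 (Blue) and v3 (Green) readers.
  QUERY="EXPLAIN SELECT * FROM articles WHERE author_id = 42 ORDER BY published_at DESC LIMIT 10"
  mysql -h blue-cluster.cluster-ro-xxxx.ap-northeast-1.rds.amazonaws.com -u app -p -e "$QUERY"
  mysql -h green-cluster.cluster-ro-yyyy.ap-northeast-1.rds.amazonaws.com -u app -p -e "$QUERY"
  # In a case like ours, the v2 plan would show a secondary index in the "key"
  # column, while the v3 plan would show key=NULL and type=ALL (a full table scan).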

Switching procedure after the incident


Here are some good and bad points about this transition:

Good points

  • We verified operation by changing only the application's read destination, without executing the B/G switchover, which made rolling back easy.

What went wrong

  • All read queries were pointed at the Aurora MySQL Version 3 endpoint at once.
  • Although operation was verified in the staging environment, the execution plans on the v2 and v3 clusters were never compared using production-scale data.
 
Taking these lessons into account, for clusters where the impact on users is likely to be particularly large, I decided to make the switch in the following way:
 
  1. Change long_query_time for the v2 cluster's slow log to 0 or 0.1
  2. Duplicate the v2 cluster with the clone function to create clusters A and B, then upgrade cluster B to v3
  3. Run pt-upgrade, included in percona-toolkit, against cluster A and cluster B
    • Replay the queries from the slow log collected in (1)
  4. Create a B/G Deployment
  5. (※) Create weighted routing records in Route 53
    • Register the Blue reader endpoint and the Green reader endpoint with a weighting such as 10:1
  6. Point the application's reads at the record created in (5)
    • (※) If weighted records cannot be used due to driver or implementation constraints, point only a portion of the application servers at the Green reader endpoint instead
  7. Revert reads to the original reader endpoint and execute the B/G switchover
 

1) Change the long_query_time in the v2 cluster slow log to 0 or 0.1

This is to collect as many sample queries as possible in the existing environment.
Since the volume of logged queries will be very large, enable this setting only for a limited window, such as a few hours.
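As a concrete sketch, both settings can be changed on the cluster parameter group with the AWS CLI. The parameter group name here is hypothetical; both parameters are dynamic, so they apply without a reboot:

  # Enable the slow log and lower the threshold on the v2 cluster.
  aws rds modify-db-cluster-parameter-group \
    --db-cluster-parameter-group-name my-v2-cluster-params \
    --parameters \
      "ParameterName=slow_query_log,ParameterValue=1,ApplyMethod=immediate" \
      "ParameterName=long_query_time,ParameterValue=0.1,ApplyMethod=immediate"

Remember to revert long_query_time after the collection window, since logging every query adds I/O overhead.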
 

2) Use the clone function to duplicate the v2 cluster to create clusters A and B, then upgrade cluster B to v3.

This creates the verification environment for pt-upgrade, which is described later. Cluster B is upgraded in place.
The reason we do not use the B/G Deployments function at this stage is that running pt-upgrade against a production environment is not recommended: in an environment created with B/G, the verification queries would be sent to the production Blue cluster.
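A sketch of these two steps with the AWS CLI, using hypothetical cluster identifiers and engine version:

  # Clone the production v2 cluster (copy-on-write, so it is fast and
  # initially shares storage with the source). Repeat for cluster B.
  aws rds restore-db-cluster-to-point-in-time \
    --source-db-cluster-identifier prod-v2-cluster \
    --db-cluster-identifier verify-cluster-a \
    --restore-type copy-on-write \
    --use-latest-restorable-time
  # (A clone has no instances; add one to each cluster with
  #  aws rds create-db-instance before querying it.)

  # Upgrade cluster B in place to a v3 (MySQL 8.0 compatible) version.
  aws rds modify-db-cluster \
    --db-cluster-identifier verify-cluster-b \
    --engine-version 8.0.mysql_aurora.3.05.2 \
    --allow-major-version-upgrade \
    --apply-immediately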
 

3) Run pt-upgrade included in percona-toolkit on cluster A and cluster B

For an explanation of pt-upgrade, please refer to the article written by an intern.
 
Roughly speaking, it lets you run the same queries against two clusters and compare their results and execution times.
The queries to replay can also be generated from a slow log.
Using this tool, we replay the production query log collected in (1) against clusters A and B,
and check whether any queries have slowed down significantly.
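A representative invocation might look like the following; the hosts, credentials, and log path are hypothetical (cluster A is the v2 baseline, cluster B the v3 candidate):

  # Replay the production slow log against both clusters and diff the results.
  pt-upgrade \
    h=verify-cluster-a.cluster-xxxx.ap-northeast-1.rds.amazonaws.com,u=ptuser,p=secret \
    h=verify-cluster-b.cluster-xxxx.ap-northeast-1.rds.amazonaws.com,u=ptuser,p=secret \
    --type slowlog slowquery.log \
    1> pt-upgrade-report.txt 2> pt-upgrade-errors.txt

The report flags queries whose results or execution times differ between the two hosts, which is exactly where a changed execution plan shows up.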
 

4) Creating B/G Deployments

If no queries have slowed down, or if you can fix the ones that have, use the B/G function to create the Green cluster.
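With the AWS CLI this is a single call; the deployment name, source ARN, and target version below are hypothetical:

  # Create the Green (v3) environment alongside the production Blue cluster.
  aws rds create-blue-green-deployment \
    --blue-green-deployment-name aurora-v3-migration \
    --source arn:aws:rds:ap-northeast-1:123456789012:cluster:prod-v2-cluster \
    --target-engine-version 8.0.mysql_aurora.3.05.2 \
    --target-db-cluster-parameter-group-name my-v3-cluster-params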
 

5) ※ Creating a record for weighted routing in Route 53

In the earlier failure, all read queries were pointed at the Green cluster at once, which widened the impact.
This time we use Route 53 weighted routing to keep the impact on users as small as possible.
We register CNAME records for the Blue reader endpoint and the Green reader endpoint with weights of 10 and 1, i.e. a 10:1 ratio.
The ※ on this step means that, depending on the driver or the implementation, it may not be possible to reach the cluster endpoint through weighted records.
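A sketch of the two weighted records; the hosted zone ID, record name, and endpoint hostnames are hypothetical, and the TTL is kept short so weight changes take effect quickly:

  aws route53 change-resource-record-sets \
    --hosted-zone-id Z0123456789EXAMPLE \
    --change-batch '{
      "Changes": [
        {"Action": "UPSERT", "ResourceRecordSet": {
          "Name": "reader.db.example.com", "Type": "CNAME", "TTL": 5,
          "SetIdentifier": "blue", "Weight": 10,
          "ResourceRecords": [{"Value": "prod-v2-cluster.cluster-ro-xxxx.ap-northeast-1.rds.amazonaws.com"}]}},
        {"Action": "UPSERT", "ResourceRecordSet": {
          "Name": "reader.db.example.com", "Type": "CNAME", "TTL": 5,
          "SetIdentifier": "green", "Weight": 1,
          "ResourceRecords": [{"Value": "green-cluster.cluster-ro-yyyy.ap-northeast-1.rds.amazonaws.com"}]}}
      ]}'

Raising the Green share later is just a matter of re-running the same UPSERT with a larger Weight on the green record.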
 

6) Point the application's reference to the record created in (5)

Change the application's read destination to the record created in (5).
If for some reason weighted routing records cannot be used, an alternative such as pointing only some of the application servers at the Green endpoint is also fine.
If no problems appear, gradually increase the weight on the Green side until eventually all read queries are directed to the Green reader endpoint.
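One rough way to sanity-check the traffic split from an application host; the record name is hypothetical, and each lookup is spaced past the 5-second TTL so the resolver re-queries:

  # Count how often each reader endpoint is returned over ~5 minutes.
  for i in $(seq 1 60); do
    dig +short CNAME reader.db.example.com
    sleep 5
  done | sort | uniq -c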
 

7) Return the reference to the original reader endpoint and perform a B/G switch.

After checking the above and confirming that there are no problems, change the application's reference back to the original endpoint, perform the B/G switch, and you're done.
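For completeness, the switchover can also be executed from the CLI; the deployment identifier below is hypothetical:

  # Promote Green to production; Aurora aborts if the switchover
  # exceeds the timeout (in seconds).
  aws rds switchover-blue-green-deployment \
    --blue-green-deployment-identifier bgd-0123456789abcdef \
    --switchover-timeout 300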

Conclusion


This issue did not occur in the development or staging environments, which hold only a small amount of data.
This experience has made me realize once again the importance of conducting verification using the actual amount of data.
Although it will be more work, we would like to continue upgrading in a safer way.
 
SRG is looking for people to work with us. If you're interested, please contact us here.