How to switch to Aurora MySQL Version 3 more safely

This is Yuta Kikai (@fat47) from the Service Reliability Group (SRG) of the Media Division.
#SRG (Service Reliability Group) primarily provides comprehensive support for the infrastructure surrounding our media services, focusing on improving existing services, launching new ones, and contributing to open-source software (OSS).
This article explains how to proceed with a more cautious transition to Aurora MySQL Version 3.
 

Is your migration to Aurora MySQL v3 going smoothly?


How is everyone's migration to Aurora MySQL v3 going? We're now less than five months away from the end of standard support.
I thought ours was going smoothly, but we ran into some trouble and are now slightly behind schedule.

The process of switching to v3 up to now


Up until now, the transition to v3 has generally been carried out in the following manner.
  • Create B/G Deployments in the development and staging environments, verify operation, and perform the switchover.
  • Create B/G Deployments in the production environment.
  • Change the target of the production application's read queries to the Green reader endpoint and verify.
  • Revert to the original reader endpoint.
  • Perform the B/G switchover.

Trouble occurred after pointing read queries at v3


After we pointed read queries at v3, the performance of a certain query worsened by more than 100 times.
The service could no longer keep running in that state, so we rolled back.
 
I'll go into more detail in another article, but the root cause was that the v3 cluster selected a different index than the v2 cluster had, and the query ended up performing a full scan.
Fortunately, this time we had only changed the application's read target, so we were able to revert immediately.
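
A plan difference of this kind can be spotted by comparing the EXPLAIN output for the same query on both clusters. The following is a minimal Python sketch, assuming the EXPLAIN rows have already been fetched as dicts with the standard `table`, `type`, and `key` columns. The function name, table name, and index names are hypothetical and are not the actual query from this incident.

```python
from typing import Dict, List, Optional

ExplainRow = Dict[str, Optional[str]]

# In EXPLAIN output, type "ALL" is a full table scan and "index" is a full index scan.
FULL_SCAN_TYPES = {"ALL", "index"}


def diff_plans(v2_rows: List[ExplainRow], v3_rows: List[ExplainRow]) -> List[str]:
    """Report tables whose chosen index or access type got worse on v3."""
    findings = []
    v3_by_table = {row["table"]: row for row in v3_rows}
    for v2 in v2_rows:
        v3 = v3_by_table.get(v2["table"])
        if v3 is None:
            continue  # table missing from the v3 plan; not handled in this sketch
        if v2["key"] != v3["key"]:
            findings.append(f"{v2['table']}: index changed {v2['key']} -> {v3['key']}")
        if v3["type"] in FULL_SCAN_TYPES and v2["type"] not in FULL_SCAN_TYPES:
            findings.append(f"{v2['table']}: full scan (type={v3['type']}) on v3")
    return findings


# Hypothetical sample rows for illustration only.
v2_plan = [{"table": "articles", "type": "ref", "key": "idx_user_id"}]
v3_plan = [{"table": "articles", "type": "index", "key": "idx_created_at"}]
for finding in diff_plans(v2_plan, v3_plan):
    print(finding)
```

Running this comparison against production-sized data matters, because the optimizer's choice depends on table statistics that small environments do not reproduce.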

The revised switchover process after the incident


The following are the things that went well and the things that went wrong with the process we had been using.

Good points

  • We did not perform the B/G switchover right away; instead, we first changed only the application's read target to confirm operation.

What went wrong

  • We directed all read queries to the Aurora MySQL Version 3 endpoint at once.
  • Although we had confirmed operation in the staging environment, we had not been able to compare the execution plans of the v2 cluster and the v3 cluster using production environment data.
 
Taking these factors into consideration, we decided to proceed with the transition in the following manner, especially for clusters where user impact is likely to be significant.
 
  1. Change the long_query_time of the slow log for the v2 cluster to 0 or 0.1.
  2. Create clusters A and B by cloning the v2 cluster, and then upgrade cluster B to v3.
  3. Run pt-upgrade, included in percona-toolkit, on clusters A and B.
    1. Use the slow log obtained in step 1 as the source of the queries to execute.
  4. Create B/G Deployments.
  5. (*) Create records for weighted routing in Route 53.
    1. Register the Blue reader endpoint and the Green reader endpoint with a weighting such as 10:1.
  6. Point the application's read target at the record created in step 5.
    1. (*) If weighted records cannot be used due to driver or implementation issues, address this by changing the query destination for only some of the application servers to the Green reader endpoint.
  7. Return the read target to the original reader endpoint and perform the B/G switchover.
 

1) Change the long_query_time of the slow log in the v2 cluster to 0 or 0.1.

This is to collect as many sample queries as possible from the existing environment.
Since the slow log would grow very large, we keep this setting only for a limited time, such as a few hours, and then revert it.
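
As a rough illustration of this step, the sketch below builds the `Parameters` list that boto3's `modify_db_cluster_parameter_group` (or `modify_db_parameter_group`) accepts. It only constructs the payload and does not call AWS. The function name is hypothetical, and which parameter group holds these settings in your environment, as well as the value to revert to, are assumptions you should confirm yourself.

```python
from typing import Dict, List


def slow_log_parameters(long_query_time: float) -> List[Dict[str, str]]:
    """Build a Parameters list that enables the slow log with the given threshold."""
    if long_query_time < 0:
        raise ValueError("long_query_time must be non-negative")
    return [
        # Both parameters are dynamic in MySQL, so they can be applied immediately.
        {"ParameterName": "slow_query_log", "ParameterValue": "1", "ApplyMethod": "immediate"},
        {
            "ParameterName": "long_query_time",
            "ParameterValue": f"{long_query_time:g}",  # 0 -> "0", 0.1 -> "0.1"
            "ApplyMethod": "immediate",
        },
    ]


# Capture everything for a limited window, then restore the previous threshold.
# The previous value of 1 second is an assumption for this example.
capture = slow_log_parameters(0)
restore = slow_log_parameters(1)
print(capture[1]["ParameterValue"], restore[1]["ParameterValue"])
```

Keeping the "restore" payload next to the "capture" payload makes it harder to forget reverting the threshold once the collection window ends.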
 

2) Create clusters A and B by cloning the v2 cluster, and then upgrade cluster B to v3.

This creates an environment for testing with pt-upgrade, which is described later. The upgrade of cluster B is performed in place.
The reason we're not using the B/G Deployments feature yet is that running pt-upgrade on a production environment is discouraged. If you run it on an environment created with B/G, validation queries will be sent to the production Blue environment.
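
For reference, an Aurora clone can be requested through boto3's `restore_db_cluster_to_point_in_time` with `RestoreType="copy-on-write"`. The sketch below only builds the keyword arguments for two clones and does not call AWS. The cluster identifiers are hypothetical, and a real request would typically also need networking settings such as a subnet group and security groups, which are omitted here.

```python
from typing import Dict, Union

CloneRequest = Dict[str, Union[str, bool]]


def clone_request(source_cluster: str, clone_name: str) -> CloneRequest:
    """Build keyword arguments for cloning an Aurora cluster (copy-on-write)."""
    return {
        "SourceDBClusterIdentifier": source_cluster,
        "DBClusterIdentifier": clone_name,
        "RestoreType": "copy-on-write",   # clone instead of a full copy
        "UseLatestRestorableTime": True,  # clone the latest state of the source
    }


# Hypothetical identifiers: cluster A stays on v2, cluster B is upgraded to v3 afterwards.
requests = [
    clone_request("prod-v2-cluster", "verify-cluster-a"),
    clone_request("prod-v2-cluster", "verify-cluster-b"),
]
for req in requests:
    print(req["DBClusterIdentifier"], "<-", req["SourceDBClusterIdentifier"])
```

A cluster created this way has no DB instances yet, so an instance has to be added to each clone before pt-upgrade can connect to it.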
 

3) Run pt-upgrade, included in percona-toolkit, on clusters A and B.

For an explanation of pt-upgrade, please refer to the article written by one of our interns.
 
To put it simply, it runs the same queries on two clusters and compares how they behave, including their execution speed.
The queries to be executed can be generated from a slow log.
We use the tool to replay the production query log obtained in (1) against both cluster A and cluster B.
We then check whether any queries have slowed down significantly.
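
pt-upgrade produces its own report, so the following is not its output format. It is only a minimal Python sketch of the check described above, assuming you have already extracted per-query execution times for cluster A (v2) and cluster B (v3) into a dict. The function name, threshold, and sample queries are hypothetical.

```python
from typing import Dict, List, Tuple


def find_regressions(
    timings: Dict[str, Tuple[float, float]], min_ratio: float = 10.0
) -> List[Tuple[str, float]]:
    """Return (query, slowdown ratio) pairs where v3 is at least min_ratio times slower."""
    regressions = []
    for query, (v2_sec, v3_sec) in timings.items():
        if v2_sec <= 0:
            continue  # avoid division by zero for unmeasurably fast queries
        ratio = v3_sec / v2_sec
        if ratio >= min_ratio:
            regressions.append((query, ratio))
    # Worst slowdowns first.
    return sorted(regressions, key=lambda item: item[1], reverse=True)


# Hypothetical measurements in seconds: (cluster A on v2, cluster B on v3).
sample = {
    "SELECT ... FROM articles WHERE user_id = ?": (0.002, 0.350),
    "SELECT ... FROM users WHERE id = ?": (0.001, 0.001),
}
for query, ratio in find_regressions(sample):
    print(f"{ratio:.0f}x slower on v3: {query}")
```

A ratio-based threshold is used here because a query that goes from 2 ms to 350 ms is easy to miss when looking only at absolute times.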
 

4) Creating B/G Deployments

If there are no queries experiencing performance degradation, or if improvements can be made, we will use the B/G function to create a Green cluster.
 

5) (*) Creating a record for weighted routing in Route 53

In the earlier failure, directing all read queries to the Green cluster at once made the impact more widespread.
To minimize the impact on users as much as possible, we will use Route 53's weighted routing.
We register CNAME records with a weight of 10 for the Blue reader endpoint and a weight of 1 for the Green reader endpoint, a ratio of 10:1.
The reason for the asterisk (*) is that, due to driver or implementation limitations, it may not be possible to access the cluster endpoint via weighted records.
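
The weighted records described above can be expressed as a Route 53 `ChangeBatch`, which is the structure passed to boto3's `change_resource_record_sets`. The sketch below only builds that structure and does not call AWS. The record name, endpoint host names, and TTL are hypothetical placeholders.

```python
from typing import Any, Dict


def weighted_cname_batch(
    record_name: str,
    blue_endpoint: str,
    green_endpoint: str,
    blue_weight: int = 10,
    green_weight: int = 1,
    ttl: int = 60,
) -> Dict[str, Any]:
    """Build a ChangeBatch with two weighted CNAME records sharing one name."""
    for weight in (blue_weight, green_weight):
        if not 0 <= weight <= 255:
            raise ValueError("Route 53 weights must be between 0 and 255")

    def change(set_id: str, target: str, weight: int) -> Dict[str, Any]:
        return {
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": record_name,
                "Type": "CNAME",
                "SetIdentifier": set_id,  # distinguishes records with the same name
                "Weight": weight,
                "TTL": ttl,
                "ResourceRecords": [{"Value": target}],
            },
        }

    return {
        "Changes": [
            change("blue", blue_endpoint, blue_weight),
            change("green", green_endpoint, green_weight),
        ]
    }


# Hypothetical names for illustration only.
batch = weighted_cname_batch(
    "db-reader.example.internal.",
    "blue-cluster.cluster-ro-xxxx.example.com",
    "green-cluster.cluster-ro-xxxx.example.com",
)
for item in batch["Changes"]:
    rrset = item["ResourceRecordSet"]
    print(rrset["SetIdentifier"], rrset["Weight"], rrset["ResourceRecords"][0]["Value"])
```

A short TTL is assumed here so that weight changes take effect reasonably quickly; the weights apply to DNS lookups, so the actual split of queries also depends on how the application caches DNS and reuses connections.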
 

6) The application's reference point is directed to the record created in (5).

This changes the application's read target.
If, for some reason, weighted routing records cannot be used at this point, alternatives such as directing only some of the application servers to the Green reader endpoint are also acceptable.
If there are no issues, gradually increase the Green weight until all read queries are directed to the Green reader endpoint.
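
With weighted routing, each record is returned in proportion to its weight divided by the sum of all weights, so a 10:1 setting sends roughly 1 in 11 lookups (about 9%) to Green. The sketch below works through that arithmetic for a ramp-up; the intermediate weight steps are a hypothetical schedule, not the one we actually used.

```python
def green_share(blue_weight: int, green_weight: int) -> float:
    """Fraction of DNS lookups expected to resolve to the Green record."""
    total = blue_weight + green_weight
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    return green_weight / total


# Hypothetical ramp from the initial 10:1 to Green only.
ramp = [(10, 1), (3, 1), (1, 1), (1, 3), (0, 1)]
for blue, green in ramp:
    share = green_share(blue, green)
    print(f"Blue {blue} : Green {green} -> {share:.1%} of lookups resolve to Green")
```

Pausing at each step to watch slow queries and load on the Green cluster keeps any regression limited to the share of traffic exposed at that point.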
 

7) Return the read target to the original reader endpoint and perform the B/G switchover.

Once you have performed the above checks and confirmed that everything is working correctly, revert the application's reference to the original endpoint, perform the B/G switch, and you're done.

In conclusion


This issue did not occur in development or staging environments with small amounts of data, so I've come to realize once again the importance of conducting tests with the actual amount of data used in production.
It will increase the workload, but we want to continue upgrading in a safer way.
 
SRG is looking for new team members. If you are interested, please contact us here.