How to switch to Aurora MySQL Version 3 more safely

This is Onkai Yuta (@fat47) from the Service Reliability Group (SRG) of the Media Headquarters.
#SRG (Service Reliability Group) is a group that mainly provides cross-sectional support for the infrastructure of our media services, improving existing services, launching new ones, and contributing to OSS.
This article explains how to switch to Aurora MySQL Version 3 more safely.
 

Is your migration to Aurora MySQL v3 going smoothly?


I hope everyone is making progress with their migration to Aurora MySQL v3. There are now just under five months until the end of standard support for v2.
I thought things were going well for us too, but we ran into some trouble and are now slightly behind schedule.

The current process for switching to v3


Up until now, the switch to v3 has generally been carried out in the following order:
  • Create B/G Deployments in the development and staging environments, verify operation, and switch
  • Create a B/G Deployment in the production environment
  • Point the production application's read queries at the Green reader endpoint and verify
  • Revert to the original reader endpoint
  • Execute the B/G switchover

Problems after changing the read destination to v3


After pointing reads at v3, the performance of a certain query became more than 100 times worse.
The service could no longer operate at that level, so we switched back.
 
I will explain the details in another article, but in short, the v3 cluster selected a different index than the v2 cluster did,
and the root cause was that the query ended up being executed as a full scan on v3.
Fortunately, we had only changed the application's read destination, so we were able to revert quickly.
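Incidentally, a regression like this shows up clearly when you compare EXPLAIN output for the suspect query on the two clusters. A minimal sketch, with hypothetical hostnames and a made-up query:

  # Run the same EXPLAIN against the v2 (Blue) and v3 (Green) readers.
  QUERY="EXPLAIN SELECT * FROM articles WHERE author_id = 42 ORDER BY published_at DESC LIMIT 10"
  mysql -h blue-cluster.cluster-ro-xxxx.ap-northeast-1.rds.amazonaws.com -u app -p -e "$QUERY"
  mysql -h green-cluster.cluster-ro-yyyy.ap-northeast-1.rds.amazonaws.com -u app -p -e "$QUERY"
  # In a case like ours, the v2 plan would show a secondary index in the "key"
  # column, while the v3 plan would show key=NULL and type=ALL (a full table scan).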

Switching procedure after the incident


Here are some good and bad points about this transition:

Good points

  • We verified operation by changing only the application's read destination, without executing the B/G switchover, which made rolling back easy.

What went wrong

  • All read queries were pointed at the Aurora MySQL Version 3 endpoint at once.
  • Although operation was verified in the staging environment, the execution plans on the v2 and v3 clusters were never compared using production-scale data.
 
Taking these lessons into account, for clusters where the impact on users is likely to be particularly large, I decided to make the switch in the following way:
 
  1. Change long_query_time for the v2 cluster's slow log to 0 or 0.1
  2. Duplicate the v2 cluster with the clone function to create clusters A and B, then upgrade cluster B to v3
  3. Run pt-upgrade, included in percona-toolkit, against cluster A and cluster B
    • Replay the queries from the slow log collected in (1)
  4. Create a B/G Deployment
  5. (※) Create weighted routing records in Route 53
    • Register the Blue reader endpoint and the Green reader endpoint with a weighting such as 10:1
  6. Point the application's reads at the record created in (5)
    • (※) If weighted records cannot be used due to driver or implementation constraints, point only a portion of the application servers at the Green reader endpoint instead
  7. Revert reads to the original reader endpoint and execute the B/G switchover
 

1) Change the long_query_time in the v2 cluster slow log to 0 or 0.1

This is to collect as many sample queries as possible in the existing environment.
Since the volume of logged queries will be very large, enable this setting only for a limited window, such as a few hours.
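As a concrete sketch, both settings can be changed on the cluster parameter group with the AWS CLI. The parameter group name here is hypothetical; both parameters are dynamic, so they apply without a reboot:

  # Enable the slow log and lower the threshold on the v2 cluster.
  aws rds modify-db-cluster-parameter-group \
    --db-cluster-parameter-group-name my-v2-cluster-params \
    --parameters \
      "ParameterName=slow_query_log,ParameterValue=1,ApplyMethod=immediate" \
      "ParameterName=long_query_time,ParameterValue=0.1,ApplyMethod=immediate"

Remember to revert long_query_time after the collection window, since logging every query adds I/O overhead.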
 

2) Use the clone function to duplicate the v2 cluster to create clusters A and B, then upgrade cluster B to v3.

This creates the verification environment for pt-upgrade, which is described later. Cluster B is upgraded in place.
The reason we do not use the B/G Deployments function at this stage is that running pt-upgrade against a production environment is not recommended: in an environment created with B/G, the verification queries would be sent to the production Blue cluster.
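A sketch of these two steps with the AWS CLI, using hypothetical cluster identifiers and engine version:

  # Clone the production v2 cluster (copy-on-write, so it is fast and
  # initially shares storage with the source). Repeat for cluster B.
  aws rds restore-db-cluster-to-point-in-time \
    --source-db-cluster-identifier prod-v2-cluster \
    --db-cluster-identifier verify-cluster-a \
    --restore-type copy-on-write \
    --use-latest-restorable-time
  # (A clone has no instances; add one to each cluster with
  #  aws rds create-db-instance before querying it.)

  # Upgrade cluster B in place to a v3 (MySQL 8.0 compatible) version.
  aws rds modify-db-cluster \
    --db-cluster-identifier verify-cluster-b \
    --engine-version 8.0.mysql_aurora.3.05.2 \
    --allow-major-version-upgrade \
    --apply-immediately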
 

3) Run pt-upgrade included in percona-toolkit on cluster A and cluster B

For an explanation of pt-upgrade, please refer to the article written by an intern.
 
Roughly speaking, it lets you run the same queries against two clusters and compare their results and execution times.
The queries to replay can also be generated from a slow log.
Using this tool, we replay the production query log collected in (1) against clusters A and B,
and check whether any queries have slowed down significantly.
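A representative invocation might look like the following; the hosts, credentials, and log path are hypothetical (cluster A is the v2 baseline, cluster B the v3 candidate):

  # Replay the production slow log against both clusters and diff the results.
  pt-upgrade \
    h=verify-cluster-a.cluster-xxxx.ap-northeast-1.rds.amazonaws.com,u=ptuser,p=secret \
    h=verify-cluster-b.cluster-xxxx.ap-northeast-1.rds.amazonaws.com,u=ptuser,p=secret \
    --type slowlog slowquery.log \
    1> pt-upgrade-report.txt 2> pt-upgrade-errors.txt

The report flags queries whose results or execution times differ between the two hosts, which is exactly where a changed execution plan shows up.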
 

4) Creating B/G Deployments

If no queries have slowed down, or if you can fix the ones that have, use the B/G function to create the Green cluster.
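With the AWS CLI this is a single call; the deployment name, source ARN, and target version below are hypothetical:

  # Create the Green (v3) environment alongside the production Blue cluster.
  aws rds create-blue-green-deployment \
    --blue-green-deployment-name aurora-v3-migration \
    --source arn:aws:rds:ap-northeast-1:123456789012:cluster:prod-v2-cluster \
    --target-engine-version 8.0.mysql_aurora.3.05.2 \
    --target-db-cluster-parameter-group-name my-v3-cluster-params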
 

5) ※ Creating a record for weighted routing in Route 53

In the earlier failure, all read queries were pointed at the Green cluster at once, which widened the impact.
This time we use Route 53 weighted routing to keep the impact on users as small as possible.
We register CNAME records for the Blue reader endpoint and the Green reader endpoint with weights of 10 and 1, i.e. a 10:1 ratio.
The ※ on this step means that, depending on the driver or the implementation, it may not be possible to reach the cluster endpoint through weighted records.
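A sketch of the two weighted records; the hosted zone ID, record name, and endpoint hostnames are hypothetical, and the TTL is kept short so weight changes take effect quickly:

  aws route53 change-resource-record-sets \
    --hosted-zone-id Z0123456789EXAMPLE \
    --change-batch '{
      "Changes": [
        {"Action": "UPSERT", "ResourceRecordSet": {
          "Name": "reader.db.example.com", "Type": "CNAME", "TTL": 5,
          "SetIdentifier": "blue", "Weight": 10,
          "ResourceRecords": [{"Value": "prod-v2-cluster.cluster-ro-xxxx.ap-northeast-1.rds.amazonaws.com"}]}},
        {"Action": "UPSERT", "ResourceRecordSet": {
          "Name": "reader.db.example.com", "Type": "CNAME", "TTL": 5,
          "SetIdentifier": "green", "Weight": 1,
          "ResourceRecords": [{"Value": "green-cluster.cluster-ro-yyyy.ap-northeast-1.rds.amazonaws.com"}]}}
      ]}'

Raising the Green share later is just a matter of re-running the same UPSERT with a larger Weight on the green record.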
 

6) Point the application's reference to the record created in (5)

Change the application's read destination to the record created in (5).
If for some reason weighted routing records cannot be used, an alternative such as pointing only some of the application servers at the Green endpoint is also fine.
If no problems appear, gradually increase the weight on the Green side until eventually all read queries are directed to the Green reader endpoint.
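One rough way to sanity-check the traffic split from an application host; the record name is hypothetical, and each lookup is spaced past the 5-second TTL so the resolver re-queries:

  # Count how often each reader endpoint is returned over ~5 minutes.
  for i in $(seq 1 60); do
    dig +short CNAME reader.db.example.com
    sleep 5
  done | sort | uniq -c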
 

7) Return the reference to the original reader endpoint and perform a B/G switch.

After checking the above and confirming that there are no problems, change the application's reference back to the original endpoint, perform the B/G switch, and you're done.
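For completeness, the switchover can also be executed from the CLI; the deployment identifier below is hypothetical:

  # Promote Green to production; Aurora aborts if the switchover
  # exceeds the timeout (in seconds).
  aws rds switchover-blue-green-deployment \
    --blue-green-deployment-identifier bgd-0123456789abcdef \
    --switchover-timeout 300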

Conclusion


This issue did not occur in the development or staging environments, which hold only a small amount of data.
This experience has made me realize once again the importance of conducting verification using the actual amount of data.
Although it will be more work, we would like to continue upgrading in a safer way.
 
SRG is looking for people to work with us. If you're interested, please contact us here.