
Dramatic improvement in Terraform execution time on AmebaPlatform

2024/12/21 14:01, 2024/12/22 14:48

This article is the 22nd entry in the CyberAgent Group SRE Advent Calendar 2024.

My name is Taninari, and I work in the Service Reliability Group (SRG) of the Media Headquarters.

#SRG (Service Reliability Group) mainly provides cross-functional support for the infrastructure of our media services: improving existing services, launching new ones, contributing to OSS, and so on.

This article describes the changes we made to our GitHub Actions workflows to shorten Terraform execution time.

Introduction
Technology (service) used this time
Timing of terraform Plan and Apply execution for each environment
Understanding the current situation
Cause analysis and correction plan
Implementation of Plan/Apply on GitHub Actions
Implementing health check
Improvement results
Conclusion

Introduction

On AmebaPlatform there were many requests that required running terraform apply, such as changing resources like security groups and ALB listener rules, or upgrading resources, and for a long time these operations were slow for every user.

In this article, we introduce some of the improvements we made to the Terraform runs on GitHub Actions to address this situation.

Technology (service) used this time

terraform

GitHub Actions

Timing of terraform Plan and Apply execution for each environment

Basically, when you open a Pull Request (PR) against the branch for a specific environment, Plan runs and other people review the result; once the PR is merged, Apply runs.

For example, to run Plan for the develop environment, you open a PR against the develop branch, check the Plan output, and if there are no problems, merge it, which triggers Apply.

Understanding the current situation

I had the impression that Plan/Apply was slow, so I first measured how long it currently takes, in order to quantify the improvement afterwards. I called the GitHub API from a Python script to collect the Plan and Apply execution times for one month. Below are the results I got from the script.

From the results above, we can see that both Plan and Apply took more than 5 minutes to execute, even looking at the median values.
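The measurement described above could be sketched roughly as follows. This is not the script used for the article: it assumes the GitHub REST endpoint for listing workflow runs, approximates a run's duration as the gap between its run_started_at and updated_at timestamps, and reads only one page of results. The function names and the workflow file name passed in are hypothetical.

```python
import json
import statistics
import urllib.request
from datetime import datetime

API = "https://api.github.com"


def fetch_runs(owner, repo, workflow_file, token, created):
    """Fetch one page of completed runs of a workflow (no pagination).

    `created` is a date range such as "2024-11-01..2024-11-30".
    """
    url = (f"{API}/repos/{owner}/{repo}/actions/workflows/{workflow_file}/runs"
           f"?status=completed&created={created}&per_page=100")
    req = urllib.request.Request(url, headers={
        "Authorization": f"Bearer {token}",
        "Accept": "application/vnd.github+json",
    })
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["workflow_runs"]


def _parse(ts):
    # GitHub returns UTC timestamps such as 2024-12-01T01:02:03Z
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ")


def duration_seconds(run):
    """Approximate wall-clock seconds from run start to its last update."""
    return (_parse(run["updated_at"]) - _parse(run["run_started_at"])).total_seconds()


def summarize(runs):
    """Return count, median, mean and max duration (seconds) for the given runs."""
    durations = [duration_seconds(r) for r in runs]
    return {
        "count": len(durations),
        "median": statistics.median(durations),
        "mean": statistics.mean(durations),
        "max": max(durations),
    }
```

Running summarize once on the Plan workflow's runs and once on the Apply workflow's runs gives the kind of median, average, and max figures discussed in this article.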

Cause analysis and correction plan

I had a hunch about the cause. Because the number of products was small at first, we ran Plan/Apply for all products every time. Naturally, as the number of products grew, so did the execution time of Plan/Apply. On top of that, it was a big problem when diffs appeared in parts I had not changed, because I then had to investigate and resolve them.

So, as our correction plan, we decided to run Plan/Apply only for the products that were changed on GitHub.

However, with this approach you may miss changes that you would previously have noticed by running Plan. Also, for products that are rarely updated, Plan may not run for a long time, so the Terraform code may no longer work when you finally need to run it.

To avoid discovering this only in the middle of an incident, we decided to create a workflow that runs Plan for all products once a day.

Implementation of Plan/Apply on GitHub Actions

Simply put, terraform has the following directory structure.

git diff is used to detect which directories a PR has changed.
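As an illustration of limiting Plan to what a PR touched, here is a minimal sketch that turns the output of git diff --name-only into a list of directories. It assumes that each directory containing changed .tf or .tfvars files is planned on its own; the function names and the example paths are hypothetical and do not reflect the actual AmebaPlatform layout.

```python
import subprocess
from pathlib import PurePosixPath


def changed_files(base, head="HEAD"):
    """Return paths changed between the merge base of `base` and `head`."""
    out = subprocess.run(
        ["git", "diff", "--name-only", f"{base}...{head}"],
        check=True, capture_output=True, text=True,
    ).stdout
    return [line for line in out.splitlines() if line]


def changed_dirs(paths, suffixes=(".tf", ".tfvars")):
    """Map changed file paths to the sorted set of directories that hold them."""
    dirs = set()
    for p in paths:
        path = PurePosixPath(p)
        # Ignore files that cannot affect a plan (docs, CI config, ...)
        if path.suffix in suffixes:
            dirs.add(str(path.parent))
    return sorted(dirs)
```

For example, changed_dirs(changed_files("origin/develop")) would list the directories to plan for a PR against develop. A real setup would also need to handle changes to shared modules, which this sketch does not attempt.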

Create a unique file name that includes the directory path and PR number, etc., and upload it.
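One way to build such a name is sketched below. The exact scheme used on AmebaPlatform is not shown in this article, so the tfplan prefix and the name format are assumptions. To my understanding, artifact names may not contain characters such as slashes, colons, or quotes, so the directory path is flattened before use.

```python
import re

# Characters that are not accepted in artifact names, plus whitespace
_UNSAFE = re.compile(r'[\\/:*?"<>|\s]+')


def artifact_name(directory, pr_number, prefix="tfplan"):
    """Build an artifact name from a directory path and a PR number."""
    # Replace each run of unsafe characters with a single underscore
    slug = _UNSAFE.sub("_", directory.strip("/"))
    return f"{prefix}-pr{pr_number}-{slug}"


print(artifact_name("products/a", 123))  # tfplan-pr123-products_a
```

Note that flattening can collide (products/a and products_a map to the same name), so a stricter encoding would be needed if such directory names can coexist.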

The upload destination was originally going to be S3, but I was told that GitHub Actions has a place for uploading files called artifacts, so I decided to use that. There are limits on size, retention period, number of files, file name length, and so on, but I confirmed that none of them would be a problem for us.

Incidentally, uploading files from GitHub Actions is easy; I used the upload-artifact action.

Check the Plan results and, if there are no problems, run Apply. As mentioned earlier, Apply runs when the user merges the PR. When Apply runs, it downloads the files from the artifacts using the unique path built from the PR number and so on, and then runs only against the products that were checked by Plan.
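To make the lookup at Apply time concrete, the sketch below picks the plan artifacts that belong to one PR out of a "list artifacts for a repository" API response. It assumes a hypothetical naming scheme of the form tfplan-pr<number>-<directory>, which is not confirmed by this article, and keeps only the newest non-expired artifact per name.

```python
def plans_for_pr(listing, pr_number, prefix="tfplan"):
    """Select the newest non-expired plan artifact per name for one PR."""
    # The trailing "-" keeps PR 12 from matching artifacts of PR 123
    wanted = f"{prefix}-pr{pr_number}-"
    newest = {}
    for art in listing["artifacts"]:
        if art["expired"] or not art["name"].startswith(wanted):
            continue
        cur = newest.get(art["name"])
        # UTC timestamps in the same ISO 8601 format compare correctly as strings
        if cur is None or art["created_at"] > cur["created_at"]:
            newest[art["name"]] = art
    return sorted(newest.values(), key=lambda a: a["name"])
```

Each selected entry carries an archive_download_url from which the saved plan can be fetched before Apply runs.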

Implementing health check

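The daily Plan for all products is implemented as a separate workflow, which can also be started manually through workflow_dispatch.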
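The daily check over all products could look something like the sketch below. It relies on terraform plan -detailed-exitcode, which exits with 0 when there are no changes, 1 on error, and 2 when the plan contains changes. It assumes terraform init has already been run in each directory; the function names are hypothetical, and the actual health check workflow may be organized differently.

```python
import subprocess

# Exit codes of `terraform plan -detailed-exitcode`
STATUS = {0: "no_changes", 1: "error", 2: "changes"}


def plan_dir(directory):
    """Run a read-only plan in one directory and return its exit code."""
    proc = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-lock=false"],
        cwd=directory,
    )
    return proc.returncode


def health_check(directories, run=plan_dir):
    """Plan every directory; unknown exit codes are treated as errors."""
    return {d: STATUS.get(run(d), "error") for d in directories}


def needs_attention(report):
    """Directories whose plan failed or showed unexpected changes."""
    return sorted(d for d, s in report.items() if s != "no_changes")
```

Passing the runner as a parameter keeps the reporting logic testable without Terraform installed.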

Furthermore, whereas previously only an email was sent when Plan/Apply failed, failures are now also reported to a specific Slack channel, so that the person in charge or the person who ran it notices immediately.
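The article does not say how the Slack notification is sent. As one possibility, the sketch below posts a short message through a Slack incoming webhook, which accepts a JSON body with a text field. The message format, function names, and the idea of passing the workflow run URL are all assumptions.

```python
import json
import urllib.request


def failure_message(workflow, directory, run_url):
    """Build the JSON payload for a failure notification."""
    return {"text": f":x: {workflow} failed for {directory}\n{run_url}"}


def notify(webhook_url, payload):
    """POST the payload to a Slack incoming webhook and return the HTTP status."""
    req = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status
```

In a workflow, notify would be called only from a step that runs when Plan or Apply has failed, with the webhook URL supplied as a secret.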

Improvement results

Below are the actual, unadjusted values for the most recent month.

335.0 -> 100.0 (an improvement of about 70.15%)
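The improvement rate quoted here is the relative reduction from the old value to the new one, which can be checked with a few lines (the function name is only illustrative):

```python
def improvement_rate(before, after):
    """Relative reduction from `before` to `after`, as a percentage."""
    return (before - after) / before * 100


print(f"{improvement_rate(335.0, 100.0):.2f}%")  # 70.15%
```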

Apply still appears to take a long time for the following reason. At the time we made the improvement we were using EKS self-managed node groups, so node groups were not updated from the Terraform side. Now that we have introduced Karpenter, managed node groups are updated through Terraform, and those updates take quite a long time, which is why the Max and Average values look large.

Conclusion

In this article, we introduced a case study of how AmebaPlatform improved Plan/Apply on GitHub Actions. Through this work, we reduced the time required to run Plan/Apply, which used to be slow, and improved productivity.

There still seem to be other areas where productivity can be improved, so I hope to work on them and write about the results on this blog again.

SRG is looking for people to work with us. If you are interested, please contact us here.

Recruitment information - CyberAgent SRG #ca_srg


https://ca-srg.dev/careers