Dramatic improvement in Terraform execution time on AmebaPlatform
My name is Taninari, and I work in the Service Reliability Group (SRG) of the Media Headquarters.
SRG (Service Reliability Group) is a group that mainly provides cross-sectional support for the infrastructure of our media services, improving existing services, launching new ones, and contributing to OSS.
This article introduces activities that improved the execution time of Terraform running on GitHub Actions.
- Introduction
- Technology (service) used this time
- Timing of terraform Plan and Apply execution for each environment
- Understanding the current situation
- Cause analysis and correction plan
- Implementation of Plan/Apply on GitHub Actions
- Implementing a health check
- Improvement results
- Conclusion
Introduction
On AmebaPlatform, many changes required running terraform apply, such as modifying resources like security groups and ALB listener rules or upgrading resources, and for a long time these operations took every user a lot of time.
This article describes some of the changes we made to Terraform running on GitHub Actions to improve this situation.
Technology (service) used this time
- terraform
- GitHub Actions
Timing of terraform Plan and Apply execution for each environment
Basically, when you submit a Pull Request (PR) against the branch for a specific environment, Plan runs, other people review it, and once the PR is merged, Apply runs.
For example, to change the develop environment, you submit a PR to the develop branch, check the Plan output, and if there are no problems, merge the PR so that Apply runs.
Understanding the current situation
My impression was that Plan/Apply was slow, so I first measured how long it actually took, so that I could quantify the improvement afterwards. I used a Python script to access the GitHub API and collect the Plan and Apply execution times for one month.
The results showed that both Plan and Apply took more than 5 minutes to run, even at the median.
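A minimal sketch of this kind of measurement is shown here; it is not the script used in the article. It assumes the GitHub REST API endpoint that lists runs for a workflow, and it approximates each run's duration as updated_at minus run_started_at. The function names and the owner, repo, and workflow_file arguments are hypothetical.

```python
import json
import statistics
import urllib.parse
import urllib.request
from datetime import datetime


def fetch_runs(owner, repo, workflow_file, token, created):
    """Fetch one page (up to 100) of runs for a workflow.

    `created` is a date range such as "2024-01-01..2024-01-31".
    All argument values are placeholders, not taken from the article.
    """
    query = urllib.parse.urlencode({"per_page": 100, "created": created})
    url = (
        f"https://api.github.com/repos/{owner}/{repo}"
        f"/actions/workflows/{workflow_file}/runs?{query}"
    )
    req = urllib.request.Request(url, headers={
        "Authorization": f"Bearer {token}",
        "Accept": "application/vnd.github+json",
    })
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["workflow_runs"]


def duration_seconds(run):
    """Approximate a run's duration from its start and last-update timestamps."""
    start = datetime.fromisoformat(run["run_started_at"].replace("Z", "+00:00"))
    end = datetime.fromisoformat(run["updated_at"].replace("Z", "+00:00"))
    return (end - start).total_seconds()


def summarize(runs):
    """Return count, median, mean, and max duration of completed runs."""
    durations = [duration_seconds(r) for r in runs if r.get("status") == "completed"]
    if not durations:
        return {"count": 0}
    return {
        "count": len(durations),
        "median": statistics.median(durations),
        "mean": statistics.mean(durations),
        "max": max(durations),
    }
```

Calling summarize(fetch_runs(...)) once for the Plan workflow and once for the Apply workflow would give comparable before and after numbers. A month with more than 100 runs would need pagination, which this sketch omits.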
Cause analysis and correction plan
I already had an idea of the cause. Because the number of products was small at first, both Plan and Apply were run for all products. Naturally, as the number of products increased, the execution time of Plan and Apply increased too. On top of that, differences sometimes appeared in parts I had not changed, and I then had to investigate and resolve them before I could proceed.
So, as the correction plan, we decided to run Plan/Apply only for the products that were changed on GitHub.
However, with this method you may miss differences that you previously noticed because Plan ran everywhere, and for products that rarely receive updates, Plan may not run for a long time, so it might fail when you eventually need to run it.
To avoid discovering this only when an incident occurs, we decided to create a workflow that runs Plan for all products once a day.
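One way such a daily check could detect unexpected differences is Terraform's -detailed-exitcode option, where plan exits with 0 for no changes, 1 for an error, and 2 when changes are present. The sketch below assumes that approach; the article does not say whether it was used, and the function names and directory names are hypothetical.

```python
def health_check_command(product_dir):
    """Build a plan command whose exit code distinguishes drift from errors."""
    return [
        "terraform", f"-chdir={product_dir}",
        "plan", "-input=false", "-detailed-exitcode",
    ]


def classify_plan_exit(code):
    """Translate the exit code of `terraform plan -detailed-exitcode`."""
    return {0: "no_changes", 1: "error", 2: "changes"}.get(code, "unknown")


def products_needing_attention(exit_codes):
    """Given {product_dir: exit_code}, list products that are not clean."""
    return sorted(
        product
        for product, code in exit_codes.items()
        if classify_plan_exit(code) != "no_changes"
    )
```

For example, products_needing_attention({"products/blog": 0, "products/news": 2}) returns only the product whose plan is not empty, which is the list a daily job would report.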
Implementation of Plan/Apply on GitHub Actions
Simply put, terraform has the following directory structure.
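The changed directories are identified with git diff, and Plan is run only for those directories.
The Plan output is then saved under a unique file name that includes the directory path and the PR number, and uploaded.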
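The two steps above can be sketched roughly as follows. This is an illustration under assumptions, not the actual workflow: the products/<name>/ layout, the tfplan- prefix, and all function names are hypothetical, since the real directory structure and naming scheme are not shown here.

```python
import subprocess
from pathlib import PurePosixPath


def git_changed_files(base_ref):
    """List files changed between the merge base with base_ref and HEAD."""
    out = subprocess.run(
        ["git", "diff", "--name-only", f"{base_ref}...HEAD"],
        check=True, capture_output=True, text=True,
    ).stdout
    return [line for line in out.splitlines() if line]


def changed_product_dirs(changed_files, root="products"):
    """Map changed file paths to the product directories that contain them."""
    dirs = set()
    for path in changed_files:
        parts = PurePosixPath(path).parts
        # Only files under <root>/<product>/... count as a product change.
        if len(parts) >= 3 and parts[0] == root:
            dirs.add(f"{parts[0]}/{parts[1]}")
    return sorted(dirs)


def plan_artifact_name(product_dir, pr_number):
    """Build a unique name from the directory path and the PR number."""
    # Slashes are replaced so the name is a single flat token.
    return f"tfplan-{product_dir.replace('/', '-')}-pr{pr_number}"
```

Because the name depends only on the directory path and the PR number, the Apply side can rebuild exactly the same name later without any shared state.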
We originally planned to upload to S3, but I learned that GitHub Actions has a place for uploading files called artifacts, so I decided to use that. There are limits on size, retention period, number of files, file name length, and so on, but I concluded that none of them would be a problem for us.
Uploading files is easy in GitHub Actions; I used upload-artifact.
Once the Plan result has been checked and there are no problems, Apply is run. As mentioned earlier, Apply runs when the user merges the PR. At that point, the unique path built from the PR number and other information is used to download the artifact, and Apply is run with the downloaded file against the product that Plan checked.
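The Plan-then-Apply handoff relies on Terraform's saved plan files: plan -out writes the plan to a file, and apply with that file as its argument applies exactly that plan without prompting. A minimal sketch of the two commands, assuming the plan file is what gets uploaded and downloaded as the artifact (the function names and the file name are hypothetical):

```python
def plan_command(product_dir, plan_file):
    """Command that writes the plan for one product to plan_file."""
    return [
        "terraform", f"-chdir={product_dir}",
        "plan", "-input=false", f"-out={plan_file}",
    ]


def apply_command(product_dir, plan_file):
    """Command that applies a previously saved (downloaded) plan file."""
    return [
        "terraform", f"-chdir={product_dir}",
        "apply", "-input=false", plan_file,
    ]
```

Each list can be passed to subprocess.run(cmd, check=True). Applying the saved file means that what gets applied after the merge is the plan that was reviewed in the PR.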
Implementing a health check
The health check is the workflow that runs Plan for all products once a day; it also has a workflow_dispatch trigger so that it can be started manually.
Furthermore, while previously only email notifications were sent when Plan/Apply failed, notifications are now also sent to a specific Slack channel, so the person in charge or the person who ran the workflow notices a failure immediately.
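How the Slack notification is sent is not described here. As one possible approach, the sketch below posts a message to a Slack incoming webhook, which accepts a JSON body with a text field. The webhook URL, the message format, and the function names are all assumptions.

```python
import json
import urllib.request


def build_failure_message(step, product_dir, run_url):
    """Build the JSON payload for a failed Plan or Apply."""
    return {"text": f"terraform {step} failed for {product_dir}\n{run_url}"}


def post_to_slack(webhook_url, payload):
    """POST the payload to a Slack incoming webhook and return the HTTP status."""
    req = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status
```

Including the URL of the failed workflow run in the message lets whoever sees the notification jump straight to the logs.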
Improvement results
Here are the actual values for the past month, posted as they are.
335.0 -> 100.0 (an improvement of about 70.15%)
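The improvement rate is the reduction divided by the original value. A short check of that arithmetic, with a hypothetical function name:

```python
def improvement_rate(before, after):
    """Percentage reduction from before to after."""
    return (before - after) / before * 100


print(round(improvement_rate(335.0, 100.0), 2))  # 70.15
```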
Apply still looks slow for the following reason. At the time of the improvement we were using EKS self-managed node groups, so no node group updates happened on the Terraform side. Now that we have introduced Karpenter, updating the managed node groups takes quite a long time, which is why the Max and Average values look large.
Conclusion
In this article, we introduced a case study on AmebaPlatform in which we used GitHub Actions to improve Terraform's Plan/Apply. Through this work, we reduced the execution time of Plan/Apply, which used to be slow, and improved productivity.
There seem to be other areas where productivity can be improved, so I hope to make some improvements and introduce them on my blog again.
SRG is looking for people to work with us.
If you're interested, please contact us here.