Dramatic improvement in Terraform execution time on AmebaPlatform
My name is Taninari, and I work in the Service Reliability Group (SRG) of the Media Headquarters.
SRG (Service Reliability Group) mainly provides cross-sectional support for the infrastructure of our media services: improving existing services, launching new ones, contributing to OSS, and so on.
This article introduces some of the work we have done on GitHub Actions to shorten the execution time of Terraform.
- First
- Technology (service) used this time
- Timing of terraform Plan and Apply execution for each environment
- Understanding the current situation
- Cause analysis and correction plan
- Implementation of Plan/Apply on GitHub Actions
- Implementing health check
- Improvement results
- Conclusion
First
On AmebaPlatform, many requests required running terraform apply, such as changing resources like security groups and ALB listener rules, or upgrading resources. For a long time, these operations took every user a lot of time.
In this article, we will introduce some of the improvements we have made to terraform running on GitHub Actions to improve this situation.
Technology (service) used this time
- terraform
- GitHub Actions
Timing of terraform Plan and Apply execution for each environment
Basically, when you submit a Pull Request (PR) for a branch for a specific environment, the Plan process is run, other people review it, and once it is merged, the Apply process is run.
For example, if you want to run Plan for the develop environment, you submit a PR to the develop branch, check the contents of the plan, and if there are no problems, merge it so that Apply runs.
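The branch-based flow described above can be sketched as a small decision function. This is only an illustration: `BRANCH_TO_ENV` and `decide_action` are hypothetical names, and only the develop branch is taken from the article. It assumes a PR triggers the `pull_request` event and a merge triggers a `push` event on the target branch.

```python
# Hypothetical mapping from branch name to environment.
# Only "develop" appears in the article; add other branches as needed.
BRANCH_TO_ENV = {"develop": "develop"}


def decide_action(event_name, branch):
    """Return ("plan" | "apply", environment), or None if nothing should run.

    event_name is the GitHub Actions event ("pull_request" or "push");
    branch is the PR base branch or the branch that received the push.
    """
    env = BRANCH_TO_ENV.get(branch)
    if env is None:
        return None  # Not an environment branch.
    if event_name == "pull_request":
        return ("plan", env)  # A PR against the branch runs Plan.
    if event_name == "push":
        return ("apply", env)  # A merge into the branch runs Apply.
    return None
```

For example, `decide_action("pull_request", "develop")` selects Plan for the develop environment, and the same branch with `"push"` selects Apply.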
Understanding the current situation
My impression was that Plan/Apply took a lot of time, so I first measured how long they currently take, in order to be able to quantify the improvement afterwards. I called the GitHub API from a Python script to get the Plan and Apply execution times for one month. Below are the results I got from the script.
From the results above, we can see that both Plan and Apply took more than 5 minutes to execute, even looking at the median values.
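A measurement like the one described above could look roughly like the following. This is not the script actually used. It assumes the GitHub REST API endpoint that lists the runs of a workflow, and it approximates each run's duration as `updated_at` minus `run_started_at`. The `owner`, `repo`, `workflow_file`, and `token` parameters are placeholders, and pagination is omitted.

```python
import json
import statistics
import urllib.parse
import urllib.request
from datetime import datetime


def parse_ts(value):
    """Parse an ISO 8601 timestamp such as '2024-01-01T00:00:00Z'."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def run_durations(runs):
    """Return the duration in seconds of each completed workflow run."""
    durations = []
    for run in runs:
        if run.get("status") != "completed":
            continue  # Skip queued or in-progress runs.
        started = parse_ts(run["run_started_at"])
        finished = parse_ts(run["updated_at"])  # Approximation of the end time.
        durations.append((finished - started).total_seconds())
    return durations


def summarize(durations):
    """Summarize durations with the statistics discussed in the article."""
    return {
        "count": len(durations),
        "median": statistics.median(durations),
        "mean": statistics.mean(durations),
        "max": max(durations),
    }


def fetch_runs(owner, repo, workflow_file, token, created):
    """Fetch one page of runs for a workflow (pagination omitted).

    created is a date range filter such as '2024-01-01..2024-01-31'.
    """
    query = urllib.parse.urlencode({"per_page": 100, "created": created})
    url = (
        f"https://api.github.com/repos/{owner}/{repo}"
        f"/actions/workflows/{workflow_file}/runs?{query}"
    )
    request = urllib.request.Request(url, headers={
        "Authorization": f"Bearer {token}",
        "Accept": "application/vnd.github+json",
    })
    with urllib.request.urlopen(request) as response:
        return json.load(response)["workflow_runs"]
```

Calling `summarize(run_durations(fetch_runs(...)))` once for the Plan workflow and once for the Apply workflow would give the median, mean, and maximum for each.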
Cause analysis and correction plan
I had a hunch about the cause: because the number of products was small at first, Plan/Apply had been set up to run for all products every time. Naturally, as the number of products increased, so did the execution time of Plan/Apply. Furthermore, it was a big problem when differences appeared in parts that I had not changed, because I had to investigate and resolve them.
So, as the fix, we decided to change the workflow so that Plan/Apply are executed only for the products that have been changed on GitHub.
However, with this method, we may miss changes that we would previously have noticed by running Plan for everything. In addition, for products that receive very few updates, Plan may not run for a long time, so it may turn out not to work when we finally try to run it.
To avoid discovering this only when a failure occurs, we decided to create a workflow that runs Plan for all products once a day.
Implementation of Plan/Apply on GitHub Actions
Simply put, terraform has the following directory structure.
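The directories that have changed are identified with git diff, and Plan is run only for those directories.
For each Plan result, we create a unique file name that includes the directory path, the PR number, etc., and upload the file.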
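The two steps above, finding the changed directories and naming the uploaded file, might be sketched as follows. The layout `products/<name>/...` and the `tfplan-` prefix are assumptions made for illustration, not the actual repository structure, and all function names are hypothetical. Path separators are replaced in the name because artifact names cannot contain characters such as `/`.

```python
import re
import subprocess
from pathlib import PurePosixPath


def changed_files(base_ref, head_ref="HEAD"):
    """List files changed between two refs using git diff --name-only."""
    result = subprocess.run(
        ["git", "diff", "--name-only", f"{base_ref}...{head_ref}"],
        check=True, capture_output=True, text=True,
    )
    return [line for line in result.stdout.splitlines() if line]


def changed_product_dirs(paths, root="products"):
    """Map changed file paths to product directories.

    Assumes a hypothetical layout of <root>/<product>/<files...>.
    Files outside that layout are ignored.
    """
    dirs = set()
    for path in paths:
        parts = PurePosixPath(path).parts
        if len(parts) >= 3 and parts[0] == root:
            dirs.add(f"{parts[0]}/{parts[1]}")
    return sorted(dirs)


def artifact_name(product_dir, pr_number):
    """Build a unique artifact name from the directory path and PR number."""
    # Replace anything other than letters, digits, '.', '_' and '-'.
    safe_dir = re.sub(r"[^A-Za-z0-9._-]+", "-", product_dir).strip("-")
    return f"tfplan-{safe_dir}-pr{pr_number}"
```

With this naming, the Apply side can rebuild the same name from the directory and PR number alone, without any extra bookkeeping.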
The upload location was originally going to be S3, but I was told that GitHub Actions has a place for uploading files called artifacts, so I decided to use that. There are restrictions on size, retention period, number of files, file name length, etc., but I determined that none of them would be a problem.
By the way, uploading files is easy in GitHub Actions; I used the upload-artifact action.
Check the Plan results and if there are no problems, run Apply. As mentioned earlier, Apply runs when the user performs a merge. When Apply runs, a unique path created using the PR number etc. is used to download from the artifacts. Apply is run on the products that have been checked by Plan using the downloaded files.
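The pairing of Plan and Apply through a saved plan file can be sketched like this. It is an assumption-laden illustration rather than the actual workflow: the helper names are hypothetical, the artifact download step is left out, and absolute plan file paths are assumed so that `-chdir` does not change their meaning.

```python
import subprocess


def plan_command(product_dir, plan_path):
    """Command that runs Plan in product_dir and saves the result to plan_path."""
    return ["terraform", f"-chdir={product_dir}", "plan",
            "-input=false", f"-out={plan_path}"]


def apply_command(product_dir, plan_path):
    """Command that applies exactly the saved plan, without re-planning."""
    return ["terraform", f"-chdir={product_dir}", "apply",
            "-input=false", plan_path]


def apply_saved_plans(plans, runner=subprocess.run):
    """Apply each downloaded plan file.

    plans maps a product directory to the absolute path of its plan file.
    runner is injectable so the function can be exercised without Terraform.
    """
    for product_dir, plan_path in sorted(plans.items()):
        runner(apply_command(product_dir, plan_path), check=True)
```

Applying the saved plan file rather than planning again means that only what was reviewed in the PR gets applied.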
Implementing health check
The health check workflow runs Plan for all products once a day, and it can also be started manually with workflow_dispatch.
Furthermore, while previously only email notifications were sent when Plan/Apply failed, we have improved this so that failures are also notified to a specific Slack channel. This allows the person in charge or the person who ran Plan/Apply to notice the failure immediately.
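A health check with Slack notification could be sketched as follows. This is one possible shape under stated assumptions, not the actual implementation. It assumes Plan is run with `-detailed-exitcode` (exit code 0 for no changes, 1 for an error, 2 for pending changes) and that the notification goes through a Slack incoming webhook. The function names and the message wording are hypothetical.

```python
import json
import urllib.request


def classify_plan_exit_code(code):
    """Interpret the exit code of terraform plan -detailed-exitcode."""
    return {0: "no_changes", 1: "error", 2: "changes"}.get(code, "unknown")


def build_slack_message(results, run_url):
    """Build a webhook payload, or return None if every product is clean.

    results maps a product directory to its Plan exit code.
    run_url is a link back to the workflow run.
    """
    problems = {}
    for product_dir, code in results.items():
        status = classify_plan_exit_code(code)
        if status != "no_changes":
            problems[product_dir] = status
    if not problems:
        return None
    lines = [f"Terraform health check: {len(problems)} product(s) need attention"]
    lines += [f"- {d}: {s}" for d, s in sorted(problems.items())]
    lines.append(f"Run: {run_url}")
    return {"text": "\n".join(lines)}


def post_to_slack(webhook_url, message):
    """Send the payload to a Slack incoming webhook and return the HTTP status."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(message).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status
```

Reporting pending changes as well as errors is a design choice in this sketch: it surfaces unexpected differences in products nobody touched, which is the situation the daily Plan is meant to catch.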
Improvement results
Below are the actual, unadjusted values for the past month.
335.0 -> 100.0 (improvement rate: approx. 70.15%)
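The improvement rate quoted above is the reduction relative to the original value. A quick check of the arithmetic, with a hypothetical helper name:

```python
def improvement_rate(before, after):
    """Return the percentage reduction from before to after."""
    return (before - after) / before * 100


# (335.0 - 100.0) / 335.0 * 100 = 70.149...
print(f"{improvement_rate(335.0, 100.0):.2f}%")  # prints 70.15%
```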
Apply still appears to take a long time because of a change in our EKS setup. At the time of the improvement, we were using EKS self-managed node groups, so Terraform did not update any node groups. Now that we have introduced Karpenter, Terraform updates managed node groups, which takes quite a long time, and that is why the Max and Average values look large.
Conclusion
In this article, we introduced a case study of how AmebaPlatform improved Plan/Apply using GitHub Actions. Through this work, we were able to reduce the time required to execute Plan/Apply, which previously took a long time, and improve productivity.
It seems there are other areas where productivity can be improved, so I hope to make some improvements and introduce them on my blog again.
SRG is looking for people to work with us. If you are interested, please contact us here.