Terraform: Consolidating 84 tfstates and 5 repositories into a monorepo, and operating Terraform with Atlantis

I'm Hasegawa (@rarirureluis).
#SRG (Service Reliability Group) is a group that mainly provides cross-cutting support for the infrastructure of our media services: improving existing services, launching new ones, and contributing to OSS.
In this article, I revisit how we operate Terraform.
 
 

Before becoming a monorepo


The Terraform code for this service was split into completely separate directories for each environment and resource.
 
There were five repositories with this directory structure, and five AWS accounts.
We won't go into the pros and cons of this directory structure here. Our goal was to reduce toil by consolidating all of it into a single repository and managing CI in one place.
 

How do you run Terraform?


For this service, Terraform was run from GitHub Actions or CodeBuild.
In that workflow, Plan runs when a PR is created, and if the plan looks fine, the PR is merged and Apply runs.

Patterns where Apply fails

Plan can succeed while Apply still fails.
This happens, for example, when deleting an ECR repository that is not empty, or a resource such as an EC2 instance or ALB that has deletion protection enabled.
Whenever this happened, we had to repeat the whole cycle, PR → Plan → Approve → Merge (Apply), just to run Plan/Apply again.
This is toil that SREs cannot ignore.
 

Atlantis should have been introduced sooner...


Atlantis is a CI tool for Terraform.
 

Problems Atlantis solves

  • Prevents patterns where Apply fails
    • With IssueOps and branch deployments, you can ensure that merged changes have already been applied correctly.
  • Prevents conflicting state updates
    • You can see a list of the state changes being made by currently open PRs.
Introducing Atlantis solves many problems, including these two.
 

Prevents patterns where Apply fails

In Atlantis, every operation is performed through comments on the issue (PR); to apply, for example, you comment "atlantis apply".
Because Apply runs from the PR before it is merged, you no longer hit the case where Apply fails after merging and you have to create another PR to Plan/Apply again.
In other words, a successful "atlantis apply" means the merged changes have been applied correctly. This technique is called "branch deployment."
GitHub also recently introduced it on its official blog.
 

Prevents conflicting state updates

Atlantis takes a lock on each state that a PR changes.
If another PR then tries to change the same state, Atlantis returns an error saying that it is already locked.
Current locks can be viewed in the Atlantis web UI, which also shows the Terraform execution logs.
 

How Atlantis is built

We built Atlantis with the official Atlantis Terraform module.
 

How Atlantis works


Atlantis works as follows.
It is triggered by webhooks from GitHub (other Git hosting services are also supported), and it communicates back to GitHub through the GitHub API.
You therefore need a GitHub App or a Personal Access Token to set up Atlantis.
 
In our Git workflow, Plan runs when a PR is created.
Apply requires at least one approval, and once Apply completes, the PR is closed and its branch is deleted automatically.
Policies like these can be defined in detail in atlantis.yaml.
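
As a rough illustration, the policies described above could be written in a repo-level atlantis.yaml along the following lines. This is a sketch under assumptions, not our actual file: the project name and directory are hypothetical, and the Atlantis server must be configured to let the repository override `apply_requirements` and `delete_source_branch_on_merge`.

```yaml
# atlantis.yaml at the repository root (illustrative sketch only)
version: 3
automerge: true                      # merge the PR once all of its plans have been applied
delete_source_branch_on_merge: true  # delete the branch after that automatic merge
projects:
  - name: prod-ecr                   # hypothetical project name
    dir: prod/ecr                    # hypothetical directory in the monorepo
    autoplan:
      enabled: true                  # run Plan automatically when the PR touches this project
      when_modified: ["*.tf", "*.tfvars"]
    apply_requirements: [approved]   # require at least one approval before Apply
```

With a file like this, Plan runs automatically when the PR is opened, and commenting "atlantis apply" is rejected until someone approves the PR.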
 

One Atlantis to multiple AWS environments

Terraform supports Assume Role, so we configured a single Atlantis to operate on each AWS account by assuming a role in that account.
 

Tips


Here are some other nice features beyond those already mentioned.

Terraform version can be specified per directory

This is a very useful feature.
Everything was already completely separated before the monorepo, and thanks to this feature we did not have to unify Terraform versions when migrating to it.
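
For illustration, pinning a different Terraform version per directory might look like the sketch below, using the `terraform_version` key of a project in atlantis.yaml. The project names, directories, and version numbers are hypothetical and only show the shape of the setting.

```yaml
# atlantis.yaml: each project (directory) pins its own Terraform version (illustrative sketch)
version: 3
projects:
  - name: dev-vpc              # hypothetical project
    dir: dev/vpc               # hypothetical directory
    terraform_version: v1.3.9  # hypothetical version used only for this directory
  - name: prod-ecs             # hypothetical project
    dir: prod/ecs              # hypothetical directory
    terraform_version: v1.5.7  # a different hypothetical version for this one
```

Each directory can then be upgraded on its own schedule after the migration instead of all at once.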

parallel_plan and parallel_apply

Atlantis automatically detects which directories a PR changes, but if you add 40 directories at once, as we did this time, Plan and Apply take a considerable amount of time.
Enabling the options to run them in parallel shortens the wait considerably.
💡
Running 40 or more at once can cause timeouts. If that happens, try increasing the vCPU or memory of the Atlantis task.
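
The two options are top-level keys in the repo-level atlantis.yaml. A minimal sketch, with the rest of the file omitted:

```yaml
# atlantis.yaml: run Plan and Apply for the changed projects in parallel (illustrative sketch)
version: 3
parallel_plan: true   # plan the changed projects concurrently
parallel_apply: true  # apply the planned projects concurrently
```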
 

Conclusion


Atlantis is open source, the Terraform module used to build it is flexible, and Atlantis itself is simple by design, so it can easily be rebuilt even while it is in production use.
We already had a Terraform CI environment in place, but migrating to Atlantis brought almost nothing but advantages. The only disadvantages were:
  • The monthly cost is slightly higher than with GitHub Actions/CodeBuild.
  • The work of migrating to a monorepo was hellish.
As for infrastructure cost, a Fargate task with 0.5 vCPU/1 GB is perfectly sufficient, so I think the cost is negligible.
We were a little late to adopt Atlantis ourselves, but why not take this opportunity to consider adopting it in your own work?
SRG is looking for people to work with us. If you're interested, please contact us here.
 
SRG runs a podcast where we chat about the latest hot topics in IT technology and books. We hope you'll listen to it while you work.