How it all started

A few months back we were struggling with a migration of some of our Elasticsearch clusters to Amazon Elasticsearch Service. One part of the project was to have a neat way of setting up, configuring and maintaining this managed part of our infrastructure, including its side-cars and utilities. Until then we had been using Ansible for configuration management and provisioning, but the shift from self-hosted to managed services made us rethink the problem and take a fresh start. One of my colleagues suggested that Terraform might be useful here, and we decided to give it a try.

We decided to use this project as an opportunity to evaluate Terraform as an Infrastructure as Code solution for our infrastructure needs. After some time of eager and chaotic coding, we decided to make the effort more official. The following three statements then defined our goal and motivation:

  • Every engineer at Base should be able to manage their infrastructure systems
  • This process should be predictable, auditable and secure
  • An Infrastructure as Code approach combined with a GitHub workflow is the most natural way to achieve this

What is Terraform?

Following Terraform’s official documentation: “Terraform is a tool for building, changing, and versioning infrastructure safely and efficiently”. What’s worth mentioning is its simplicity and the very descriptive approach it takes to the resources it manages. It also provides built-in mechanisms that make collaborative work possible. Its JSON-compatible syntax, HCL, is easy for both humans and computers to read and edit.

How to Terraform?

It took us some time until we were ready to let everyone into our Terraform code repository and start making changes. We had to take care of a few important aspects to make this work in an environment of 100+ engineers. This is how we approached some of the most common problems Terraform users face:

State management

It quickly became obvious that we needed to store state files in a remote location. We initially committed them to the repository, but that is a bad idea for a number of reasons: state files may contain secrets in plain text, and concurrent changes quickly lead to conflicts and stale state. Fortunately, Terraform provides an easy way to store its state on S3, using its server-side encryption capabilities:

terraform {
  backend "s3" {
    bucket     = "<bucket name>"
    key        = "<unique sub-project state file path>"
    region     = "<region>"
    ...
    encrypt    = true
    kms_key_id = "<KMS key arn>"
  }
}

The above snippet shows how this can be configured in any Terraform project. After adding the backend block, running terraform init will offer to migrate any existing local state to S3.
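
The state bucket itself can be described in Terraform as well. Here is a minimal sketch, with versioning enabled so that previous state revisions are kept:

resource "aws_s3_bucket" "terraform-state" {
  bucket = "<bucket name>"
  acl    = "private"

  # keep a history of state files in case a rollback is ever needed
  versioning {
    enabled = true
  }
}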

Locking mechanism

It does not sound right to execute two concurrent updates on the same part of the infrastructure. Terraform addresses this common problem as well, leveraging Amazon’s DynamoDB for locking. You just need to add one line to your S3 backend configuration:

terraform {
  backend "s3" {
    ...
    dynamodb_table = "<DynamoDB table name>"
    ...
  }
}
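
The referenced DynamoDB table must have a primary key named LockID of type string. The lock table itself can be managed with Terraform too; a minimal sketch, assuming its name matches the one in the backend configuration:

resource "aws_dynamodb_table" "terraform-locks" {
  name           = "<DynamoDB table name>"
  read_capacity  = 1
  write_capacity = 1
  hash_key       = "LockID"

  # the S3 backend stores its lock entries under this string key
  attribute {
    name = "LockID"
    type = "S"
  }
}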

This simple solution is enough to prevent concurrent execution of Terraform commands on the same resource set, but it might not be enough for a big team working on the same Terraform codebase. We will get back to this in the Workflow section.

Privileges

If your infrastructure is spread across many AWS accounts, there is only one way to avoid IAM hell and maintain a single “source of truth” about your users. The answer is, obviously, assumed roles. Each of our Terraform sub-projects uses its own specific IAM role to authenticate:

provider "aws" {
 region = "<region>" 
 assume_role {
   role_arn = "<sub-project specific role arn>"
   session_name = "${var.atlantis_user}"
 }
}

The same rule applies to accessing remote state:

terraform {
  backend "s3" {
    ...
    role_arn = "<state access role arn>"
    ...
  }
}

Assumed roles allowed us to explicitly define and implement cross-account dependencies in Terraform. They also gave us granular control over which actions and resources are allowed within which scope, as well as over who can actually execute Terraform actions.
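
As an example of such a cross-account dependency, one sub-project can read another one’s outputs from its remote state, assuming the proper state access role. A minimal sketch (the ‘vpc_id’ output is hypothetical):

data "terraform_remote_state" "other" {
  backend = "s3"

  config {
    bucket   = "<bucket name>"
    key      = "<other sub-project state file path>"
    region   = "<region>"
    role_arn = "<state access role arn>"
  }
}

# outputs of the other sub-project can then be referenced,
# e.g. "${data.terraform_remote_state.other.vpc_id}"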

Secrets

This is probably the problem that pops up most frequently, on many layers of the IT world. Once again, our answer was to utilize what AWS provides and Terraform handles well: KMS-encrypted secure strings, used directly or via AWS Systems Manager Parameter Store. The secret itself can be defined in Terraform the following way (the string must first be encrypted with the proper KMS key, e.g. using the aws kms encrypt command from aws-cli):

data "aws_kms_secret" "hidden" {
 secret {
   name = "text"
   payload = "<KMS encrypted payload>"
 }
}

That data source can be used directly, but in case the secret should appear in more than one place or in a different sub-project, we store it in the Parameter Store:

resource "aws_ssm_parameter" "hidden-text" {
 name = "/hidden/text"
 type = "SecureString"
 value = "${data.aws_kms_secret.hidden.text}"
}

A secret stored this way can then be referenced by name anywhere in your code, as long as there is access to the KMS key used for encryption. Keep in mind that secrets end up unencrypted in state files, but those are server-side encrypted on S3.
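
For example, another sub-project can read the secret back by name with the ‘aws_ssm_parameter’ data source (a minimal sketch; SecureString parameters are decrypted by default):

data "aws_ssm_parameter" "hidden-text" {
  name = "/hidden/text"
}

# the decrypted value is available as
# "${data.aws_ssm_parameter.hidden-text.value}"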

Workflow

It was not an option for us to let everyone assume any Terraform-related role and run Terraform locally. This would create a huge mess of non-master changes being applied to the infrastructure, not to mention a huge risk of breaking things in a big way. This is where Atlantis helped us a lot. This simple yet powerful service consumes GitHub webhooks and reacts to PR comments like ‘atlantis plan’ by executing Terraform actions in the background and printing the output back as a comment. It provides PR-level locking, which ensures that no concurrent changes are triggered until the current plan is applied and merged into master. Changes can be applied only when the PR is approved. This workflow has a lot in common with software development best practices and is clear to everyone involved.
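
A typical change then goes roughly like this:

  • Open a pull request with the Terraform change
  • Comment ‘atlantis plan’ – Atlantis runs the plan and posts its output back as a comment
  • Get the PR reviewed and approved
  • Comment ‘atlantis apply’ – Atlantis applies the plan and reports the result
  • Merge the PR, which releases the lock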

Impact

We started using Terraform half a year ago, and so far our repository has seen about 300 closed PRs, created by 25+ developers who do not work directly on infrastructure. We prefer to make changes via code rather than through the AWS console. All of our Amazon services can now be covered with Terraform, and that is what we’re aiming at. We’ve also started using Terraform code to describe non-AWS parts of our infrastructure, like Datadog monitors or GitHub organizations. These few months with Terraform have effectively increased our overall infrastructure awareness at Base.

What’s next?

Even though the Terraform project was a success and the adoption is satisfying, there are plenty of things we’d love to improve. We want to have ready-to-use versioned modules which will allow us to achieve 100% Infrastructure as Code coverage. We want to integrate our Terraform setup with an external configuration management solution, to share state with the other tools we use. Testing Terraform code is still an open question for us, as is CI/CD for it. Last but not least, we want to implement a system capable of detecting Terraform state drift caused by either manual changes to the infrastructure or Terraform plans executed from branches other than master.

Summary

This article was written to share some of our experiences with adopting Terraform at Base and making it work at scale. We believe in the Infrastructure as Code concept and love automation. Terraform’s simplicity and ease of use are beautiful but can be problematic at the same time, so it’s important to address all of the potential issues at the very beginning of the project to keep development pace and code quality high! KEEP CALM AND TERRAFORM ALL THE THINGS :)

Posted by

Szymon Władyka
