The first step of using AWS is to build your servers and your network by hand, using the management console. This works fine...at first. Then you'll end up forgetting how to build a certain pet, or your colleague will click the wrong option and take your system down, or you'll want to have two of something, and you'll realize that it's about time you controlled this mess in code.

At this point, you'll notice that AWS has several products for this, all of which seem to do almost the same thing. You've got:

  • CloudFormation, which is the combination of a JSON DSL to create templates and a service to realize them
  • OpsWorks, which is Chef with some AWS resources added into it
  • Elastic Beanstalk, which is AWS' version of Heroku

But you've also got Terraform, which is unlike the others in that it's not built by AWS.

I've been using CloudFormation for about a year and a half, having built my own set of libraries and tooling using Troposphere - so I'm working in Python, and then converting to JSON at the end. I've been happy with this, but several of my colleagues have gotten excited about Terraform - which suggested it was time to give it another look.

Having spent a day rebuilding a simplified version of our current infrastructure on it, some thoughts:

The Good

  • There was a lot less copy/paste than I expected. Python's list comprehensions are terse enough that I can condense many networking rules down into a few lines. While that's not an option here (no loops in a declarative language), the copy/paste required to get the same rules wasn't that annoying - there's a sketch of it after this list - and it would be hidden behind the first layer of modules.
  • Terraform's ability to undo things you've done in the console is outstanding. One of the weaknesses of CloudFormation is that it ignores any manual changes you've made to the infrastructure once it's been launched, whereas Terraform lets you reset any hand-configured settings back to what the code says.
  • New AWS products (like NAT Gateways) get added into Terraform before CloudFormation, since Terraform uses their vanilla APIs. On the downside, there are a few AWS features that aren't exposed outside of CloudFormation - so Terraform can't use them.
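
To make the copy/paste point concrete, here's roughly what those repeated rules look like in Terraform. The names, ports, and CIDR blocks are made up for illustration, and it assumes an aws_security_group named "web" defined elsewhere:

    # In Python/Troposphere this would be one list comprehension over
    # (port, cidr) pairs; here, each rule is a hand-written resource.
    resource "aws_security_group_rule" "web_http" {
      type              = "ingress"
      protocol          = "tcp"
      from_port         = 80
      to_port           = 80
      cidr_blocks       = ["10.0.0.0/16"]
      security_group_id = "${aws_security_group.web.id}"
    }

    resource "aws_security_group_rule" "web_https" {
      type              = "ingress"
      protocol          = "tcp"
      from_port         = 443
      to_port           = 443
      cidr_blocks       = ["10.0.0.0/16"]
      security_group_id = "${aws_security_group.web.id}"
    }

Two rules is tolerable; twenty is where that first layer of modules earns its keep.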

The OK

  • One of the big gripes we have with our in-house libraries is that building things takes a lot of proprietary knowledge. Once you've built up a tier of in-house Terraform modules, though, I suspect you'd end up in the same place.
  • The documentation is about on par with CloudFormation's, although I'd often find myself reading the docs for the underlying AWS API call to really understand what a particular argument meant.
  • Most of the cases where we change the system between production and development seem like they could be handled with variables (there's a sketch after this list). There are a few that don't have obvious solutions, but that could also be a sign that we shouldn't be doing that.
  • The workflow (and some of the issues around apply) strongly suggests that Terraform is designed for blue-green deploys, which aren't always the right answer.
  • Terraform tracks each resource by its full "path", including the module it lives in, so refactoring code into or out of a module - without changing any underlying infrastructure - still shows up in the plan as a change (e.g. aws_instance.web becoming module.app.aws_instance.web, which reads as a destroy and a create) that has to be applied so Terraform can "keep up". Coming from Troposphere, I was used to the code being an abstraction that generated a plan at the end - as opposed to the code being the plan.
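
To make the variables point concrete, here's roughly the pattern I have in mind - a sketch with made-up names and sizes, using a top-level map variable and the lookup() interpolation function:

    variable "environment" {}   # e.g. "production" or "development"
    variable "ami_id" {}

    variable "instance_sizes" {
      default = {
        production  = "m4.large"
        development = "t2.micro"
      }
    }

    resource "aws_instance" "app" {
      ami           = "${var.ami_id}"
      instance_type = "${lookup(var.instance_sizes, var.environment)}"
    }

Flip the one environment value at the top, and the same code produces a differently-sized system.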

The Bad

  • Several times I found that I had created a system that I couldn't destroy, which centered on a known issue with dependency cycles. Even following the suggested work-around - which seems like a hack - destroy often found cycles, forcing me to manually delete bits of the system through the console.
  • The design of the module system (and its tooling - I'm looking at you, terraform get) gets frustrating when you're trying to use modules for information-hiding and internal code reuse rather than as shareable units.
  • Not being able to pass maps to modules got old. I had hoped to use a pattern of passing a mapping (e.g. "environment type to instance size") and a variable (e.g. "environment type") to modules and letting them figure out - and hide - which value to use; instead, this had to be done at the caller level (there's a sketch of the workaround after this list).
  • When there's an error executing the plan Terraform generates - which happened more than I expected - it doesn't roll back what it was trying to do. This is intentional (and clearly stated), and while I like it intellectually, it puts me at a higher risk of leaving my production system in a half-changed state and then having to troubleshoot while I'm down. This is doubly worrying because the most common suggestion for recovering from a failed run is "just run it again and see what happens". I much prefer CloudFormation's automated rollbacks, which guarantee I end up in a known-good state after a failed change.
  • Terraform doesn't support UpdatePolicy, so you need a bit of a hack to gracefully rotate new instances into an auto-scaling group (the usual version is sketched after this list). This isn't their fault - that API isn't exposed by AWS - but the workaround isn't great, and doesn't really work on auto-scaling groups that aren't attached to load balancers.
  • I'm still not sure how I'd handle stateful things like RDS instances in a Terraform world. There are all sorts of reports that this litters your .tfstate with credentials (although I didn't try it personally), and I'm not confident that the planner wouldn't one day try to destroy the instance as a side-effect of some innocuous change (a partial guard for this is sketched after this list).
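
First, the map limitation: the caller-level workaround ends up looking something like this (the module path and names are hypothetical):

    variable "environment" {}

    variable "instance_sizes" {
      default = {
        production  = "m4.large"
        development = "t2.micro"
      }
    }

    # The lookup can't live inside the module, since module inputs only
    # accept plain strings - so every caller has to repeat it.
    module "app_tier" {
      source        = "./modules/app_tier"   # hypothetical local module
      instance_type = "${lookup(var.instance_sizes, var.environment)}"
    }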
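
Second, the rotation hack is usually written something like the sketch below (names are made up, and it assumes an aws_elb named "app" defined elsewhere). The trick is that any change to the launch configuration forces a brand-new auto-scaling group, and min_elb_capacity makes Terraform wait for the new instances to pass their ELB health checks before destroying the old group - which is exactly why it has nothing to offer an ASG without a load balancer:

    variable "ami_id" {}

    resource "aws_launch_configuration" "app" {
      # No name: Terraform generates a unique one, so the new launch
      # configuration can exist alongside the old during the swap.
      image_id      = "${var.ami_id}"
      instance_type = "t2.micro"

      lifecycle {
        create_before_destroy = true
      }
    }

    resource "aws_autoscaling_group" "app" {
      # Embedding the launch configuration's name here means any change
      # to it forces a replacement ASG rather than an in-place update.
      name                 = "app-${aws_launch_configuration.app.name}"
      launch_configuration = "${aws_launch_configuration.app.name}"
      availability_zones   = ["us-east-1a"]   # made up for the sketch
      min_size             = 2
      max_size             = 4
      load_balancers       = ["${aws_elb.app.name}"]
      health_check_type    = "ELB"

      # Don't consider the new ASG created until this many instances
      # are in service behind the ELB.
      min_elb_capacity = 2

      lifecycle {
        create_before_destroy = true
      }
    }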
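
And on the destroy fear specifically: Terraform does have a lifecycle guard that refuses to run any plan that would destroy a resource. It does nothing about the credentials sitting in .tfstate, but it at least turns "innocuous change quietly replaces the database" into a hard error. A minimal sketch, with made-up settings:

    variable "db_password" {}

    resource "aws_db_instance" "main" {
      identifier        = "main-db"
      engine            = "postgres"
      instance_class    = "db.t2.micro"
      allocated_storage = 10
      username          = "app"
      password          = "${var.db_password}"   # note: stored in .tfstate in the clear

      lifecycle {
        # If a plan would destroy this instance, Terraform errors out
        # instead of applying it.
        prevent_destroy = true
      }
    }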

So?

After all that, I'm still glad I spent the day on Terraform. I used to think it was a bad idea; now I think it's a good idea, albeit one that's not yet ready.

The product is miles better than when I tried it a year ago, and it has - when they work - some seriously compelling features. But would I use it in production to manage a network? Not yet.