Zero-downtime deploys - the ability to release a new version of your code to production without taking the site down - are a key component of continuous delivery.

In the early days of AWS, this was a pain to do; auto-scaling groups (ASGs) didn't play that well with ELBs, and you often ended up having to build tooling to do what seemed like a simple operation - "Update the instances in this ASG to use the launch configuration it now has". In theory, this changed with the release of the UpdatePolicy attribute on an ASG. From the blog post:

Today’s new feature allows you to perform a rolling deployment of an Auto Scaling Group within a CloudFormation stack. Instead of updating all of the instances in a group at the same time, you can now replace or modify the instances in a step-by-step fashion...This feature will increase availability of your application during an update.

The part before the ellipsis - updating instances in a step-by-step fashion - works quite well. The part after - increasing availability - doesn't, unless you use this in a non-obvious way. Before explaining why, I'll step back a second and explain how I expected this to work.

Let's assume I'm building a feature to increase availability during a deploy. My building blocks are:

  1. A load-balancer with a robust system health-check mechanism
  2. A scaling system which knows how to use the health-check of a load balancer to decide if an instance it's managing is healthy or not
  3. A way to configure the scaling system to gradually add new instances into the scaling pool, and rotate the old ones out
  4. A desire to build something called Rolling Deployments of Auto Scaling Groups

My expectation would be that the scaling system (#2) would make full use of the health-check (#1) when deciding if a new instance has been launched successfully, before moving on (#3) to launching the next one. This would be a rolling deploy (#4), and one that works very hard to "increase availability during an update".

In concrete terms: when rolling the contents of an ASG in an operation governed by an UpdatePolicy, I'd expect the ASG to use the full power of the ELB's health-check to decide when to move on to the next instance.
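To make that expectation concrete, here is a minimal sketch of the loop I had in mind. It is not how CloudFormation is implemented; `launch_replacement` and `passes_elb_health_check` are hypothetical stand-ins for the real AWS operations, and a real version would sleep between checks.

```python
from typing import Callable, List


def expected_rolling_update(
    old_instances: List[str],
    launch_replacement: Callable[[], str],
    passes_elb_health_check: Callable[[str], bool],
    max_checks: int = 10,
) -> List[str]:
    """Replace instances one at a time, gated on the ELB health check.

    Both callables are hypothetical stand-ins for AWS operations.
    Returns the instances in service once the roll completes.
    """
    in_service = list(old_instances)
    for old in old_instances:
        new = launch_replacement()
        # Gate on the load balancer's view of the instance, not on "booted".
        healthy = any(passes_elb_health_check(new) for _ in range(max_checks))
        if not healthy:
            # Stop the roll; the remaining old instances keep serving traffic.
            raise RuntimeError(f"{new} never passed the ELB health check")
        in_service.append(new)
        in_service.remove(old)
    return in_service
```

The important design choice is that an old instance is only removed after its replacement has passed the load balancer's check, so a bad AMI stops the roll instead of emptying the pool.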

This doesn't happen. If you expect it to, you'll likely have downtime when doing a CloudFormation UPDATE which changes the AMI used in an ASG.

It works how?

What happens instead is that the ASG adds the instance to the ELB as soon as it reaches the InService state (meaning the instance has booted). The ASG then ignores its own setting for HealthCheckGracePeriod, and starts counting the health-check failures. This means that if the time your health-check will tolerate a "bad" instance is less than the time it takes for your instance to be ready to serve requests, the ASG will drop that newly-created instance. While this is happening, the UpdatePolicy rolling update will continue - which could very easily leave you with no working instances in the ELB.
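The race described above can be put in rough numbers. This is a back-of-the-envelope sketch, assuming the time the ELB tolerates a "bad" instance is approximately the health-check interval multiplied by the unhealthy threshold; the function names are mine, and real behaviour also depends on the check timeout and on when the first probe lands.

```python
def elb_tolerance_seconds(interval: int, unhealthy_threshold: int) -> int:
    """Approximate time the ELB keeps probing before marking an instance unhealthy.

    Assumption: tolerance is roughly interval * unhealthy_threshold.
    """
    return interval * unhealthy_threshold


def survives_rolling_update(
    time_to_ready: int, interval: int, unhealthy_threshold: int
) -> bool:
    """True if the instance is ready to serve before the ELB gives up on it."""
    return time_to_ready < elb_tolerance_seconds(interval, unhealthy_threshold)
```

For example, with a 30-second interval and an unhealthy threshold of 2, the tolerance is about 60 seconds, so an application that needs three minutes to start serving would be dropped during the roll.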

As AWS support puts it:

Although you are using an ELB health-check, unfortunately as part of the rolling update process CloudFormation does not check that the instance has been marked as in-service behind the ELB. I do agree that when an environment is configured to use an ELB health-check, just relying on the fact that the instance has been added to the ASG is not a sufficient success criteria for rolling updates. There is currently a feature request logged with regard to this...there is currently no ETA.

And subsequently:

As soon as that new instance is started up and marked as in service in the AutoScaling group (fairly quickly), the ELB starts its health-checks.

What makes this even more frustrating is that all these pieces work together perfectly on non-CloudFormation auto-scaling operations. If one of my instances crashes the ASG will spin up a new one, add it to the ELB, and respect the HealthCheckGracePeriod during start-up.

Way to violate the principle of least-surprise there, AWS.

What now?

Getting past the disappointment that this doesn't work the way I'd expect, we still need a zero-downtime deploy solution.

The easiest thing to do is to extend the value for PauseTime in your UpdatePolicy so that it is longer than the time an instance takes to move from "booted" to "ready to serve web requests". Here, PauseTime governs the wait between adding in new instances and removing the old ones. That transition from "booted" to "serving" still needs to take less time than your "bad-instance" timeout, but if it doesn't, you'll still have the old instances running to serve requests that come in.
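As a sketch of what that looks like, the snippet below builds an UpdatePolicy fragment whose PauseTime is the instance's time-to-ready plus a safety margin. PauseTime takes an ISO 8601 duration such as PT5M30S. The helper names, the batch sizes, and the 60-second margin are my assumptions, not values from any AWS template.

```python
import json


def iso8601_duration(seconds: int) -> str:
    """Format a number of seconds as an ISO 8601 duration, e.g. 330 -> PT5M30S."""
    hours, rest = divmod(seconds, 3600)
    minutes, secs = divmod(rest, 60)
    units = ((hours, "H"), (minutes, "M"), (secs, "S"))
    parts = "".join(f"{value}{unit}" for value, unit in units if value)
    return "PT" + (parts or "0S")


def rolling_update_policy(time_to_ready: int, margin: int = 60) -> dict:
    """Build an UpdatePolicy fragment that pauses longer than the app start-up."""
    return {
        "AutoScalingRollingUpdate": {
            # Illustrative values: roll one instance at a time, keep one serving.
            "MinInstancesInService": 1,
            "MaxBatchSize": 1,
            "PauseTime": iso8601_duration(time_to_ready + margin),
        }
    }


if __name__ == "__main__":
    # Emit the fragment as JSON, ready to paste under the ASG resource.
    print(json.dumps({"UpdatePolicy": rolling_update_policy(180)}, indent=2))
```

Measure the time-to-ready on a real instance rather than guessing; the margin only helps if the baseline is accurate.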

The "official" solution is to use WaitOnResourceSignals, which pauses the rolling update until the instance uses cfn-signal to signal that it's in a good state - here's their example template.
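The instance-side half of the signal-based approach can be sketched as "poll until healthy, then signal". This is an illustration under assumptions, not the AWS example: `is_healthy` is a hypothetical callable that would wrap whatever check you trust (for instance, asking the ELB about this instance), and the stack, resource, and region values are placeholders. The cfn-signal flags used (`-e`, `--stack`, `--resource`, `--region`) are the standard ones.

```python
import time
from typing import Callable, List


def wait_until_healthy(
    is_healthy: Callable[[], bool],
    timeout: float = 600.0,
    interval: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll is_healthy until it returns True or the timeout elapses."""
    deadline = clock() + timeout
    while True:
        if is_healthy():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


def cfn_signal_command(
    success: bool, stack: str, resource: str, region: str
) -> List[str]:
    """Build the cfn-signal invocation; exit code 0 means success."""
    return [
        "cfn-signal",
        "-e", "0" if success else "1",
        "--stack", stack,
        "--resource", resource,
        "--region", region,
    ]
```

On a real instance you would pass the resulting list to `subprocess.run`, with the ASG's logical ID as the resource. Sending an explicit failure signal on timeout lets the update fail quickly instead of waiting out the full PauseTime.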

In closing

This whole piece could be summed up with "it's bad that CloudFormation UpdatePolicy rolling updates don't respect the ELB health check", and it really is. I can't imagine the product decision that got AWS here, and I'm not staying up late to wait for the fix.

There is something funny in this, though. If you read the example template for how to work around this with cfn-signal, their example cloud-init configuration won't send the OK signal until...a check on the instance's ELB health (using the AWS CLI) comes back successful.

It's almost as if they knew how this whole thing should fit together, and then built something else.