On "AWS Best Practices for DDOS Resiliency"

2019 Update: AWS has updated the whitepaper since this post was written, and has introduced a few new features that make life easier - like NAT Gateways and a less-bad interface to CloudWatch logs. Please cross-check this post with more up-to-date resources before making decisions.

Amazon recently published a set of best practices for building your AWS environment so that it can handle DDOS attacks. While the document is a great start and should be required reading when designing an environment, it's lacking in a few meaningful places. Each section below corresponds to the guideline of the same name in the whitepaper.

Minimize the Attack Surface Area

Or put another way, "put the same diligence into designing your AWS network as you would a physical network." This is a great point and one which shouldn't be ignored. However, the only pieces of advice in this section boil down to "use a VPC" and "use one of the four VPC scenarios", which is about as standard as AWS advice gets.

So which scenario should you use?

AWS will rarely answer this question directly, because answering it in a way that makes sense for all systems would be really hard. But if you make a few simplifying assumptions - like I will momentarily - the answer becomes much easier. We'll assume:

You're building a pretty standard web application - some app servers, a few workers for async processing, a cache or two, and a database.
You don't need a hardware VPN, because your entire application will live inside AWS (or the parts outside can be reached via HTTPS/TLS).

That leaves scenarios one and two, which could be summarized as "give everything a public IP"^[1] and "have a private subnet where your app servers & workers don't have public IPs"^[2]. In an ideal world, the second solution is better. You can't attack a machine which doesn't have a world-routable IP, you can monitor everything going out of the private network on the NAT, security groups are straightforward, etc. All of these points are true, but they miss one key thing (which I feel AWS does a poor job of highlighting):

AWS NATs aren't first-class objects.

The way they're mentioned in passing and have their own icon in AWS drawings^[3], you could be led to think that they're things you can create in CloudFormation, and maybe you just haven't yet found the docs for them. You'd be wrong if you thought that. AWS NATs are standard EC2 instances which you boot using an AMI provided by AWS, which are really just stock Amazon Linux machines that run a script on boot. AWS does no management of them for you above and beyond what they do on any other instance you launch.

This is troublesome, as it means that upgrading a NAT is now a downtime event for the app servers behind it (if they reach out to any services outside of your network). It also means that if you're going for an environment that manages itself as much as possible, you need to put that NAT instance into an auto-scaling group, and then do some scripting to update the routing tables when an instance boots^[4]. Lastly, if you're in a situation where for compliance (or other) reasons you need to keep an inventory of system components and installed software, these instances will fall under that regime.
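To give a feel for the "some scripting" part, here's a minimal sketch of what a freshly booted NAT instance might run. It assumes boto3 is installed, that the instance's IAM role allows ec2:ModifyInstanceAttribute and ec2:ReplaceRoute, and that the private subnet's route table already has a 0.0.0.0/0 route to replace. The instance ID, route table ID, and function names are all made up for illustration.

```python
# Sketch only: the IDs and function names are hypothetical placeholders.

def nat_takeover_calls(instance_id, route_table_id):
    """Return the EC2 API calls (method name, kwargs) a new NAT instance needs."""
    return [
        # A NAT forwards traffic that isn't addressed to itself, so the
        # source/destination check has to be switched off.
        ("modify_instance_attribute", {
            "InstanceId": instance_id,
            "SourceDestCheck": {"Value": False},
        }),
        # Point the private subnet's default route at this instance.
        # replace_route assumes a 0.0.0.0/0 route already exists in the table.
        ("replace_route", {
            "RouteTableId": route_table_id,
            "DestinationCidrBlock": "0.0.0.0/0",
            "InstanceId": instance_id,
        }),
    ]


def take_over_default_route(instance_id, route_table_id, region):
    """Apply the calls above against the real EC2 API."""
    import boto3  # imported lazily so the planning function has no dependencies

    ec2 = boto3.client("ec2", region_name=region)
    for method_name, kwargs in nat_takeover_calls(instance_id, route_table_id):
        getattr(ec2, method_name)(**kwargs)
```

Even in this stripped-down form, it's one more moving part you have to own, test, and keep working.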

So what do I recommend? Sadly, it's what AWS calls Scenario 1: put all your instances into a public subnet, and then use security groups to make sure that only specific boxes accept connections from outside the network. Your attack surface is larger (as everything has public IPs), but you've mitigated that risk with proper controls.
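As a rough illustration of what "proper controls" could look like, here's a sketch of a security group layout expressed as plain data, plus a check that only the groups you intend are reachable from the internet. The group names, ports, and layout are assumptions for the kind of web application described earlier, not anything AWS prescribes.

```python
# Hypothetical layout: each rule is (protocol, port, source), where the
# source is either a CIDR block or the name of another security group.
GROUPS = {
    "lb":     [("tcp", 443, "0.0.0.0/0")],   # the only thing the world can reach
    "app":    [("tcp", 8080, "lb")],         # app servers only accept traffic from the LB
    "worker": [],                            # workers accept no inbound connections
    "cache":  [("tcp", 6379, "app"), ("tcp", 6379, "worker")],
    "db":     [("tcp", 5432, "app"), ("tcp", 5432, "worker")],
}

WORLD = {"0.0.0.0/0", "::/0"}


def world_reachable(groups):
    """Return the sorted names of groups with at least one rule open to the internet."""
    return sorted(
        name
        for name, rules in groups.items()
        if any(source in WORLD for _, _, source in rules)
    )
```

Running a check like `world_reachable(GROUPS)` as part of a deploy is a cheap way to notice when someone opens a port on a box that was never meant to be public.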

Be Ready to Scale and Absorb the Attack

Their advice here could be summarized as Use ELBs^[5] for your app servers, DNS which won't go down, and a CDN.

The DNS and CDN sections aren't worth much further discussion, aside from the fact that AWS CloudFront is only an OK CDN (I much prefer Fastly), and that there are other DNS providers out there beyond Route53 - although it does play extra-nice with ELBs.

Their advice about using an auto-scaling group behind an ELB is well-intentioned and correct, but mentions only in passing that "[additional] instances may increase your costs" - which is the biggest risk you face using this setup. Unless you're careful with your auto-scaling rules, an auto-scaling group will keep spinning up instances in the face of a DDOS attack - and you'll lose if you try to outspend a DDOS attack.
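One way to be careful is to work backwards from what you're willing to spend and cap the group's maximum size accordingly. The arithmetic is trivial, but it's worth writing down. This is a sketch with made-up numbers (prices are in whole cents to avoid floating-point surprises); plug in your own instance price and budget.

```python
def worst_case_cost(max_instances, price_cents_per_hour, hours):
    """Cost in cents if the group sits at its maximum size for the whole period."""
    return max_instances * price_cents_per_hour * hours


def max_size_for_budget(budget_cents, price_cents_per_hour, hours):
    """Largest group size whose worst-case cost stays within the budget."""
    return budget_cents // (price_cents_per_hour * hours)


# Hypothetical figures: a 10-cent-per-hour instance, and a willingness to
# burn at most $500 over a 24-hour attack.
cap = max_size_for_budget(budget_cents=50_000, price_cents_per_hour=10, hours=24)
```

Whatever number falls out is what belongs in the group's maximum size setting. Past that point, more instances just means a bigger bill.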

Although they don't talk about it since it's not their product, this is the natural place where the document should suggest looking into vendors like CloudFlare or Incapsula to handle this for you.

Safeguard Exposed Resources

This section is mostly a grab-bag of other product offerings AWS would like you to use.

There's an extensive section in here about deploying web application firewalls, but that suffers from the same risk discussed earlier around trying to outspend a DDOS attack - if you put WAF nodes into an auto-scaling group (as it suggests), you'll spend all your money scaling up that group instead of the app servers. The mechanism is the same (the attacker makes you spend more, triggered by an auto-scaling rule), but the result is more expensive, as WAF AMIs generally charge by the hour.

Again, the best advice to safeguard exposed resources would be to put them behind something like CloudFlare, so they're exposed to someone else's network and not yours.

Learn Normal Behavior

You should definitely do this, but I've never been satisfied by CloudWatch, their suggested way of doing it.

Does CloudWatch record all the things it says it records? Yes.
Will you have a good time dealing with that data? Nope.

If you're serious about understanding what your system is doing at any point in time, you should find vendors/products that focus on just that - performance monitoring and logging are both hard enough domains as-is. For performance monitoring, I love New Relic - it's not cheap, but monitoring is the only thing they focus on, and I've always felt like I've gotten positive ROI on it (with room to spare). For application-level logging I prefer LogEntries, but try the demos in this area - there are a few distinct approaches to this problem, and you should pick the one that matches your thinking.

Having said all that, I'll applaud AWS for saying you should put an alert on the EstimatedCharges metric, which tracks your estimated monthly bill. If you do nothing else, do this right now.
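If you'd rather script it than click through the console, a minimal sketch with boto3 might look like the following. It assumes billing alerts are enabled in your account's billing preferences, that you already have an SNS topic to notify, and that your bill is in USD; the alarm name, function names, and topic ARN are placeholders. Billing metrics are published in us-east-1, so that's where the alarm goes.

```python
def billing_alarm_params(threshold_usd, sns_topic_arn):
    """Build the arguments for CloudWatch's put_metric_alarm call."""
    return {
        "AlarmName": f"estimated-charges-over-{threshold_usd}-usd",  # hypothetical name
        "Namespace": "AWS/Billing",
        "MetricName": "EstimatedCharges",
        "Dimensions": [{"Name": "Currency", "Value": "USD"}],
        "Statistic": "Maximum",
        "Period": 21600,  # six hours; the metric only updates a few times a day
        "EvaluationPeriods": 1,
        "Threshold": float(threshold_usd),
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_topic_arn],
    }


def create_billing_alarm(threshold_usd, sns_topic_arn):
    """Create the alarm for real. Requires boto3 and AWS credentials."""
    import boto3  # imported lazily so building the parameters needs no dependencies

    cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")
    cloudwatch.put_metric_alarm(**billing_alarm_params(threshold_usd, sns_topic_arn))
```

Pick a threshold a bit above your normal monthly bill, so the alarm stays quiet until something unusual is happening.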

Create a Plan for Attacks

This is a great idea, but not in the way they're suggesting.

Sure, having an AWS TAM^[6] will help a little when you're in the midst of an attack (and paying for business-level support is a great investment, especially when getting started), but the biggest return will come from having a plan in place before things happen, and having tested it a few times in a tabletop exercise. How this will work is very organization-specific, but Heroku's incident response framework is a great start.

Conclusion

Building a system that's completely DDOS-resistant is impossible, and likely wouldn't be worth the time even if it were possible.

Rather than trying to beat script kiddies all day every day, you're much better off accepting that a successful DDOS attack^[7] will eventually happen, and coming up with a plan for what to do when that day comes. I'd start by making sure that all parts of your organization can agree on answers to the following questions, focusing on what happens while you're being attacked:

Who's in control internally, and what do we expect of them?
How much of an attack do we expect to be able to stand up against?
How open will we be with our users, and what will their experience be?
How can other parts of the organization support the effort?

It's worth highlighting that none of the points above have much to do with technology, which is intentional. DDOS attacks are scary and stressful times, and your organizational and interpersonal problems will be the dominant factors. Get those right - with practice, honesty, and communication - and you'll have the space needed to deal with the technical ones.

[1] Scenario 1: VPC with a Public Subnet Only
[2] Scenario 2: VPC with Public and Private Subnets
[3] Like this one, from scenario 2
[4] There's a CloudFormation template available to build an HA NAT, but it's always seemed like overkill to me
[5] Elastic Load Balancers; the AWS managed load balancer product
[6] Technical Account Manager; the person who manages your relationship with AWS paid support
[7] Or any kind, really - the advice doesn't change