AWS High Availability on NAT instances

If your Amazon Web Services VPC has instances in a private subnet that requires accessing the internet, you may be using a NAT instance with the VPC route table (containing a route for 0.0.0.0/0 pointing to the NAT instance). Have you considered the possibility of this creating a single point of failure?

Consider the following scenario: you have all your instances mirrored for high availability, across two availability zones. Even with a NAT instance in each availability zone, the 0.0.0.0/0 route to the NAT instance is a single point of failure.

So how to reduce this risk? If most of your outgoing traffic consists of simple http requests, such as running upgrading Operating System packages, one solution is to use proxy servers. By using Squid (or Tinyproxy) running on an instance in the public subnet in both availability zones, then adding an internal Elastic Load Balancer, you provide high availability for your outgoing traffic.

AWS NAT High Availability with Heartbeat

However, if you need NATing for production-critical applications where a proxy server isn’t enough for outgoing traffic, AWS has come up with a simple shell script that can achieve NAT high availability. It works with the two NAT instances (one in each availability zone) pinging each other. If one doesn’t respond, the active NAT instance takes over the route for 0.0.0.0/0 in the availability zone in which the NAT has failed, then attempts to reboot the unresponsive instance. It uses the ec2 api tools and IAM roles and requires each availability zone to use its own route table for the private subnets.

This is how a high level diagram would look like:

aws_nat

The steps are described in the article at http://aws.amazon.com/articles/2781451301784570

Some Improvement tips

Ensure you execute the script with bash instead of sh otherwise the conditional expressions will fail.

It is preferable to run the script as an upstart script rather than a cronjob, so you can easily stop the service when you need to perform maintenance and reboot a NAT instance. Puppet also works better with scripts running as a service so it can ensure they are always running.

Conclusion

There are many tweaks that can be done to improve the script. AWS will reportedly be launching a highly available NAT service in the future but for now this script does the job.