So how did Tuesday’s Amazon AWS S3 outage occur? Why was such a large part of the internet down? Amazon just released an official summary of the incident.
Basically, an authorized member of the S3 team was debugging a slowdown in the billing system. Following established procedure, they entered a command intended to remove a small number of servers from service. Unfortunately, the command was entered incorrectly, and a much larger set of servers, including those that handle file requests, was removed instead. As a result, all the websites and applications that needed to access their stored files (plus all the AWS services that rely on S3) could not work properly, and the failure of services hosted in Amazon's Northern Virginia data center rippled across the internet.
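To get a feel for how a single mistyped command can have such outsized effects, here is a small, entirely hypothetical sketch (the fleet names and the pattern-matching tool are invented for illustration, not Amazon's actual tooling). An over-broad host pattern silently matches far more servers than the operator intended:

```python
import fnmatch

# Hypothetical server fleet; names and subsystems are invented for illustration.
FLEET = (
    ["billing-%02d" % i for i in range(4)]    # intended targets (4 hosts)
    + ["index-%02d" % i for i in range(40)]   # a larger fleet serving file requests
)

def select_hosts(pattern):
    """Return the hosts a removal command would act on for a given pattern."""
    return [h for h in FLEET if fnmatch.fnmatch(h, pattern)]

# The intended command: remove a small number of billing hosts.
print(len(select_hosts("billing-*")))   # -> 4

# A mistyped pattern sweeps in the entire fleet, file-serving hosts included.
print(len(select_hosts("*")))           # -> 44
```

The point is that nothing in the command itself distinguishes "remove 4 billing hosts" from "remove everything" — which is why the safeguards described below matter.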
To correct the issue, multiple subsystems had to be restarted, and each restart had some impact on overall functionality.
Amazon had systems in place to isolate failures of this nature, but because S3 has grown so quickly, what was once a small subsystem is now much larger. According to the summary, Amazon will be restructuring these subsystems so that any future disruption is limited in scope. In addition, safeguards have already been added that slow the rate at which capacity can be removed and block any removal that would take a subsystem below its minimum required capacity.
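The safeguards described above can be sketched in a few lines. This is a hypothetical model of the behavior, not Amazon's implementation; the `MIN_CAPACITY` and `MAX_REMOVAL` values are assumptions chosen for illustration:

```python
class CapacityError(Exception):
    """Raised when a capacity removal is refused by the safeguards."""

MIN_CAPACITY = 20   # assumed: minimum hosts the subsystem needs to serve traffic
MAX_REMOVAL = 2     # assumed: per-command removal cap (the "throttle")

def remove_capacity(active_hosts, requested):
    """Remove `requested` hosts from a pool of `active_hosts`, but refuse
    removals that are too large or that would breach the capacity floor."""
    if requested > MAX_REMOVAL:
        raise CapacityError(
            f"refusing to remove {requested} hosts at once (limit {MAX_REMOVAL})")
    if active_hosts - requested < MIN_CAPACITY:
        raise CapacityError(
            "removal would drop the subsystem below its minimum capacity")
    return active_hosts - requested

print(remove_capacity(24, 2))   # -> 22 (a small, safe removal succeeds)
```

With checks like these in front of the tooling, the mistyped command from the incident would have been rejected instead of executed.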
AWS will also be improving the recovery time of their subsystems. Planned work to isolate points of failure has been prioritized to begin immediately.
Also, since the Service Health Dashboard was itself affected by the failures, AWS will distribute the administration of the dashboard across several of their 16 worldwide regions. Many people noticed that the dashboard showed S3 operating normally during the early part of the crisis, which generated a great deal of amusement on social media. Running the dashboard from multiple regions should prevent a repeat of this problem.
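The multi-region idea boils down to: never depend on the failing region to report its own failure. A minimal sketch, with invented region names and canned responses standing in for real status endpoints:

```python
def first_available(responses):
    """Given {region: status_or_None}, return the status from the first
    region that answered, so one region's outage can't blank the page.
    `None` stands in for a region that failed to respond."""
    for region, status in responses.items():
        if status is not None:
            return region, status
    return None, "status unavailable"

# us-east-1 is down (no response); the dashboard falls back to eu-west-1.
responses = {"us-east-1": None, "eu-west-1": "S3: operating normally"}
print(first_available(responses))   # -> ('eu-west-1', 'S3: operating normally')
```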
In our previous post on how our systems worked through the outage, we expressed our faith in Amazon Web Services' ability to correct the problems behind this event. As we mentioned, AWS remains one of the more reliable cloud services, with availability that typically approaches 100%. You can monitor and compare the major cloud services here: https://cloudharmony.com/status-of-storage.
With the release of Amazon's summary, we are satisfied that they will continue to maintain their level of reliability and scalability, and we plan to continue to use and resell their services. But as many pointed out during and since this event, while the public cloud is a great solution, particularly for extending the capabilities and controlling the costs of small businesses and organizations, a good system must account for inevitable failures. Hybrid solutions are therefore an option everyone should consider, and this approach is at the heart of our Disaster Preparedness systems.
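The hybrid principle can be shown in a toy example. This is a minimal sketch of the idea only, not our actual Disaster Preparedness implementation; the two dicts stand in for a cloud bucket and an on-premise store:

```python
class HybridStore:
    """Write every object to both a cloud store and an on-premise store;
    read from the cloud when it's up, fall back to the local copy when not."""

    def __init__(self):
        self.cloud = {}       # stands in for S3 or a similar cloud bucket
        self.local = {}       # stands in for an on-premise copy
        self.cloud_up = True  # toggled off here to simulate an outage

    def put(self, key, value):
        self.local[key] = value
        if self.cloud_up:
            self.cloud[key] = value

    def get(self, key):
        if self.cloud_up and key in self.cloud:
            return self.cloud[key]
        return self.local[key]   # fall back when the cloud is down

store = HybridStore()
store.put("report.pdf", b"...")
store.cloud_up = False           # simulate an S3-style outage
print(store.get("report.pdf"))   # still served from the local copy
```

Real systems add synchronization and consistency handling, but the shape is the same: no single vendor or location sits between your users and their data.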
We hope that whatever you use also combines cloud and on-premise deployment. And certainly see to it that your computing capabilities are not dependent on any single vendor or on resources in a single physical location. As Murphy's law says, "If anything can go wrong, it will."
Summary of the Amazon S3 Service Disruption in the Northern Virginia (US-EAST-1) Region
CloudHarmony Service Status
(real-time monitoring of major cloud service providers, filtered for storage services like AWS S3)
Our Disaster Preparedness Systems Survive the AWS Outage
(how we were able to keep client services running through the event)