Yesterday, it seemed half the internet was down.  And yet we were able to maintain Disaster Preparedness systems and most services through the AWS outage.   Since the impacted S3 was a key component, it was for us like flying a jet on one engine through a storm.

So how were we able to keep our clients running through all the pandemonium?  And how will your business fare with us in the event of a major disaster?  And what can we do to improve our capabilities to keep you running when everything goes down?

The first thing we want to point out is that the Amazon Web Services (AWS) cloud infrastructure is actually divided among 42 Availability Zones in 16 globally distributed Regions.  Yesterday’s outage occurred in the US-East-1 Region, in Northern Virginia, where the Simple Storage Service (S3) experienced interruptions.  Since this is a major hub of activity for the eastern U.S., and S3 storage is utilized not only by other AWS services but also by a large number of hosting backend operations, an estimated 150,000 websites and applications experienced anything from slower access to going completely down for several hours.

AWS Global Infrastructure

Since we’re a California company, most of our clients are based in the western U.S.  Some applications use services at US-East-1, but since our policy (and best practice) is to deploy cloud resources close to the client’s primary markets, most of the AWS usage we manage is outside of that region.

Still Reliable After All These Years

We’ve also decided to recommend Cross-Region Replication more often as an option for clients requiring high levels of accessibility.  However, we should point out that the standard S3 service is designed for 99.99% availability and is backed by one of the soundest Service Level Agreements in the industry.
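To put an availability figure like 99.99% in perspective, it helps to convert it into a downtime budget.  The short Python sketch below does that arithmetic; the function name downtime_minutes is ours, made up for illustration, and the numbers describe what a percentage allows on paper, not what any provider promises or delivered.

```python
def downtime_minutes(availability_percent: float, days: float = 365.0) -> float:
    """Minutes of downtime allowed over `days` at the given availability.

    Hypothetical helper for illustration only.
    """
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability must be between 0 and 100")
    total_minutes = days * 24 * 60
    # The unavailable fraction of the period is (100 - availability) / 100.
    return total_minutes * (1.0 - availability_percent / 100.0)


if __name__ == "__main__":
    per_year = downtime_minutes(99.99)
    per_month = downtime_minutes(99.99, days=30)
    print(f"99.99% allows about {per_year:.1f} minutes per year")
    print(f"99.99% allows about {per_month:.1f} minutes per 30-day month")
```

In other words, 99.99% works out to under an hour of downtime across an entire year, which is why an outage lasting several hours stands out so sharply.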

CloudHarmony independently monitors cloud services, and its data indicates that S3 nearly always has 100% availability, with yesterday and the August 2015 outage being exceptions.  You can see the latest status for cloud providers on their site at: https://cloudharmony.com/status-of-storage

It is also important to know that yesterday’s outage was not only rare, but also simply a matter of availability.  No data was lost, thanks to the fact that S3 stores your data on multiple physical devices across multiple facilities in any given region.  The data is constantly checked, and the high level of redundancy also serves to protect your data from even the total failure of an entire data center.

So the issue yesterday was not so much the loss of data as not being able to get to it.  Certainly not a minor thing, particularly if you are located in the Eastern U.S.

And of course, there were 15 other regions that were not experiencing the outage.

Other Services Impacted

Since we are involved with far more than data storage, other services experienced various degrees of degradation.  For us, marketing services were most highly impacted.  With Adobe utilizing S3 for storage of much of our stock image library, our ability to purchase and download images was impaired.  Online Creative Cloud library access was also limited, but the user libraries are synced with local copies, so local versions of the files were fully available and updates still synced, though at a slower rate.

Constant Contact email image hosting and our automated social media posting were painfully slow or non-functional.  Yet we were still able to send update bulletins to clients yesterday, and we simply posted to social media manually.

For websites that we manage, hosting services for non-media-intensive sites were generally not affected, unless the site was hosted on AWS.  However, since a lot of content distribution for the Eastern U.S. utilizes AWS, visitors in that area had serious problems viewing production sites.

Bullet-Proofing the Backups

Because recovery backups are run outside of business hours, the outage did not impact that aspect of our disaster recovery systems.   We were preparing to switch backups to Microsoft Azure, if S3 were to remain down after 5 pm PST.

We wouldn’t call our service Disaster Continuity if we didn’t have contingency plans for contingency plans!

But since we almost always recommend a hybrid system combining local storage and cloud backups, the worst-case scenario would be that the current day’s data would have local-only backups.   And the prior day’s backups would still be in the cloud, in case of any need to totally recover the data.
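For readers who like to see the logic spelled out, here is a minimal Python sketch of that kind of hybrid plan: always write the local copy first, then try cloud targets in order and stop at the first one that accepts the upload.  Everything in it is an assumption made for illustration, including the run_backup function, the Uploader type, and the target names; it is not our production tooling and it does not call any real AWS or Azure API.

```python
from typing import Callable, Iterable, List, Tuple

# Hypothetical uploader type: takes the backup bytes, raises on failure.
Uploader = Callable[[bytes], None]


def run_backup(
    data: bytes,
    save_local: Uploader,
    cloud_targets: Iterable[Tuple[str, Uploader]],
) -> List[str]:
    """Write a local copy, then upload to the first cloud target that works.

    Returns the names of the destinations that now hold the backup.
    """
    completed: List[str] = []

    # The local copy comes first so a cloud outage never leaves us with nothing.
    save_local(data)
    completed.append("local")

    # Try each cloud target in priority order and stop at the first success.
    for name, upload in cloud_targets:
        try:
            upload(data)
        except Exception as exc:  # broad on purpose: any failure means "try the next one"
            print(f"{name} unavailable ({exc}); trying next target")
            continue
        completed.append(name)
        break

    return completed
```

If every cloud target fails, the function still returns ["local"], which matches the worst case described above: the current day’s data exists only in the local backup until the cloud is reachable again.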

Disaster Preparedness Systems PROVEN During Disasters

Many of our Disaster Preparedness systems were not only developed during catastrophic natural disasters, but also used by clients while those disasters were unfolding.  We’ve kept operations functioning in the aftermath of devastating earthquakes and storms, including the most powerful cyclonic storm to hit land.  Through many of these events, and many smaller ones, Amazon Web Services was a key vendor that allowed us to provide continuity and safety for our clients and their customers.  Yesterday was indeed a rare occasion, and as the details emerge on this unfortunate occurrence, we know that AWS will do whatever is necessary to prevent it from happening again.

Keep Flying

But no matter what, we have been greatly encouraged by our ability to maintain services during this crisis.   S3 storage is an important part of our Disaster Preparedness systems, plus an integral component of much of what we do.   Still, we didn’t have to stop working.  More importantly, most of our clients would not even have noticed, if we hadn’t posted and emailed updates.  Considering how much S3 powers it all, #KeepYouWORKING honestly worked better than we expected, given the circumstances.

We kept the plane of our operation in the air…and we made it safely through this storm with the equivalent of only one engine.

Yup, another disaster (albeit man-made this time) and we got everyone there in one piece.

Disaster Preparedness Systems - through the storm on just one engine.