A Stroll Through the InformationWeek Archives
The cloud is on the rise, but cloud outages are nothing new. And neither are we. InformationWeek was founded in 1985, and our online archives go back to 1998. Here are just a few lowlights from the cloud’s worst moments, dug up from our archives.
Apr. 17, 2007 / In Web 2.0 Keynote, Jeff Bezos Touts Amazon’s On-Demand Services, by Thomas Claburn — “Asked by conference founder Tim O’Reilly whether Amazon was making any money on this, Bezos answered, ‘We certainly intend to make money on this,’ before finally admitting that AWS wasn’t profitable today.”
(As a reminder, people, here in 2022, AWS is now worth a trillion dollars.)
Aug. 12, 2008 / Sorry, the Internets are Broken Today, by Dave Methvin and Google Apologizes for Gmail Outage, by Thomas Claburn — After a string of clunky disruptions across Microsoft MSDN forums, Gmail, Amazon S3, GoToMeeting, and SiteMeter, Methvin laments: “When you use a third-party service, it becomes a black box that is hard to verify, or even know if or when something has changed. Welcome to your future nightmare.”
Oct. 17, 2008 / Google Gmail Outage Brings Out Cloud Computing Naysayers, by Thomas Claburn — Because the outage appears to have lasted more than 24 hours for some, affected paying Gmail customers appear to be owed service credits, per the terms of the Gmail SLA. As one customer said: “This is not a temporary problem if it lasts this long. It is frustrating to not be able to expedite these issues.”
June 11, 2010 / The Cloud’s Five Biggest Weaknesses, by John Soat — “The recent problems with Twitter (“Fail Whale”) and Steve Jobs’ embarrassment at the network outage at the introduction of the new iPhone don’t exactly impart warm fuzzy feelings about the Internet and network performance in general. An SLA can’t guarantee performance; it can only punish bad performance.”
[In 2022, a cloud SLA can accomplish basically nothing at all. As Richard Pallardy and Carrie Pallardy wrote this week, “Industry standard service level agreements are remarkably restrictive, with most companies assuming little if any liability.”]
April 21, 2011 / Amazon EC2 Outage Hobbles Websites, by Thomas Claburn / April 22, 2011 / Cloud Takes a Hit, Amazon Must Fix EC2, by Charles Babcock / April 29, 2011 / Post-Mortem: When Amazon’s Cloud Turned on Itself, by Charles Babcock — The “Easter Weekend” Amazon outage that impacted Yard, Foursquare, Hootsuite, Heroku, Quora, and Reddit, among others. Babcock writes: “In building high availability into cloud software, we have escaped the confines of hardware failures that brought operating systems to a halt. In the cloud, the hardware may fail and everything else keeps running. On the other hand, we have discovered that we have entered a higher atmosphere of operations and a larger plane on which potential failures may occur.
“The new architecture works great when only one disk or server fails, a predictable event when running tens of thousands of devices. But the solution itself doesn’t work if it thinks hundreds of servers or thousands of disks have failed all at once, taking valuable data with them. That’s an unanticipated event in cloud architecture because it isn’t supposed to happen. Nor did it happen last week. But the governing cloud software thought it had, and triggered a massive recovery effort. That effort in turn froze EBS and Relational Database Service in place. Server instances continued running in U.S. East-1, but they couldn’t access anything, more servers couldn’t be initiated and the cloud ceased functioning in one of its availability zones for all practical purposes for over 12 hours.”
Aug. 9, 2011 / Amazon Cloud Outage: What Can Be Learned? by Charles Babcock — A lightning strike in Dublin, Ireland, knocked Amazon’s European cloud services offline Sunday, and some customers were expected to be down for up to two days. (Lightning will make an appearance in other outages to come.)
July 2, 2012 / Amazon Outage Hits Netflix, Heroku, Pinterest, Instagram, by Charles Babcock — Amazon Web Services’ data center in the US East-1 region loses power due to violent electrical storms, knocking out many website customers.
July 26, 2012 / Google Talk, Twitter, Microsoft Outages: Bad Cloud Day, by Paul McDougall / July 26, 2012 / Microsoft Investigates Azure Outage in Europe, by Charles Babcock / March 1, 2012 / Microsoft Azure Explanation Doesn’t Soothe, by Charles Babcock — Google reported its Google Talk IM and video chat service was down in parts of the United States and around the globe on the same day Twitter was also offline in some areas and Microsoft’s Azure cloud service was out across Europe. Microsoft’s postmortem on the Azure cloud outage cites a human error factor but leaves other questions unanswered. Does this remind you of how Amazon handled its earlier lightning strike incident?
Oct. 23, 2012 / Amazon Outage: Multiple Zones a Smart Strategy, by Charles Babcock — Traffic in Amazon Web Services’ most heavily used data center complex, U.S. East-1 in Northern Virginia, was tied up by an outage in one of its availability zones. Damage control got underway immediately, but the effects of the outage were felt throughout the day, said Adam D’Amico, Okta’s director of technical operations. Savvy customers, such as Netflix, that have made a major investment in the use of Amazon’s EC2 can often avoid service interruptions by using multiple zones. But as reported by NBC News, some Netflix regional services were affected by Monday’s outage.
Okta’s director of technical operations told Babcock that they use all five zones to hedge against outages. “If there’s a sixth zone tomorrow, you can bet we’ll be in it within a few days.”
Jan 4, 2013 / Amazon’s Dec. 24 Outage: A Closer Look, by Charles Babcock — Amazon Web Services once again cites human error, spread by automated systems, for the loss of load balancing at a key facility on Christmas Eve.
Nov. 15, 2013 / Microsoft Pins Azure Slowdown on Software Fault, by Charles Babcock — Microsoft Azure GM Mike Neil explains the Oct. 29-30 slowdown and the reason behind the widespread failure.
May 23, 2014 / Rackspace Addresses Cloud Storage Outage, by Charles Babcock — A solid state disk capacity shortage disrupts some Cloud Block Storage customers’ operations in Rackspace’s Chicago and Dallas data centers. Rackspace’s status reporting service said the problem “was due to higher than expected customer growth.”
July 20, 2014 / Microsoft Explains Exchange Outage, by Michael Endler — Some customers were unable to reach Lync for several hours Monday, and some Exchange users went nine hours Tuesday without access to email.
Aug. 15, 2014 / Practice Fusion EHR Caught in Internet Brownout, by Alison Diana — A number of small physician practices and clinics sent patients and staff home after cloud-based electronic health record provider Practice Fusion’s website was part of a worldwide two-day outage.
Sept. 26, 2014 / Amazon Reboots Cloud Servers, Xen Bug Blamed, by Charles Babcock — Amazon tells customers it has to patch and reboot 10% of its EC2 cloud servers.
Dec. 22, 2014 / Microsoft Azure Outage Blamed on Bad Code, by Charles Babcock — Microsoft’s analysis of the Nov. 18 Azure outage indicates that engineers’ decision to widely deploy misconfigured code triggered the major cloud outage.
Jan. 28, 2015 / When Facebook’s Down, Thousands Slow Down, by Charles Babcock — When Facebook went down this week, thousands of websites linked to the social media site also slowed down, according to Dynatrace. At least 7,500 websites that depend on a JavaScript response from a Facebook server had their operations slowed or stalled by a lack of response from Facebook.
Aug. 20, 2015 / Google Loses Data: Who Says Lightning Never Strikes Twice? by Charles Babcock — Google experienced high read/write error rates and a small data loss at its Google Compute Engine data center in St. Ghislain, Belgium, Aug. 13-17, following a storm that delivered four lightning strikes on or near the data center.
Sep. 22, 2015 / Amazon Disruption Produces Cloud Outage Spiral, by Charles Babcock — An Amazon DynamoDB failure early Sunday set off cascading slowdowns and service disruptions that illustrate the highly connected nature of cloud computing. A number of Web companies, including AirBnB, IMDB, Pocket, Netflix, Tinder, and Buffer, were affected by the service slowdown and, in some cases, service disruption. The incident began at 3 a.m. PT Sunday, or 6 a.m. in the location where it had the biggest impact: Amazon’s most heavily trafficked data center complex in Ashburn, Va., also known as US-East-1.
May 12, 2016 / Salesforce Outage: Can Customers Trust the Cloud?, by Jessica Davis — The Salesforce service outage began on Tuesday with the company’s NA14 instance, affecting customers on the US West Coast. And while service was restored on Wednesday after nearly a full day of downtime, the instance has continued to experience a degradation of service, according to Salesforce’s online status site.
March 7, 2017 / Is Amazon’s Growth Running a Little Out of Control? by Charles Babcock — After a five-hour S3 outage in US East-1 on Feb. 28, AWS operations explained that it was harder to restart its S3 index system this time than the last time they tried to restart it.
Writes Babcock: “Given the fact that the outage started with a data entry error, much reporting on the incident has described the event as explainable as a human error. The human error involved was so predictable and common that this is an inadequate description of what’s gone wrong. It took only a minor human error to trigger AWS’ operational systems to start working against themselves. It’s the runaway automated nature of the failure that’s unsettling. Automated systems operating in an inevitably self-defeating manner is the mark of an immature architecture.”
Fast Forward to Today
As Sal Salamone detailed neatly this week in his piece about lessons learned from recent major outages: Cloudflare, Fastly, Akamai, Facebook, AWS, Azure, Google, and IBM have all had calamities much like these in 2021-22. Human errors, software bugs, power surges, automated responses with unexpected consequences, all causing havoc.
What will we be writing 15 years from now about cloud outages?
Maybe more of the same. But you won’t be able to read it if there’s lightning in Virginia.
What to Read Next:
Lessons Learned from Recent Major Outages
Can You Recover Losses Sustained During an Outage?
Special Report: How Fragile is the Cloud, Really?