A Stroll Through the InformationWeek Archives

August 5, 2022

The cloud is growing, but cloud outages are nothing new. And neither are we. InformationWeek was founded in 1985 and our online archives go back to 1998. Here are just a few lowlights from the cloud’s worst moments, dug up from our archives.

Apr. 17, 2007 / In Web 2.0 Keynote, Jeff Bezos Touts Amazon’s On-Demand Services, by Thomas Claburn — “Asked by conference founder Tim O’Reilly whether Amazon was making any money on this, Bezos answered, ‘We certainly intend to make money on this,’ before finally admitting that AWS wasn’t profitable at this time.”

(As a reminder, folks, here in 2022, AWS is now worth a trillion dollars.)

Aug. 12, 2008 / Sorry, the Internets are Broken Today, by Dave Methvin and Google Apologizes for Gmail Outage, by Thomas Claburn — After a string of clunky disruptions across Microsoft MSDN forums, Gmail, Amazon S3, GoToMeeting, and SiteMeter, Methvin laments: “When you use a third-party service, it becomes a black box that’s hard to verify, or even know if or when something has changed. Welcome to your future nightmare.”

Oct. 17, 2008 / Google Gmail Outage Brings Out Cloud Computing Naysayers, by Thomas Claburn — Because the outage appears to have lasted more than 24 hours for some, affected paying Gmail customers appear to be owed service credits, as per the terms of the Gmail SLA. As one customer said: “This isn’t a temporary problem if it lasts this long. It’s frustrating to not be able to expedite these issues.”

June 11, 2010 / The Cloud’s 5 Largest Weaknesses, by John Soat — “The recent problems with Twitter (‘Fail Whale’) and Steve Jobs’ embarrassment at the network outage at the introduction of the new iPhone don’t exactly impart warm fuzzy feelings about the Internet and network performance in general. An SLA can’t guarantee performance; it can only punish bad performance.”

[In 2022, a cloud SLA can accomplish basically nothing at all. As Richard Pallardy and Carrie Pallardy wrote this week, “Industry standard service level agreements are remarkably restrictive, with most companies assuming little if any liability.”]

April 21, 2011 / Amazon EC2 Outage Hobbles Websites, by Thomas Claburn / April 22, 2011 / Cloud Takes a Hit, Amazon Must Fix EC2, by Charles Babcock / April 29, 2011 / Post-Mortem: When Amazon’s Cloud Turned on Itself, by Charles Babcock — The “Easter Weekend” Amazon outage that impacted Yard, Foursquare, Hootsuite, Heroku, Quora, and Reddit, among others. Babcock writes: “In building high availability into cloud software, we’ve escaped the confines of hardware failures that brought running systems to a halt. In the cloud, the hardware may fail and everything else keeps running. On the other hand, we’ve discovered that we’ve entered a higher atmosphere of operations and larger plane on which potential failures may occur.

“The new architecture works great when only one disk or server fails, a predictable event when running tens of thousands of devices. But the solution itself doesn’t work if it thinks hundreds of servers or thousands of disks have failed all at once, taking valuable data with them. That’s an unanticipated event in cloud architecture because it’s not supposed to happen. Nor did it happen last week. But the governing cloud software thought it had, and triggered a massive recovery effort. That effort in turn froze EBS and Relational Database Service in place. Server instances continued running in U.S. East-1, but they couldn’t access anything, more servers couldn’t be initiated and the cloud ceased functioning in one of its availability zones for all practical purposes for over 12 hours.”

Aug. 9, 2011 / Amazon Cloud Outage: What Can Be Learned? by Charles Babcock — A lightning strike in Dublin, Ireland knocked Amazon’s European cloud services offline Sunday and some customers were expected to be down for up to two days. (Lightning will make an appearance in other outages in the future.)

July 2, 2012 / Amazon Outage Hits Netflix, Heroku, Pinterest, Instagram, by Charles Babcock — Amazon Web Services data center in the US East-1 region loses power because of violent electrical storms, knocking out many website customers.

July 26, 2012 / Google Talk, Twitter, Microsoft Outages: Bad Cloud Day, by Paul McDougall / July 26, 2012 / Microsoft Investigates Azure Outage in Europe, by Charles Babcock / March 1, 2012 / Microsoft Azure Explanation Doesn’t Soothe, by Charles Babcock — Google reported its Google Talk IM and video chat service was down in parts of the United States and across the globe on the same day Twitter was also offline in some areas, and Microsoft’s Azure cloud service was out across Europe. Microsoft chief’s post-mortem on Azure cloud outage cites a human error factor, but leaves other questions unanswered. Does this remind you of how Amazon handled its earlier lightning strike incident?

Oct. 23, 2012 / Amazon Outage: Multiple Zones a Smart Strategy, by Charles Babcock — Traffic in Amazon Web Services’ most heavily used data center complex, U.S. East-1 in Northern Virginia, was tied up by an outage in one of its availability zones. Damage control got underway immediately but the effects of the outage were felt throughout the day, said Adam D’Amico, Okta’s director of technical operations. Savvy customers, such as Netflix, who have made a major investment in use of Amazon’s EC2, can sometimes avoid service interruptions by using multiple zones. But as reported by NBC News, some Netflix regional services were affected by Monday’s outage.

Okta’s director of technical operations told Babcock that they use all five zones to hedge against outages. “If there’s a sixth zone tomorrow, you can bet we’ll be in it within a few days.”

Jan. 4, 2013 / Amazon’s Dec. 24 Outage: A Closer Look, by Charles Babcock — Amazon Web Services once again cites human error spread by automated systems for loss of load balancing at key facility Christmas Eve.

Nov. 15, 2013 / Microsoft Pins Azure Slowdown on Software Fault, by Charles Babcock — Microsoft Azure GM Mike Neil explains the Oct. 29-30 slowdown and the reason behind the widespread failure.

May 23, 2014 / Rackspace Addresses Cloud Storage Outage, by Charles Babcock — Solid state disk capacity shortage disrupts some Cloud Block Storage customers’ operations in Rackspace’s Chicago and Dallas data centers. Rackspace’s status reporting service said the problem “was due to higher than expected customer growth.”

July 20, 2014 / Microsoft Explains Exchange Outage, by Michael Endler — Some customers were unable to reach Lync for several hours Monday, and some Exchange users went nine hours Tuesday without access to email.

Aug. 15, 2014 / Practice Fusion EHR Caught in Internet Brownout, by Alison Diana — A number of small physician practices and clinics sent home patients and staff after cloud-based electronic health record provider Practice Fusion’s site was part of a global two-day outage.

Sept. 26, 2014 / Amazon Reboots Cloud Servers, Xen Bug Blamed, by Charles Babcock — Amazon tells customers it has to patch and reboot 10% of its EC2 cloud servers.

Dec. 22, 2014 / Microsoft Azure Outage Blamed on Bad Code, by Charles Babcock — Microsoft’s analysis of Nov. 18 Azure outage indicates engineers’ decision to widely deploy misconfigured code triggered major cloud outage.

Jan. 28, 2015 / When Facebook’s Down, Thousands Slow Down, by Charles Babcock — When Facebook went down this week, thousands of websites linked to the social media site also slowed down, according to Dynatrace. At least 7,500 websites that depend on a JavaScript response from a Facebook server had their operations slowed or stalled by a lack of response from Facebook.

Aug. 20, 2015 / Google Loses Data: Who Says Lightning Never Strikes Twice? by Charles Babcock — Google experienced high read/write error rates and a small data loss at its Google Compute Engine data center in St. Ghislain, Belgium, Aug. 13-17 following a storm that delivered four lightning strikes on or near the data center.

Sept. 22, 2015 / Amazon Disruption Produces Cloud Outage Spiral, by Charles Babcock — Amazon DynamoDB failure early Sunday set off cascading slowdowns and service disruptions that illustrate the highly connected nature of cloud computing. A number of Web companies, including AirBnB, IMDB, Pocket, Netflix, Tinder, and Buffer, were affected by the service slowdown and, in some cases, service disruption. The incident began at 3 a.m. PT Sunday, or 6 a.m. in the location where it had the greatest impact: Amazon’s most heavily trafficked data center complex in Ashburn, Va., also known as US-East-1.

May 12, 2016 / Salesforce Outage: Can Customers Trust the Cloud?, by Jessica Davis — The Salesforce service outage started on Tuesday with the company’s NA14 instance, affecting customers on the US west coast. And while service was restored on Wednesday after nearly a full day of downtime, the instance has continued to experience a degradation of service, according to Salesforce’s online status site.

March 7, 2017 / Is Amazon’s Growth Running a Little Out of Control? by Charles Babcock — After a five-hour S3 outage in US East-1 Feb. 28, AWS operations explains that it was harder to restart its S3 index system this time than the last time they tried to restart it.

Writes Babcock: “Given the fact that the outage started with a data entry error, much reporting on the incident has described the event as explainable as a human error. The human error involved was so predictable and common that this is an inadequate description of what’s gone wrong. It took only a minor human error to trigger AWS’ operational systems to start working against themselves. It’s the runaway automated nature of the failure that’s unsettling. Automated systems operating in an inevitably self-defeating manner is the mark of an immature architecture.”

Fast Forward to Today

As Sal Salamone detailed neatly this week, in his piece about lessons learned from recent major outages: Cloudflare, Fastly, Akamai, Facebook, AWS, Azure, Google, and IBM have all had calamities similar to this in 2021-22. Human errors, software bugs, power surges, automated responses having unexpected consequences, all causing havoc.

What will we be writing 15 years from now about cloud outages?

Maybe more of the same. But you might not be able to read it if there’s lightning in Virginia.