Microsoft Azure Outage Blamed On Bad Code - InformationWeek

12/22/2014
09:27 AM


Microsoft's analysis of the Nov. 18 Azure outage indicates that engineers' decision to widely deploy misconfigured code triggered a major cloud outage.


Microsoft has corrected a software bug in its Azure Cloud Storage that, when deployed widely on Nov. 18, triggered a massive outage. In some cases the error triggered infinite loops and tied up storage servers, dragging down much of the Azure cloud into a drastic slowdown.

As a result, connection rates to Azure on Nov. 18 dropped from 97% to 7%-8% after 7 p.m. Eastern in Northern Virginia. The Azure data center in Dallas suffered a complete outage for a short while. Data centers in Europe didn't recover until deep into the following day.

Microsoft had tested the storage update code before deployment, but contrary to its own best practices, Azure administrators rolled it out to Azure storage services as a whole instead of "flighting" -- limiting the roll-out to small sections at a time.

"The standard flighting deployment policy of incrementally deploying changes across small slices was not followed," wrote Microsoft's Jason Zander, corporate VP for the Azure team, in a blog post Dec. 17.
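
To make "flighting" concrete, here is a minimal Python sketch of an incremental rollout that updates a few nodes at a time and stops as soon as a slice looks unhealthy. It is an illustration only, not Microsoft's deployment tooling: the function name `flight_rollout`, the `deploy` and `healthy` callbacks, and the default slice size are all assumptions made for the example.

```python
from typing import Callable, List, Sequence


def flight_rollout(
    nodes: Sequence[str],
    deploy: Callable[[str], None],
    healthy: Callable[[str], bool],
    slice_size: int = 2,
) -> List[str]:
    """Deploy to small slices of nodes, stopping at the first unhealthy slice.

    Returns the nodes that received the change. All names here are
    hypothetical; this is not a real Azure API.
    """
    if slice_size < 1:
        raise ValueError("slice_size must be at least 1")
    updated: List[str] = []
    for start in range(0, len(nodes), slice_size):
        batch = nodes[start:start + slice_size]
        for node in batch:
            deploy(node)
            updated.append(node)
        # Check the whole slice before touching the next one; halt on failure.
        if not all(healthy(node) for node in batch):
            break
    return updated
```

With a slice size of two, a fault that appears on the third node halts the rollout after four nodes have been touched, rather than after every node in the fleet has been updated.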


The key problem, however, was a configuration issue in Azure's storage front ends. "The configuration switch was incorrectly enabled for Azure Blob storage front-ends," wrote Zander. Table storage front-ends record the sequence of the different data types going into a Blob (a service for storing large amounts of unstructured data) and can be used to guide the data's retrieval. The error in the configuration switch appears to have caused an infinite loop.


The original change was meant to improve Azure Storage performance. In test after test, including pre-production staging, it did so and proved reliable, wrote Zander. That may have led to overconfidence and haste in attempting to deploy the update and realize the performance gains.

Whatever the cause, Azure administrators have implemented automated practices that won't allow a human decision to overrule the "flighting" best practice -- using separate and limited implementations for putting new code into production.

In perhaps the clearest outcome of the incident, Zander wrote: "Microsoft Azure had clear operating guidelines, but there was a gap in the deployment tooling that relied on human decisions ... With the tooling updates, the policy is now enforced by the deployment platform itself."
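
One way to picture a policy "enforced by the deployment platform itself" is a check that rejects any rollout plan that skips incremental stages and offers no override switch for an operator. The Python sketch below is hypothetical: `RolloutPlan`, `validate_plan`, `PolicyViolation`, and the 5% first-stage limit are assumptions chosen for illustration, not details from Microsoft's postmortem.

```python
from dataclasses import dataclass
from typing import List


class PolicyViolation(Exception):
    """Raised when a rollout plan breaks the (hypothetical) flighting policy."""


@dataclass(frozen=True)
class RolloutPlan:
    total_nodes: int
    stage_sizes: List[int]  # number of nodes updated in each stage


def validate_plan(plan: RolloutPlan, max_first_stage_fraction: float = 0.05) -> None:
    """Reject plans that skip incremental stages; there is no override flag."""
    if plan.total_nodes <= 0 or not plan.stage_sizes:
        raise PolicyViolation("plan must name at least one node and one stage")
    if any(size <= 0 for size in plan.stage_sizes):
        raise PolicyViolation("every stage must update at least one node")
    if sum(plan.stage_sizes) != plan.total_nodes:
        raise PolicyViolation("stages must cover every node exactly once")
    if len(plan.stage_sizes) < 2:
        raise PolicyViolation("a single-stage (all-at-once) rollout is not allowed")
    # The 5% default limit is an assumed value, not a documented Azure setting.
    if plan.stage_sizes[0] > max_first_stage_fraction * plan.total_nodes:
        raise PolicyViolation("first stage exceeds the allowed slice size")
```

Under these assumptions, a plan such as `RolloutPlan(100, [2, 18, 80])` passes, while `RolloutPlan(100, [100])` raises `PolicyViolation` before any code is deployed.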

Zander acknowledged that cloud operations must become more reliable and said Microsoft will continue to work on that goal. "We sincerely apologize and recognize the significant impact this service interruption may have had on your applications and services," he wrote.


Charles Babcock is an editor-at-large for InformationWeek and author of Management Strategies for the Cloud Revolution, a McGraw-Hill book. He is the former editor-in-chief of Digital News, former software editor of Computerworld and former technology editor of Interactive ...

Comments
Keith Craig,
User Rank: Apprentice
12/23/2014 | 11:32:49 AM
Cloud demand
The compounding expansion of the cloud into enterprise and the consequent rush to meet this demand by cloud-hosts is going to create similar snafus.

And while Microsoft is "big enough" to endure such a critical faux pas, its premature Azure code release will sully its cloud "expertise" among enterprise CIOs and individual developers that might be considering where to find a host - but not for long.

This burgeoning demand will continue to drive and improve the bottom line for most cloud-host providers, no matter their track record in maintaining what is arguably the most important criterion of all for cloud customers - steady and consistent uptime.

Only repeated outages - from Azure or any other cloud host - will serve to curb the market's appetite for that cloud-host. The current buyers' market for cloud-hosting services hasn't paused for a shake-out, yet. But if we've learned anything here at Linode, it is that customers value uptime more than anything else a cloud-host can offer.
David F. Carr,
User Rank: Author
12/22/2014 | 10:52:12 AM
What, someone didn't want to miss happy hour?
What I'd really like to know is, what was the hurry? If staged releases were the norm, what prompted someone to skip that step and roll new code live on a global basis? I envision someone wanting to get out the door early at the end of the day, whether for happy hour or their kid's soccer practice.

Implementing automated controls is a good idea, but I suspect people will still find a way to subvert them from time to time.