Amazon Outage: Multiple Zones A Smart Strategy - InformationWeek

10/23/2012
12:17 PM

Amazon Outage: Multiple Zones A Smart Strategy

Amazon Web Services loses one availability zone Monday at its most heavily used site.

Traffic in Amazon Web Services' most heavily used data center complex, U.S. East-1 in Northern Virginia, was tied up by an outage in one of its availability zones Monday morning. Damage control got underway immediately, but the effects of the outage were felt throughout the day.

Customers were affected shortly after noon Eastern Time, when they were unable to access Amazon's Elastic Beanstalk scaling service and its Elastic Block Store (EBS) service. EBS holds frequently accessed data used by hosted applications such as Salesforce.com's Heroku cloud platform, Pinterest, and news aggregator Reddit. Netflix, GitHub, Minecraft, Airbnb, Fast Company, and Foursquare also reported that they had been affected.

"We are currently experiencing degraded performance for EBS volumes in a single Availability Zone in the US-EAST-1 Region. New launches for EBS backed instances are failing and instances using affected EBS volumes will experience degraded performance," Amazon's Service Health Dashboard reported at 11:26 a.m. Monday.

Other services, such as Amazon's Relational Database Service, depend heavily on EBS.

Teacher forum and education site Edmodo.com noted in a Twitter posting at 2:20 p.m. that its servers were unavailable: "Update: The site is still down. This is a server issue related to Amazon and we will update as soon as we have more info."


Sites that operate on a strict budget often take advantage of the minimal infrastructure costs associated with Amazon cloud services and operate in only one availability zone. But an outage in one zone can sometimes affect the availability of some services in others, as seen in the Easter weekend outage in April 2011.

Savvy customers such as Netflix, which has invested heavily in Amazon's EC2, can sometimes avoid service interruptions by using multiple zones. But as reported by NBC News, some Netflix regional services were affected by Monday's outage.

The outage started as a slowdown in response times and an increase in error rates in the Elastic Block Store service in one availability zone. The site hosts five different zones, or virtual data centers, each with independent telecommunications, power, and backup power. Some customers keep recovery copies of their systems in a second zone to provide a failover mechanism if one availability zone goes down.
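The failover idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the zone labels, the `pick_active_zone` helper, and the health map are hypothetical, and no AWS API is called. A real deployment would get health information from its own monitoring.

```python
def pick_active_zone(primary, standbys, healthy):
    """Return the first healthy zone, preferring the primary.

    `healthy` maps a zone label to True/False. Zones missing from the
    map are treated as unhealthy. Returns None if no zone is healthy.
    """
    for zone in [primary, *standbys]:
        if healthy.get(zone, False):
            return zone
    return None


# Hypothetical zone labels, for illustration only.
status = {"zone-a": False, "zone-b": True}

# The primary zone is down, so traffic moves to the recovery copy.
print(pick_active_zone("zone-a", ["zone-b"], status))  # zone-b

# With no second zone, there is nowhere to fail over to.
print(pick_active_zone("zone-a", [], status))  # None
```

The sketch assumes zones fail independently. As the April 2011 outage showed, trouble in one zone can sometimes affect services in others, so a second zone reduces risk rather than eliminating it.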

Okta, an Amazon EC2-based identity management service, uses all five zones to hedge against outages. "If there's a sixth zone tomorrow, you can bet we'll be in it within a few days. We make use of every possible zone. We need to be up at all times," said Adam D'Amico, Okta's director of technical operations. Netflix service architect Adrian Cockcroft and others have advocated in public forums that customers use more than one zone for their own protection.
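To make the arithmetic behind "use every possible zone" concrete, here is a minimal Python sketch. The zone labels, the instance count, and both helper functions are assumptions made for illustration. It does not call any AWS API or describe how Okta or Netflix actually place their systems.

```python
from collections import Counter
from itertools import cycle


def spread_across_zones(instance_count, zones):
    """Assign instances to zones round-robin; returns {zone: count}."""
    if not zones:
        raise ValueError("at least one zone is required")
    placement = Counter({zone: 0 for zone in zones})
    for _, zone in zip(range(instance_count), cycle(zones)):
        placement[zone] += 1
    return dict(placement)


def surviving_fraction(placement, failed_zone):
    """Fraction of instances still running if failed_zone goes down."""
    total = sum(placement.values())
    if total == 0:
        return 0.0
    return (total - placement.get(failed_zone, 0)) / total


# Hypothetical zone labels, for illustration only.
ZONES = ["zone-a", "zone-b", "zone-c", "zone-d", "zone-e"]

single = spread_across_zones(10, ZONES[:1])  # everything in one zone
multi = spread_across_zones(10, ZONES)       # two instances per zone

print(surviving_fraction(single, "zone-a"))  # 0.0
print(surviving_fraction(multi, "zone-a"))   # 0.8
```

With ten instances in a single zone, losing that zone leaves nothing running. Spread evenly across five zones, the same loss leaves 80% of capacity in place, provided the failure stays contained to one zone.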

The trouble for Amazon persisted through the day. At 9:30 p.m. Eastern, its Service Health Dashboard reported, "We are seeing elevated errors rates on APIs related to describing and associating EIP addresses. We are working to resolve these errors. In addition, ELB is experiencing elevated latencies recovering affected load balancers and making changes to existing load balancers. These delays… will improve when that issue is resolved."

At 10:36 p.m. Eastern, it added, "…we expect ELB to recover more quickly now." Most problems were cleared up by 1:30 a.m. Tuesday.

Comments
cmclennan452, 10/30/2012 | 9:37:55 PM
re: Amazon Outage: Multiple Zones A Smart Strategy
Hi Charles, great insight. At Ilesfay (cloud based replication startup) we've never gone down even though we've been using AWS (all regions) since 2009. FYI: Here are some of our key principles for building resilient cloud applications: http://www.ilesfay.com/cms/def...