Turn Failure Detection into a Team Sport - InformationWeek

Commentary
2/21/2020
07:00 AM
Prasad Ramakrishnan, CIO, Freshworks

Turn Failure Detection into a Team Sport

Here's how Chaos GameDays and its spinoffs can enable enterprises to fortify their infrastructure resilience and detect failures before they occur.

Image: Olivier LeMoal - stockadobe.com

Preventing IT infrastructure failure is serious business. So is Chaos GameDays, the somewhat whimsical name given to the series of “chaos engineering” exercises designed to detect failures before they occur.

Count me as one of Chaos GameDays’ many proponents. From an operational and business perspective, proactive failure detection is far more sensible than reactive failure response.

Played periodically under defined rules, Chaos GameDays is designed to simulate a wide range of scenarios, including attempts to hack into and break system components. This is done not just to predict system failure but also to build greater system resilience to prevent failure from ever occurring.

Think of it like a flu vaccine

As noted by the Gremlin Community, a good analogy for Chaos GameDays is that it is akin to a flu vaccine: injecting “a potentially harmful foreign body in order to prevent illness.”

Chaos GameDays is the gamification subset of Chaos Engineering, pioneered by Netflix circa 2010 just as the video-streaming company was transitioning to a distributed, cloud-based architecture. To protect these revolutionary yet extremely complex systems, Netflix -- soon joined by the world’s largest tech enterprises -- realized they needed new ways to predict failures in order to prevent them.

“If we aren’t constantly testing our ability to succeed despite failure, then it isn’t likely to work when it matters most -- in the event of an unexpected outage,” Netflix wrote in its company blog soon after implementing the innovative approach. “The best way to avoid failure is to fail constantly.” And with so many more streaming services available today than a few years ago, Netflix certainly doesn’t want its existing customers to consider other options and stream elsewhere.

From there, the idea of Chaos GameDays was born, conceived by Orion Labs founder Jesse Robbins. His lightbulb moment occurred when he realized the best way to fix major failures was to create them -- and that gamifying the process would be a fun, team-oriented approach to develop crisis-preparedness frameworks that can maintain, protect and enhance an enterprise’s infrastructure.

GameDays or not, best practices remain the same

Time for a disclaimer: My company doesn’t engage in typical GameDays practices, but we do assemble DevOps teams that run similar infrastructure stress tests approximately every 15 weeks. These test runs are designed to mimic possible -- and sometimes even impossible -- hypothetical situations in order to determine how effectively our teams’ proposed solutions mitigate risk and prevent incidents, and how quickly our teams can respond when failure occurs.

Whether you follow the Chaos GameDays route or implement other team-oriented failure-detection exercises, following a few basic best practices will go a long way toward keeping your operations running optimally when it matters most. They include using AI-based data analysis to help identify whether certain combinations of incidents or recurring patterns of issues in each exercise point to specific disasters-in-waiting.
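As a rough illustration of what spotting recurring combinations could look like, the sketch below counts how often pairs of incident types show up together across exercises. It is a plain frequency count standing in for real AI-based analysis, and the exercise data, incident labels and the `recurring_pairs` helper are all hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical incident labels recorded in each exercise (illustrative data only).
exercises = [
    {"db_failover", "dns_timeout", "pager_missed"},
    {"db_failover", "pager_missed"},
    {"dns_timeout", "cache_stampede"},
    {"db_failover", "pager_missed", "cache_stampede"},
]


def recurring_pairs(runs, min_count=2):
    """Return incident pairs that co-occur in at least min_count exercises."""
    counts = Counter()
    for incidents in runs:
        # sorted() gives each pair a stable order, so (a, b) and (b, a) count together.
        counts.update(combinations(sorted(incidents), 2))
    return {pair: n for pair, n in counts.items() if n >= min_count}


if __name__ == "__main__":
    for pair, n in recurring_pairs(exercises).items():
        print(pair, n)
```

A pair that keeps reappearing, such as a database failover that coincides with a missed page, is the kind of combination worth a closer look before the next exercise.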

It’s also important to identify points of failure, including personnel availability and readiness; to define keywords that describe each problem and how serious it is; and to refine your communication templates so you aren’t wasting time composing one-off messages in an emergency.
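One way to picture shared severity keywords and a ready-made message template is the minimal sketch below. The severity names, their meanings, the template wording and the `render_status` helper are assumptions made for illustration, not a description of any particular team's playbook.

```python
from enum import Enum
from string import Template


class Severity(Enum):
    # Hypothetical severity vocabulary; adapt the words and meanings to your own teams.
    SEV1 = "customer-facing outage"
    SEV2 = "degraded service"
    SEV3 = "internal impact only"


# Pre-written message so responders only fill in the blanks during an incident.
STATUS_TEMPLATE = Template(
    "[$level] $summary. Impact: $impact. Next update in $minutes minutes."
)


def render_status(severity, summary, minutes):
    """Fill the shared template for one incident update."""
    return STATUS_TEMPLATE.substitute(
        level=severity.name,
        summary=summary,
        impact=severity.value,
        minutes=minutes,
    )


if __name__ == "__main__":
    print(render_status(Severity.SEV2, "Checkout API latency is elevated", 30))
```

Keeping the vocabulary in one place means every update uses the same words for the same level of seriousness, which is the point of agreeing on keywords before an emergency.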

Then, make sure every team member responds to questions like these to ensure that everybody has the same focus and objectives:

  • How would you respond to each incident?
  • What are the predicted times to resolution?
  • Do you understand our existing disaster-response policies?
  • Do we have communication messaging templates ready so that we aren’t wasting time in an emergency?
  • What should we include in our playbook for those responding to incidents?

All enterprises -- particularly those whose survival and success depend on delivering exceptional customer experiences -- require hyper-resilient infrastructures and the appropriate IT service management (ITSM) tools that can sift through, tag and route issues. The most successful businesses, though, know that diving into the chaos of incident-prediction and incident-prevention is critical to staying ahead of the game.

 

Prasad Ramakrishnan is CIO of Freshworks, a customer engagement software company. With over 25 years of experience in the IT sector, Ramakrishnan manages the business systems, business intelligence and global IT infrastructure of Freshworks. Over the last decade he championed the transition to a cloud and SaaS-based infrastructure at companies like Veeva Systems, HotChalk, Bodhtree, Infoblox and FormFactor.
