How Netflix survived the Amazon EC2 reboot


The video streaming service was able to stay online even as its cloud hosting provider rebooted its servers

Sometimes the best path to success is to learn how to avoid failure.

Netflix kept serving its customers while its cloud hosting provider, Amazon Web Services (AWS), rebooted servers, because it had prepared for just such an event.

"When we got the news about the emergency EC2 [Elastic Compute Cloud] reboots, our jaws dropped. When we got the list of how many Cassandra nodes would be affected, I felt ill," said Christos Kalantzis, Netflix engineering manager of cloud database engineering, in a Netflix blog post discussing the reboot.

Amazon announced to EC2 customers on Sept. 25 that it would be updating its servers and that a small percentage would require a reboot, which could potentially disrupt customer services. AWS did not specify which of its virtual hosts would be rebooted or when. It was revealed later that AWS was fixing a vulnerability in the Xen hypervisor, which underpins EC2.

Netflix is one of Amazon's largest customers. And its 50 million customers expect to be able to stream TV shows, movies and other content at any time. If Netflix wasn't prepared to mitigate potential outages, the company -- and not Amazon -- would have a lot of angry customers.

But Netflix had architected its service to be resilient, so that if one Amazon data center went down, operations could be switched over to another with barely a noticeable bump to customers. It also looked for ways to minimize downtime that occurred when its services did need to be rebooted.

The company even went the extra mile and actively tried to disrupt its own services with a set of tools called the Simian Army, which includes Chaos Monkey and is designed to periodically and randomly kill Netflix services. The thinking goes that any Netflix service should be resilient enough to keep running through an attack from one such tool. If it isn't, then the Netflix engineers redesign the service to make it more reliable.
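To make the idea concrete, here is a minimal sketch of that kind of failure injection. It is not Netflix's code: the `Service` class, `kill_random_instance` and `run_chaos_round` are hypothetical names invented for illustration, and the "instances" are plain in-memory flags rather than real servers.

```python
import random


class Service:
    """A hypothetical service made up of interchangeable instances."""

    def __init__(self, name, instance_count):
        self.name = name
        # Map each instance id to True (running) or False (killed).
        self.instances = {f"{name}-{i}": True for i in range(instance_count)}

    def healthy_instances(self):
        return [iid for iid, up in self.instances.items() if up]

    def is_available(self, min_healthy=1):
        # The service counts as available while enough instances still run.
        return len(self.healthy_instances()) >= min_healthy


def kill_random_instance(service, rng):
    """Kill one randomly chosen healthy instance; return its id, or None."""
    healthy = service.healthy_instances()
    if not healthy:
        return None
    victim = rng.choice(healthy)
    service.instances[victim] = False
    return victim


def run_chaos_round(services, rng, min_healthy=1):
    """Kill one instance per service and list the services that went down."""
    failed = []
    for service in services:
        kill_random_instance(service, rng)
        if not service.is_available(min_healthy):
            failed.append(service.name)
    return failed
```

In this toy model, a service with three instances survives a round while a service with a single instance does not, which is the kind of weakness such a tool is meant to surface before a real failure does.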

Even with its systems hardened by abuse from Chaos Monkey and other Simian Army tools, the engineers were still worried about the AWS reboot.

In particular, concern centered on the 2,700 Cassandra nodes that the company runs on AWS.

Databases, as the blog post pointed out, are "the pampered and spoiled princes of the application world." They are run on the best hardware, get lots of attention from database engineers and still can be fussy creatures.

Netflix deliberately chose to use the Cassandra database over more traditional choices such as Oracle's databases because, as a NoSQL database, Cassandra could be spread across multiple servers in such a way that if one of the nodes failed, the database could keep running. Over the past year, the company had been subjecting Cassandra to Chaos Monkey testing, with promising results.
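The property described above, a database that keeps answering when a node fails, can be illustrated with a toy replicated key-value store. This is a simplified sketch and not how Cassandra is actually implemented: `TinyCluster`, its replica-placement rule and the replication factor of 3 are assumptions made for the example.

```python
import hashlib


class TinyCluster:
    """A toy key-value store that copies every key to several nodes."""

    def __init__(self, node_names, replication_factor=3):
        self.order = sorted(node_names)
        self.data = {name: {} for name in self.order}
        self.up = {name: True for name in self.order}
        # Never ask for more replicas than there are nodes.
        self.rf = min(replication_factor, len(self.order))

    def replicas_for(self, key):
        # Hash the key to pick a starting node, then take the next nodes in order.
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        start = int(digest, 16) % len(self.order)
        return [self.order[(start + i) % len(self.order)] for i in range(self.rf)]

    def write(self, key, value):
        """Store the value on every replica that is up; return how many took it."""
        written = 0
        for name in self.replicas_for(key):
            if self.up[name]:
                self.data[name][key] = value
                written += 1
        return written

    def read(self, key):
        """Return the value from the first live replica that holds the key."""
        for name in self.replicas_for(key):
            if self.up[name] and key in self.data[name]:
                return self.data[name][key]
        raise KeyError(key)
```

With four nodes and three copies of each key, taking any one replica offline still leaves two that can serve a read; only losing all three makes the key unavailable.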

The AWS reboot would be the first true test of Cassandra's reliability, however. The entire cloud database engineering team was on alert.

In the end, thanks to Chaos Monkey testing, almost all of the Cassandra nodes remained online. Of the 218 Cassandra nodes that were rebooted, only 22 did not return to a full operational state, and those were successfully restarted with minimal human intervention.

"Repeatedly and regularly exercising failure, even in the persistence layer, should be part of every company's resilience planning," the blog concluded. "If it wasn't for Cassandra's participation in Chaos Monkey, this story would have ended much differently."

Joab Jackson covers enterprise software and general technology breaking news for The IDG News Service. Follow Joab on Twitter at @Joab_Jackson. Joab's e-mail address is Joab_Jackson@idg.com
