Outage caused by single admin mortifies cloud provider Joyent

Joyent is looking at how to improve software and operational procedures to prevent a recurrence

Cloud provider Joyent suffered an outage on Tuesday after an administrator was able to simultaneously reboot all virtual servers hosted in the company's US-East-1 data center.

"It should go without saying that we're mortified by this. While the immediate cause was operator error, there are broader systemic issues that allowed a fat finger to take down a datacenter," said Bryan Cantrill, CTO at Joyent, in a post on Hacker News.

The company first noticed something had gone wrong when it started seeing transient availability issues.

"Due to an operator error, all compute nodes in US-East-1 were simultaneously rebooted. Some compute nodes are already back up, but due to very high load on the control plane, this is taking some time," Joyent said in an initial update on the issue.

About an hour after first reporting the problem, the company said that all compute nodes and virtual machines were back online.

Joyent didn't say how many customers or servers were affected by the reboot. However, an error of this magnitude shouldn't be allowed to happen, and it highlights the importance of processes that balance the need for effective management with the need to protect users against these kinds of issues.

"As soon as we reasonably can, we will be providing a full postmortem of this: how this was architecturally possible, what exactly happened, how the system recovered, and what improvements we are and will be making," Cantrill wrote.

The company is looking at how it can improve software and operational procedures to ensure that this doesn't happen in the future, and also how the recovery after a failure can be made smoother, according to Cantrill.

Just like any IT system, cloud-based services and servers can suffer from outages, but because of the large number of users, the consequences are usually larger.

This week some Amazon Web Services users were hit by a power outage. Servers in one of the US-West-1 region's availability zones were affected, and it took almost three hours for Amazon to recover all instances. Amazon didn't elaborate on what caused the power failure.

Recently, Twitter also suffered an outage after a change to one of its core services went wrong, and HBO twice angered users of its Go service when it was overwhelmed by the number of people who wanted to watch the season premiere of "Game of Thrones" and the finale of "True Detective."

Send news tips and comments to mikael_ricknas@idg.com

