Outage caused by single admin mortifies cloud provider Joyent

Joyent is looking at how to improve software and operational procedures to prevent a recurrence

Cloud provider Joyent suffered an outage on Tuesday after an administrator was able to simultaneously reboot all virtual servers hosted in the company's US-East-1 data center.

"It should go without saying that we're mortified by this. While the immediate cause was operator error, there are broader systemic issues that allowed a fat finger to take down a datacenter," said Bryan Cantrill, CTO at Joyent, in a post on Hacker News.

The company first noticed something had gone wrong when it started seeing transient availability issues.

"Due to an operator error, all compute nodes in US-East-1 were simultaneously rebooted. Some compute nodes are already back up, but due to very high load on the control plane, this is taking some time," Joyent said in an initial update on the issue.

About an hour after first reporting the problem, the company said that all compute nodes and virtual machines were back online.

Joyent didn't say how many customers or servers were affected by the reboot. However, an error of this magnitude shouldn't be allowed to happen, and it highlights the importance of processes that balance the need for effective management with the need to protect users against these kinds of issues.

"As soon as we reasonably can, we will be providing a full postmortem of this: how this was architecturally possible, what exactly happened, how the system recovered, and what improvements we are and will be making," Cantrill wrote.

The company is looking at how it can improve software and operational procedures to ensure that this doesn't happen in the future, and also how the recovery after a failure can be made smoother, according to Cantrill.

Just like any IT system, cloud-based services and servers can suffer from outages, but because of the large number of users, the consequences are usually larger.

This week some Amazon Web Services users were hit by a power outage. Servers in one of the US-West-1 region's availability zones were affected, and it took almost three hours for Amazon to recover all instances. Amazon didn't elaborate on what caused the power failure.

Recently, Twitter also suffered an outage after a change to one of its core services went wrong, and HBO twice angered users of its Go service when it was overwhelmed by the number of people who wanted to watch the season premiere of "Game of Thrones" and the finale of "True Detective."

Send news tips and comments to mikael_ricknas@idg.com
