Microsoft offers credits for 'Leap Day' Azure outage

To make up for a string of outages caused by a software bug in its Azure cloud services, Microsoft is granting affected customers a 33% credit for the time they were left stranded during the Feb. 29 failure.

Some Azure services were unaffected, and there is no credit being offered for them, the company says in its Windows Azure blog.

BACKGROUND: Microsoft's Azure cloud suffers serious outage

The problem stemmed from two overlapping circumstances: Feb. 29 comes around only every four years, and when Azure initializes virtual machines for customer applications, a certificate is exchanged and given a valid-to date one year out. Certificates issued starting at 4 p.m. PST on Feb. 28 were given a valid-to date of Feb. 29, 2013, a date that does not exist and was therefore interpreted as invalid.
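The date arithmetic behind this kind of bug can be illustrated with a minimal Python sketch. It is not Microsoft's code: the function names are hypothetical, and Python's `date` type rejects the nonexistent date outright, whereas the article describes a certificate that was created and then treated as invalid.

```python
from datetime import date


def naive_valid_to(issued: date) -> date:
    """Add one year by bumping only the year field (hypothetical helper).

    Raises ValueError when `issued` is Feb. 29, because the following
    year has no Feb. 29.
    """
    return issued.replace(year=issued.year + 1)


def safe_valid_to(issued: date) -> date:
    """Add one year, falling back to Feb. 28 when the target date does not exist."""
    try:
        return issued.replace(year=issued.year + 1)
    except ValueError:
        # Only Feb. 29 can fail here; clamp to the last valid day of February.
        return issued.replace(year=issued.year + 1, day=28)
```

Clamping to Feb. 28 is only one possible policy; rolling forward to March 1 or adding a fixed 365 days would also avoid the nonexistent date.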

This glitch set off a series of retries that also failed and led to the conclusion by the system that the hardware on which the virtual machines were running had failed. That led to attempts to migrate the failed virtual machines to other server hardware within the same Azure cluster, which consists of about 1,000 physical servers.

The migrated VMs also failed to initialize for the same reason, and more and more hardware was judged to have failed until a threshold was reached that halted all attempts to reincarnate virtual machines anywhere in the affected clusters. That allowed those clusters to stay in service at reduced levels, the blog says.
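The cascade described above can be sketched as a simple loop. This is an illustrative model only: the function name, the retry count and the threshold value are assumptions, not figures from Microsoft's blog.

```python
from typing import Callable, Iterable


def run_recovery(
    servers: Iterable[int],
    start_vm: Callable[[int], bool],
    max_retries: int = 3,          # assumed value, for illustration
    failure_threshold: int = 10,   # assumed value, for illustration
) -> dict:
    """Try to place a VM, blaming the hardware whenever every retry fails."""
    failed = []
    for server in servers:
        if len(failed) >= failure_threshold:
            # Too many servers judged failed: stop migrating altogether.
            return {"failed": failed, "halted": True, "placed_on": None}
        if any(start_vm(server) for _ in range(max_retries)):
            return {"failed": failed, "halted": False, "placed_on": server}
        # All retries failed, so the server is (wrongly) marked as bad hardware.
        failed.append(server)
    return {
        "failed": failed,
        "halted": len(failed) >= failure_threshold,
        "placed_on": None,
    }
```

If `start_vm` always returns `False`, as it effectively did when every new certificate was invalid, the loop condemns one healthy server after another until the threshold stops it. Calling `run_recovery(range(1000), lambda s: False)` marks ten servers as failed and then halts.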

Azure also shut down the customer service management platform so customers couldn't add applications or expand capacity for running applications, both of which would have made the problem worse by calling for new virtual machines. "This is the first time we've ever taken this step," the blog says. Running applications were left intact.

It took 13 hours and 23 minutes to patch the bug in all but seven Azure clusters. Those seven were in the midst of a software upgrade, and so posed a separate problem. Should the host agents and guest agents that were exchanging the invalid certificates be upgraded to the newest patched versions or restored to the old versions but patched?

Microsoft chose the latter, but that didn't work out because it didn't also revert to an earlier version of the network plug-in that configures a virtual machine's network. The new network plug-in was incompatible with the old host agents and guest agents. The result was that all virtual machines in those seven clusters were disconnected from the network.

The affected clusters included servers for Access Control Service (ACS) and Windows Azure Service Bus, both of which failed as a result. Cleaning up this problem entirely took until 2:15 a.m. March 1, the blog says.

Microsoft is taking three steps to prevent a similar problem. First, it will test for time incompatibilities in its software. It will also change its fault isolation so the system doesn't assume a hardware failure in this type of circumstance. And third, it will allow for a graceful degradation of customer management rather than turning the platform off altogether. This will allow blocking new virtual machines or expansion of old ones but continue to allow management of existing virtual machines.
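The third step, graceful degradation, amounts to gating operations by mode rather than switching everything off. The sketch below is a hypothetical illustration; the mode names and operation names are assumptions and do not come from Azure's management API.

```python
from enum import Enum


class Mode(Enum):
    NORMAL = "normal"
    DEGRADED = "degraded"   # block anything that needs new virtual machines
    OFFLINE = "offline"     # the all-or-nothing shutdown used on Feb. 29


# Hypothetical operation names for requests that would create new VMs.
NEEDS_NEW_VMS = {"deploy_application", "scale_out"}


def is_allowed(operation: str, mode: Mode) -> bool:
    """Decide whether a management request may proceed in the given mode."""
    if mode is Mode.OFFLINE:
        return False
    if mode is Mode.DEGRADED:
        return operation not in NEEDS_NEW_VMS
    return True
```

In degraded mode a request such as `"restart_instance"` would still be served, while `"scale_out"` would be refused until the platform returned to normal.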

The company is also upgrading its detection so issues are discovered and addressed more quickly. It is also upgrading the customer dashboard to remain more available in crises.

Because customer service lines were swamped, customers had to wait a long time for help, so the company is reevaluating staffing and considering better use of blogs, Twitter and Facebook to get the word out about problems.

To help during recovery from outages, the company is creating internal software tools, setting priorities to reestablish customer services more quickly and giving customers better visibility into the progress being made to restore services.


