Microsoft offers credits for 'Leap Day' Azure outage

To make up for a string of outages caused by a software bug in its Azure cloud services, Microsoft is granting affected customers a 33% credit for the time they were left stranded during the Feb. 29 failure.

Some Azure services were unaffected, and there is no credit being offered for them, the company says in its Windows Azure blog.

BACKGROUND: Microsoft's Azure cloud suffers serious outage

The problem stemmed from two overlapping circumstances: Feb. 29 comes around only every four years, and when Azure initializes virtual machines for customer applications, a certificate is exchanged and given a valid-to date one year out. Certificates issued starting at 4 p.m. PST on Feb. 28 (midnight UTC on Feb. 29) were given a valid-to date of Feb. 29, 2013, which does not exist and was therefore interpreted as invalid.
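The date arithmetic behind this kind of bug can be illustrated with a short Python sketch. This is not Azure's code; the function names are hypothetical, and falling back to Feb. 28 is only one possible policy, assumed here for illustration.

```python
from datetime import date


def naive_valid_to(issued: date) -> date:
    # Keep the month and day and add one to the year.
    # For an issue date of Feb. 29 this raises ValueError,
    # because the following year has no Feb. 29.
    return issued.replace(year=issued.year + 1)


def safe_valid_to(issued: date) -> date:
    # Hypothetical fix: if the same day does not exist next year,
    # fall back to Feb. 28 of that year (an assumed policy).
    try:
        return issued.replace(year=issued.year + 1)
    except ValueError:
        return issued.replace(year=issued.year + 1, day=28)
```

With these definitions, `naive_valid_to(date(2012, 2, 29))` raises `ValueError`, while `safe_valid_to(date(2012, 2, 29))` returns `date(2013, 2, 28)`.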

This glitch set off a series of retries that also failed, leading the system to conclude that the hardware on which the virtual machines were running had failed. It then tried to migrate the failed virtual machines to other server hardware within the same Azure cluster, which consists of about 1,000 physical servers.

The migrated VMs failed to initialize for the same reason, and more and more hardware was judged to have failed until a threshold was reached that halted all attempts to reincarnate virtual machines anywhere in the affected clusters. That allowed those clusters to stay in service at reduced levels, the blog says.
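The cascade described above can be sketched as a toy simulation. The function, its parameters and the threshold of 50 used in the usage note are assumptions for illustration only; the article gives the cluster size (about 1,000 servers) but not the actual threshold or Azure's logic.

```python
def simulate_cascade(servers: int, fault_threshold: int, vm_can_start: bool) -> dict:
    """Toy model: a VM that fails to start gets its server marked faulty
    and is moved to the next server, until a threshold halts migration."""
    faulty = set()
    for server in range(servers):
        if vm_can_start:
            # The VM initializes, so no further migration is needed.
            return {"faulty": len(faulty), "halted": False, "running_on": server}
        # A software failure is misread as a hardware failure.
        faulty.add(server)
        if len(faulty) >= fault_threshold:
            # Safety valve: stop migrating anywhere in the cluster.
            return {"faulty": len(faulty), "halted": True, "running_on": None}
    # Ran out of servers before reaching the threshold.
    return {"faulty": len(faulty), "halted": False, "running_on": None}
```

Calling `simulate_cascade(1000, 50, vm_can_start=False)` marks 50 servers as faulty and then halts, which mirrors how a bug that follows the VM from server to server keeps condemning healthy hardware until the threshold intervenes.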

Microsoft also shut down the customer service management platform so customers couldn't add applications or expand capacity for running applications; either action would have made the problem worse by calling for new virtual machines. "This is the first time we've ever taken this step," the blog says. Running applications were left intact.

It took 13 hours and 23 minutes to patch the bug in all but seven Azure clusters. Those seven were in the midst of a software upgrade, which posed a separate question: should the host agents and guest agents that were exchanging the invalid certificates be upgraded to the newest, patched versions or restored to the old versions with the patch applied?

Microsoft chose the latter, but that failed because it didn't also revert to an earlier version of the network plug-in that configures a virtual machine's network. The new network plug-in was incompatible with the old host agents and guest agents. The result was that all virtual machines in those seven clusters were disconnected from the network.
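A pre-deployment compatibility check of the kind that would catch such a mismatch might look like the following sketch. The component labels and the compatibility table are entirely hypothetical; the article does not describe how Azure tracks versions.

```python
# Hypothetical table: which agent builds each network plug-in build supports.
COMPATIBLE_AGENTS = {
    "plugin-old": {"agent-old", "agent-old-patched"},
    "plugin-new": {"agent-new", "agent-new-patched"},
}


def is_compatible(plugin: str, agent: str) -> bool:
    # Unknown plug-in builds are treated as incompatible with everything.
    return agent in COMPATIBLE_AGENTS.get(plugin, set())
```

In this sketch, `is_compatible("plugin-new", "agent-old-patched")` is `False`, the combination the article describes, while `is_compatible("plugin-old", "agent-old-patched")` is `True`.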

The affected clusters included servers for Access Control Service (ACS) and Windows Azure Service Bus, both of which failed as a result. Cleaning up this problem entirely took until 2:15 a.m. March 1, the blog says.

Microsoft is taking three steps to prevent a similar problem. First, it will test for time incompatibilities in its software. Second, it will change its fault isolation so the system doesn't assume a hardware failure in this type of circumstance. Third, it will allow for a graceful degradation of customer management rather than turning the platform off altogether. That will block the creation of new virtual machines and the expansion of existing ones while continuing to allow management of existing virtual machines.
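The graceful-degradation idea in the third step can be sketched as a simple gate on management operations. The mode names and operation names below are assumptions made for illustration and do not come from Microsoft's blog.

```python
from enum import Enum


class Mode(Enum):
    NORMAL = "normal"
    DEGRADED = "degraded"


# Hypothetical operations that would call for new virtual machines.
CAPACITY_ADDING = {"deploy_application", "scale_out"}


def is_allowed(operation: str, mode: Mode) -> bool:
    # In degraded mode, refuse only operations that add capacity;
    # management of existing virtual machines stays available.
    if mode is Mode.DEGRADED and operation in CAPACITY_ADDING:
        return False
    return True
```

Under this sketch, `is_allowed("scale_out", Mode.DEGRADED)` is `False`, but `is_allowed("restart_vm", Mode.DEGRADED)` remains `True`, so customers keep control of what is already running.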

The company is also upgrading its detection so issues are discovered and addressed more quickly, and improving the customer dashboard so it stays available during crises.

Because customer service lines were swamped, customers had to wait a long time for help, so the company is reevaluating staffing and considering better use of blogs, Twitter and Facebook to get the word out about problems.

To help during recovery from outages, the company is creating internal software tools, setting priorities to reestablish customer services more quickly and giving customers better visibility into the progress of service restoration.
