Microsoft blames WGA meltdown on human error
- 30 August, 2007 08:12
Microsoft Corp. said late yesterday that last weekend's failure of the anti-piracy process it requires of Windows XP and Vista users was due to "human error." The company also said the incident shouldn't be called an "outage," since the servers didn't go offline, and promised that changes have been made to avoid a repeat.
In an earlier statement, the company had downplayed the scope of the problem, saying that fewer than 12,000 systems worldwide had been affected.
In a post to the Windows Genuine Advantage blog, program manager Alex Kochis, normally the public voice for the team, explained the malfunction of the company's validation servers in the greatest detail so far.
"Nothing more than human error started it all," said Kochis. "Pre-production code was sent to production servers. The production servers had not yet been upgraded with a recent change to enable stronger encryption/decryption of product keys during the activation and validation processes. The result of this is that the production servers declined activation and validation requests that should have passed."
Microsoft's anti-counterfeit measures come in two flavors: activation and validation. The former requires users to enter a valid 25-character product key to prove they've paid for a license; the latter is the term used for all subsequent proof-of-purchase demands, and engages, for instance, before users are allowed to download most software from the company's Web site.
The problem affected both the activation and validation servers, but while a quick roll-back -- within 30 minutes, according to Kochis -- solved the activation servers' problems, it failed to reset the validation servers. "We now realize that we didn't have the right monitoring in place to be sure the fixes had the intended effect," he said.
From the timeline he offered up in earlier postings, the failure started on Friday, Aug. 24, at about 6:30 p.m. EDT. It's unclear how long Microsoft was unaware of the problem, although it was presumably measured in hours rather than minutes. "Through a combination of posts to our forum and customer support the issue was discovered by Friday evening," he said. By Saturday, Aug. 25, at 2:15 p.m., the servers were again validating Windows correctly. The total time of the malfunction: 19 hours, 45 minutes.
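The 19-hour, 45-minute figure can be checked with a few lines of arithmetic. This is only an illustrative sketch: it assumes both timestamps are in the same time zone (the article gives EDT for the start time only), and the variable names are made up for the example.

```python
from datetime import datetime

# Start and end of the malfunction as reported in the timeline.
# Assumption: both times are in the same zone (EDT).
start = datetime(2007, 8, 24, 18, 30)  # Friday, Aug. 24, about 6:30 p.m.
end = datetime(2007, 8, 25, 14, 15)    # Saturday, Aug. 25, 2:15 p.m.

elapsed = end - start

# Split the elapsed time into whole hours and leftover minutes.
hours, remainder = divmod(int(elapsed.total_seconds()), 3600)
minutes = remainder // 60

print(f"{hours} hours, {minutes} minutes")
```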
Kochis also took pains to set the record straight about how the problem had been characterized. "It's important to clarify that this event was not an outage," he said. "This event was not the same as an outage because in this case the trusted source of validations itself responded incorrectly."
Contrary to what most users believed when they lit up Microsoft's support forums Friday night and Saturday, Kochis said PCs running XP or Vista automatically default to genuine -- the term Microsoft applies to machines running legitimate copies of its operating systems -- if Microsoft's validation servers are offline. "In other words, we designed WGA to give the benefit of the doubt to our customers. If our servers are down, your system will pass validation every time."
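The distinction Kochis draws between an outage and a wrong answer can be sketched in a few lines. This is a hypothetical illustration of the "benefit of the doubt" behavior described above, not Microsoft's code; the function name and the use of None to stand for unreachable servers are assumptions made for the example.

```python
from typing import Optional


def client_verdict(server_response: Optional[bool]) -> bool:
    """Return True if the PC is treated as genuine.

    server_response is the validation servers' answer (True = genuine,
    False = not genuine), or None if the servers could not be reached.
    """
    if server_response is None:
        # Servers offline: give the customer the benefit of the doubt.
        return True
    # Servers answered: the client trusts whatever they said,
    # even if the answer itself is wrong.
    return server_response


# An outage would not have flagged anyone as non-genuine.
print(client_verdict(None))
# Last weekend's failure: servers were up but declined valid requests.
print(client_verdict(False))
```

Under this kind of design, the fallback only protects users when the servers are silent. A server that responds incorrectly bypasses it, which matches the behavior users reported.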
He also expanded on previous statements about how Windows Vista users were affected when their copies were deemed counterfeit. Although several high-profile features were automatically disabled -- among them the Aero graphical user interface -- and the Windows Defender anti-spyware application was partially crippled, the three- or 30-day clock that might have led to what Microsoft calls "reduced functionality mode" (RFM) never started ticking. When a Vista PC drops into RFM, it only runs Internet Explorer -- and then only for an hour at a time.
In fact, Kochis said, the disappearance of Vista features and reductions in functionality worked as planned. "Disabling the genuine-only features is meant to provide notice to the customer of the state of the system," he said. "When disabled, the features present their own error messages relating to the system not being genuine."
Kochis assured users that changes have been made, with more to follow. "We are improving our monitoring capabilities to alert us much sooner should anything like this happen again," he said. "We're also working through a list of additional changes such as increasing the speed of escalations and adding checkpoints before changes can be made to production servers."
On Monday, analysts took Microsoft to task for the WGA failure, and questioned the reliability of critical company processes. "Why don't they have a workable fail-over strategy for this service?" asked Michael Cherry, an analyst with Directions on Microsoft. "What does this say about the resiliency of Microsoft's services? After all, there will be failures."
Microsoft hasn't issued a formal apology to users, but Kochis seemed to say, in a roundabout way, that the company would try harder. "As an organization we've come a long way since this program began and it's difficult knowing that this event confused, inconvenienced, and upset our customers. It's unfortunate this happened to users with genuine systems."
Users who tried to validate their copies of XP or Vista during the breakdown and are still seeing the non-genuine warnings should head to this page on the Microsoft site and click on the Validate Windows button.