The bulk update behind Google Cloud’s global service disruptions

The trigger of the incident was a bulk update of group memberships that expanded to an unexpectedly high number of modified permissions

In the late afternoon on 26 March (US Pacific time), Google warned users that several of its cloud services were experiencing disruptions due to issues with Google Cloud infrastructure components, and that the creation of Cloud Composer environments was failing globally.

At its peak, Google Cloud said the issues had impacted no fewer than 20 of its cloud services in multiple regions, including Dataflow, BigQuery, Dialogflow, Kubernetes Engine, Cloud Firestore, App Engine, Cloud Functions, Cloud Monitoring, Cloud Console and more.

It wasn’t until the following day that Google Cloud said it had resolved the issue with the “Google Cloud infrastructure components” underpinning the disruptions, which affected users across the globe for several hours.

Now, after conducting an internal investigation and taking steps to improve the resiliency of its services, the cloud vendor has revealed the core issue that caused the global disruptions.

“Many cloud services depend on a distributed Access Control List (ACL) in Identity and Access Management (IAM) for validating permissions, activating new APIs [application programming interfaces], or creating new cloud resources,” Google Cloud said in a post dated 1 April. “These permissions are stored in a distributed database and are heavily cached. 

“Two processes keep the database up-to-date; one real-time, and one batch. However, if the real-time pipeline falls too far behind, stale data is served which may impact operations in downstream services.

“The trigger of the incident was a bulk update of group memberships that expanded to an unexpectedly high number of modified permissions, which generated a large backlog of queued mutations to be applied in real-time. 

“The processing of the backlog was degraded by a latent issue with the cache servers, which led to them running out of memory; this in turn resulted in requests to IAM timing out. The problem was temporarily exacerbated in various regions by emergency rollouts performed to mitigate the high memory usage,” it said. 
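To illustrate the mechanism Google describes, the following is a minimal Python sketch of how a bulk change to group memberships can expand into a much larger number of permission mutations, and how a backlog builds when the real-time pipeline applies them at a fixed rate. It is not Google's implementation: the group names, permission names, function names and processing rate are all hypothetical and chosen only for illustration.

```python
from collections import deque

# Hypothetical mapping of groups to the permissions they grant.
GROUP_PERMISSIONS = {
    "eng": ["bigquery.read", "dataflow.run", "gke.deploy"],
    "ops": ["monitoring.view", "gke.deploy"],
}


def expand_membership_changes(changes):
    """Expand (user, group) membership changes into per-permission mutations."""
    mutations = []
    for user, group in changes:
        for permission in GROUP_PERMISSIONS.get(group, []):
            mutations.append((user, permission))
    return mutations


def backlog_after(mutations, applied_per_tick, ticks):
    """Return how many queued mutations remain after the given number of ticks."""
    queue = deque(mutations)
    for _ in range(ticks):
        # Apply at most `applied_per_tick` mutations per tick (assumed fixed rate).
        for _ in range(min(applied_per_tick, len(queue))):
            queue.popleft()
    return len(queue)


# A single bulk update: 1,000 users added to "eng" and 500 to "ops".
changes = [(f"user{i}", "eng") for i in range(1000)]
changes += [(f"user{i}", "ops") for i in range(500)]

mutations = expand_membership_changes(changes)
print(len(changes), "membership changes ->", len(mutations), "permission mutations")
print("backlog after 10 ticks:", backlog_after(mutations, applied_per_tick=100, ticks=10))
```

In this toy model, 1,500 membership changes become 4,000 queued mutations, and 3,000 are still waiting after 10 ticks. Any reader of the permission data in that window would see stale results, which is the failure mode described in the quoted explanation.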

According to Google Cloud, on Thursday 26 March at 4:14PM (US Pacific time), Cloud IAM experienced elevated error rates that caused stale data and disruption across many services for three-and-a-half hours, with disruption to administrative operations continuing for a subset of services for 14 hours.

Additionally, multiple services experienced bursts of Cloud IAM errors, Google Cloud said, adding that the spikes were largely clustered around a handful of specific times between 4:35PM and 7:40PM on 26 March.

“Error rates reached up to 100 per cent in the later two periods as mitigations propagated globally,” Google Cloud said. “As a result, many cloud services experienced concurrent outages in multiple regions, and most regions experienced some impact.”

