Google Cloud has unveiled a new BigQuery service designed to remove one of data science’s primary pain points: having to move and unify data across environments in order to query it.
Named BigQuery Omni, the service's first phase lets private alpha Google Cloud customers blend AWS data into the BigQuery data warehouse to run SQL queries, build dashboards, or push results through APIs, all without physically moving any data. Similar capabilities for Microsoft Azure are “coming soon.”
“Multicloud creates a problem – data becomes siloed and running analytics on that data needs data movement. To solve that problem BigQuery Omni lets customers analyse data no matter where that is: Google Cloud, AWS as a private alpha, and very soon on Microsoft Azure,” Debanjan Saha, GM of data analytics at Google, said during a press conference last week.
Data movement is often cited as one of the biggest pain points for data scientists and analysts, and it frequently carries significant compute costs that must be justified to the finance team.
Here, Saha promises a service which gives users “a consistent data experience using the same SQL and user interface they use in BigQuery for queries, dashboards and to run analytics for consistency and familiarity.”
How BigQuery Omni works
By decoupling storage and compute, BigQuery Omni is said to provide “stateless resilient compute that executes standard SQL queries,” Saha writes. “While competitors will require you to move or copy your data from one public cloud to another, where you might incur egress costs, this is not the case with BigQuery Omni,” he adds.
The service is underpinned by Google Cloud’s Anthos platform, which provides a single, consistent way of managing Kubernetes workloads across on-prem and public cloud environments.
This containerized architecture allows the data to stay in its AWS S3 bucket, where it is queried by Google Cloud’s Dremel engine running natively on an Anthos cluster in the same region as the data. The results are then passed back to BigQuery, or to your data store of choice, where they can be combined with any other relevant data, with no associated data movement costs.
Saha gives the example of a retailer wanting to seamlessly query both their Google Analytics 360 Ads data, which is stored in Google Cloud, and log data from an e-commerce platform, which is stored in AWS S3, to get a fuller picture of customer buying habits.
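A cross-cloud query of that kind might look something like the following sketch, where the Google Analytics 360 export lives in a regular BigQuery dataset and the e-commerce logs in S3 are exposed to BigQuery Omni as an external table. All project, dataset, table, and column names here are illustrative, not taken from the announcement:

```sql
-- Hypothetical sketch: join Google Analytics 360 data (stored in BigQuery)
-- with e-commerce log data (stored in AWS S3, queried in place via Omni).
-- Names are made up for illustration.
SELECT
  ga.fullVisitorId,
  ga.totals.transactions AS transactions,
  COUNT(logs.event_id) AS ecommerce_events
FROM
  `my_project.analytics_360.ga_sessions_20200601` AS ga
JOIN
  `my_project.aws_us_east_1.ecommerce_logs` AS logs  -- backed by S3; no copy
ON
  ga.fullVisitorId = logs.visitor_id
GROUP BY
  ga.fullVisitorId, transactions;
```

Only the query results cross the cloud boundary for the join, which is what keeps the data movement cost at zero.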
This structure also allows Google Cloud to position BigQuery Omni as serverless, allowing users to query data without having to manage the underlying infrastructure.
“It will be serverless on AWS and on Azure when it is available,” Saha explained to the press last week. “The idea is to spin up compute as a shared resource pool and as we have multiple customers running queries we can share and scale up those resources. Run the query on AWS and we will transfer the results to Google and join it with results there.”
Getting started with BigQuery Omni
As Saha outlines in his blog post, once signed up to the private alpha, customers can get started directly within the BigQuery user experience on the Google Cloud console.
You simply select the region where the data is located and run the query; there is no requirement to format or transform the data, whether it is stored as Avro, CSV, JSON, ORC, or Parquet.
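Making S3 data queryable in place typically means defining an external table over the files in their native format. The following is an illustrative sketch only: the connection name, bucket, and DDL shape are assumptions, and the alpha-era syntax may well differ.

```sql
-- Illustrative sketch: expose Parquet files sitting in an S3 bucket as a
-- BigQuery external table, so they can be queried without transformation.
-- Connection, bucket, and table names are hypothetical.
CREATE EXTERNAL TABLE `my_project.aws_us_east_1.ecommerce_logs`
WITH CONNECTION `aws-us-east-1.my_s3_connection`
OPTIONS (
  format = 'PARQUET',                        -- could equally be AVRO, CSV, JSON, or ORC
  uris = ['s3://my-ecommerce-bucket/logs/*']
);
```

Once defined, the table can be queried with the same standard SQL used for any native BigQuery table.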
Results appear in BigQuery or can be exported back to the data store of your choice, with no need to manually move data across clouds. You will, however, have to grant BigQuery access to this data via the other public clouds’ IAM roles.
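On the AWS side, that grant would amount to an IAM policy allowing read access to the relevant S3 bucket, attached to a role that BigQuery can assume. A minimal sketch, with a hypothetical bucket name:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowBigQueryOmniRead",
      "Effect": "Allow",
      "Action": ["s3:ListBucket", "s3:GetObject"],
      "Resource": [
        "arn:aws:s3:::my-ecommerce-bucket",
        "arn:aws:s3:::my-ecommerce-bucket/*"
      ]
    }
  ]
}
```

Azure would use its own role-based access control equivalently once support lands.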
After launch, Omni will be priced in line with BigQuery, either on demand (based on usage) or at a flat rate. There are no additional storage costs beyond what you already pay AWS for S3 storage, or similarly for Azure in future.