MMS • Steef-Jan Wiggers
Article originally posted on InfoQ. Visit InfoQ
Google recently introduced a new type of Pub/Sub subscription, called a “BigQuery subscription,” that writes messages directly from Cloud Pub/Sub to BigQuery. The company claims that this new extract, load, and transform (ELT) path will simplify event-driven architectures.
BigQuery is a fully managed, serverless data warehouse service in Google Cloud that lets customers manage and analyze large datasets (on the order of terabytes or petabytes). Pub/Sub provides messaging between applications and can be used in streaming analytics and data integration pipelines to ingest and distribute data. Previously, customers had to write or run their own pipelines to move data from Pub/Sub into BigQuery. With the new subscription type, they can write to BigQuery directly.
Customers can create a new BigQuery subscription linked to a Pub/Sub topic. For this subscription, they must choose an existing BigQuery table. Furthermore, the table schema must meet certain compatibility requirements; that is, it must be compatible with the schema of the Pub/Sub topic. In a blog post, Qiqi Wu, a product manager at Google, explains the benefit of the schemas:
By taking advantage of Pub/Sub topic schemas, you have the option of writing Pub/Sub messages to BigQuery tables with compatible schemas. If the schema is not enabled for your topic, messages will be written to BigQuery as bytes or strings. After the creation of the BigQuery subscription, messages will now be directly ingested into BigQuery.
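The setup described above (a subscription attached to a topic and pointed at an existing table) can be sketched with the google-cloud-pubsub Python client. This is a minimal illustration, not an official sample: the helper names and all project, topic, subscription, and table identifiers are hypothetical, and the call to the service assumes the client library is installed and credentials are configured.

```python
from typing import Any, Dict


def build_bigquery_subscription_request(
    project_id: str,
    topic_id: str,
    subscription_id: str,
    table: str,
    use_topic_schema: bool = True,
) -> Dict[str, Any]:
    """Build a create-subscription request that targets an existing BigQuery table."""
    return {
        "name": f"projects/{project_id}/subscriptions/{subscription_id}",
        "topic": f"projects/{project_id}/topics/{topic_id}",
        "bigquery_config": {
            # The table must already exist; written as "project.dataset.table".
            "table": table,
            # True: write message fields using the topic schema.
            # False: write the message payload as bytes or a string.
            "use_topic_schema": use_topic_schema,
        },
    }


def create_bigquery_subscription(request: Dict[str, Any]):
    """Send the request to Pub/Sub (needs google-cloud-pubsub and credentials)."""
    # Imported here so the request builder works without the client library.
    from google.cloud import pubsub_v1

    with pubsub_v1.SubscriberClient() as subscriber:
        return subscriber.create_subscription(request=request)


if __name__ == "__main__":
    # Hypothetical identifiers, for illustration only.
    request = build_bigquery_subscription_request(
        project_id="my-project",
        topic_id="orders",
        subscription_id="orders-to-bigquery",
        table="my-project.sales.orders",
    )
    print(request)
```

Keeping the request builder separate from the API call makes it possible to inspect or unit-test the configuration before anything is created in the project.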
Richard Seroter, director of outbound product management at Google Cloud, wrote about the BigQuery subscription in a personal blog post:
When I searched online, I saw various ways that people have stitched together their (cloud) messaging engines with their data warehouse. But from what I can tell, what we did here is the simplest, most-integrated way to pull that off.
However, Marcin Kutan, software engineer at Allegro Group, tweeted:
It could be a #pubsub feature of the year. But without topic schema evolution the adoption will be low. Now, I have to recreate the topic and subscription on every schema change.
Note that the company recommends using Dataflow for Pub/Sub messages that need sophisticated preload transformations or other data processing, such as masking PII, before the data is loaded into BigQuery.
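To make "masking PII" concrete, here is a small sketch of the kind of per-message transformation that would have to run before the data reaches BigQuery. It is plain Python rather than Dataflow code, and the JSON message layout, the field names, and the choice of a truncated SHA-256 digest as the mask are all assumptions made for illustration. In a real pipeline a function like this would be applied as one step of the processing job.

```python
import hashlib
import json

# Hypothetical names of the fields that hold personal data.
SENSITIVE_FIELDS = ("email", "phone")


def mask_pii(message_data: bytes) -> bytes:
    """Replace sensitive fields in a JSON message with a short one-way digest."""
    record = json.loads(message_data.decode("utf-8"))
    for field in SENSITIVE_FIELDS:
        value = record.get(field)
        if value is not None:
            digest = hashlib.sha256(str(value).encode("utf-8")).hexdigest()
            # Keep only a prefix of the digest: enough to group rows by the
            # same value without storing the original value.
            record[field] = digest[:16]
    return json.dumps(record, sort_keys=True).encode("utf-8")


if __name__ == "__main__":
    raw = b'{"order_id": 7, "email": "jane@example.com"}'
    print(mask_pii(raw).decode("utf-8"))
```

A BigQuery subscription writes messages as they arrive, so a step like this is exactly what it cannot do on its own.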
Lastly, ingestion from a Pub/Sub BigQuery subscription into BigQuery costs $50/TiB, based on the data read from the subscription (subscribe throughput). More details on pricing are available on the Pub/Sub pricing page.
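As a rough illustration of what the $50/TiB rate means in practice, the sketch below converts a volume of data read from the subscription into an estimated charge. The function name and the example volume are hypothetical, and the estimate covers only the rate quoted above, not any other charges or allowances listed on the pricing page.

```python
TIB = 2 ** 40  # bytes in one tebibyte
GIB = 2 ** 30  # bytes in one gibibyte
PRICE_PER_TIB_USD = 50.0  # rate quoted in the article


def estimate_ingestion_cost_usd(bytes_read: int) -> float:
    """Estimate the charge for data read from a BigQuery subscription."""
    return bytes_read / TIB * PRICE_PER_TIB_USD


if __name__ == "__main__":
    # Hypothetical workload: 200 GiB per day for 30 days.
    monthly_bytes = 200 * GIB * 30
    print(f"Estimated cost: ${estimate_ingestion_cost_usd(monthly_bytes):.2f}")
```

For the hypothetical workload in the example, 6,000 GiB is about 5.86 TiB, which works out to roughly $293 for the month.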