
Streaming BigQuery Data Into Confluent in Real Time: A Continuous Query Approach

By Dustin Shammo and Jobin George

Confluent Cloud, a leading cloud-native platform for building data streaming applications, and BigQuery, Google's serverless data warehouse, are revolutionizing how businesses handle data. Together, they offer a powerful solution for real-time data ingestion, processing, and analysis, now enhanced by BigQuery’s new continuous queries feature.

Many organizations grapple with challenges in moving data from their data warehouses to real-time processing platforms. Traditional methods often involve complex batch processes or change data capture mechanisms, leading to latency and operational overhead. Additionally, there's a growing need for real-time insights from data, which traditional data warehousing solutions struggle to provide.

By combining Confluent Cloud with BigQuery continuous queries, organizations can overcome these challenges. BigQuery continuous queries enable real-time data processing directly within the data warehouse, while Confluent Cloud efficiently captures and delivers these changes to downstream applications—even when data enrichment or transformation is needed in flight. This integration creates a seamless data pipeline for real-time analytics, fraud detection, and other time-sensitive use cases.

BigQuery continuous queries, an overview

BigQuery continuous queries are SQL statements that run continuously, allowing companies to analyze, transform, and replicate data in real time as new events arrive in BigQuery. They use familiar SQL syntax to define the data analysis and can handle large volumes of data efficiently. Companies can process data as soon as it reaches BigQuery and then push it through Confluent Cloud to the downstream applications and tools that need it. This unlocks real-time use cases powered by data in BigQuery, such as immediate personalization, anomaly detection, real-time analytics, and reverse ETL.

How BigQuery continuous queries work with Confluent Cloud 

The architecture for this integration lets businesses tap into the continuous flow of data produced by BigQuery continuous queries using Confluent Cloud. This opens up a range of real-time use cases, ensuring that the latest information is always available for action. Continuous queries have three export options available today (Bigtable, BigQuery, or Pub/Sub). For this integration we will use the Pub/Sub export and Confluent’s fully managed Google Cloud Pub/Sub Source Connector to stream data in real time as it is added to BigQuery tables.

BigQuery continuous query results can now be streamed in real time into Confluent Cloud, further establishing it as the go-to platform for real-time data. Bringing data directly out of a continuously running query allows for faster insights and decision-making, while integrating with downstream applications for a more comprehensive view. 

Use cases unlocked by this joint solution

Confluent and BigQuery together offer a powerful combination of data streaming, processing, and analytics. By streaming BigQuery results and enriching them with Confluent Cloud in flight, Google Cloud and Confluent deliver enhanced value to customers. This versatility enables seamless integration with various downstream applications, such as Salesforce, MongoDB, and Elastic.

The integration of Confluent and BigQuery continuous queries offers significant advantages for users:

  • Real-time data-driven decisions: Users can harness the power of Confluent's real-time data processing capabilities to derive immediate insights from BigQuery data, enabling faster decision-making and improved business agility.

  • Enhanced application development: Developers can build innovative applications that leverage both the batch processing power of BigQuery and the real-time capabilities of Confluent, creating more engaging and responsive user experiences.

  • Streamlined data pipelines: By eliminating the need for complex batch processes, users can simplify their data pipelines, reduce operational overhead, and accelerate time-to-market for new applications.

  • Improved data governance: Confluent's data governance features, combined with BigQuery's data quality controls, help ensure data consistency and compliance.

  • Cost optimization: Efficient data movement and processing between BigQuery and Confluent can lead to reduced infrastructure costs and improved resource utilization.

  • Unlocking new business opportunities: Real-time insights and data-driven applications enabled by this integration can uncover new revenue streams and create competitive advantages.

The integration with BigQuery continuous queries positions Confluent as a more comprehensive platform for real-time data management, increasing the value it delivers to customers.

How to get started with continuous queries and Confluent Cloud

At the time this blog was written, the continuous queries feature is in preview and subject to the “Pre-GA Offerings Terms.” To enroll in the continuous queries preview, fill out the request form.

Setting up and running BigQuery continuous queries

1) The Confluent Cloud connector uses a Google Cloud service account to connect to Pub/Sub and consume data from it. You can use a single service account both for running continuous queries and for consuming from Pub/Sub by assigning it the relevant permissions. Configure the service account with the permissions listed here. Make sure you create a JSON key, which you will need to configure the connection in Confluent Cloud in a later step.

2) To run continuous queries, BigQuery requires a slot reservation with a CONTINUOUS assignment type. Follow the steps here if you are not sure how to create a reservation.

3) Navigate to the Pub/Sub topic page and click the Create Topic button at the top center of the page. Provide a name (e.g., “continuous_query_topic”) and also create a default subscription if needed.

4) Navigate to the BigQuery service page and design the query as an export to Pub/Sub:

EXPORT DATA
OPTIONS (
format = 'CLOUD_PUBSUB',
uri = 'https://pubsub.googleapis.com/projects/<your project_id>/topics/continuous_query_topic'
) AS
(
<Your Query>
);

In More Settings, set the query mode to “Continuous query,” and in the Query settings, select the service account created above to run the query. You can also set a timeout, if required.
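As a rough illustration of what might replace the <Your Query> placeholder in step 4, the sketch below uses Python only to render the full statement as text. It assumes, based on our understanding of the BigQuery documentation, that a Pub/Sub export expects the query to return each row as a single string column (here built with TO_JSON_STRING and named message). The table `my-project.sales.orders`, its columns, and the helper name build_export_statement are hypothetical and exist only for this example.

```python
from textwrap import dedent, indent


def build_export_statement(project_id: str, topic: str, select_sql: str) -> str:
    """Wrap a SELECT in the EXPORT DATA ... CLOUD_PUBSUB template from step 4."""
    uri = f"https://pubsub.googleapis.com/projects/{project_id}/topics/{topic}"
    # Normalize the indentation of the inner query, then indent it by two spaces.
    body = indent(dedent(select_sql).strip(), "  ")
    return (
        "EXPORT DATA\n"
        "OPTIONS (\n"
        "  format = 'CLOUD_PUBSUB',\n"
        f"  uri = '{uri}'\n"
        ") AS\n"
        "(\n"
        f"{body}\n"
        ");"
    )


# Hypothetical table and columns, used only for illustration.
EXAMPLE_SELECT = """
    SELECT
      TO_JSON_STRING(STRUCT(order_id, customer_id, amount)) AS message
    FROM `my-project.sales.orders`
    WHERE amount > 100
"""

statement = build_export_statement(
    "my-project", "continuous_query_topic", EXAMPLE_SELECT
)
print(statement)
```

The printed statement can be pasted into the BigQuery editor. Serializing each row to one JSON string keeps the downstream record format simple, which fits the JSON data format chosen for the connector later in this walkthrough.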

Before executing the query, make sure the below steps are done to ensure data continuously generated can be captured by the Confluent Cloud Pub/Sub connector.

5) Log in to the Confluent Cloud console and set up the connectors, topics, etc., for capturing, processing, and enriching the continuous stream of data from BigQuery. If you haven't already, create a Kafka cluster from the console.

6) Then create a topic in the cluster you created above, using a name, e.g., “bq_continuous_queries_test”. 

7) Navigate to Cluster → Connectors and click the Add a connector button at top right. Select “Google Cloud Pub/Sub” Source connector. On the next page, select the bq_continuous_queries_test topic and click Continue.

8) On the page after that, select an API key (if you haven’t created one already, follow steps here), and use it. Click Continue.

9) On the next page, provide your Google Cloud project ID, the Pub/Sub topic name, and the subscription ID (i.e., the default Pub/Sub subscription you created above). Also upload the service account JSON key you created in a previous step. Then click Continue.

10) On the page after that, leave the formats at their defaults and ensure the data format is JSON. Click Continue. Size the connector as needed, review the configuration, and click Launch. In a few seconds, the connector will be in the running state.
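The values entered in steps 7 through 10 can also be thought of as a single connector configuration. The sketch below assembles one as JSON in Python. The property names reflect our recollection of the Confluent Cloud Pub/Sub Source connector settings and should be treated as assumptions to verify against the connector documentation. The connector name, the placeholder credentials, and the subscription name "continuous_query_topic-sub" (the name the Google Cloud console typically gives a default subscription) are illustrative.

```python
import json


def pubsub_source_config(
    name: str,
    kafka_topic: str,
    api_key: str,
    api_secret: str,
    project_id: str,
    pubsub_topic: str,
    subscription: str,
    credentials_json: str,
) -> dict:
    """Collect the values from steps 7-10 into one configuration mapping.

    Property names are assumptions; check them against the connector docs.
    """
    return {
        "name": name,
        "connector.class": "PubSubSource",
        "kafka.auth.mode": "KAFKA_API_KEY",
        "kafka.api.key": api_key,
        "kafka.api.secret": api_secret,
        "kafka.topic": kafka_topic,
        "gcp.pubsub.credentials.json": credentials_json,
        "gcp.pubsub.project.id": project_id,
        "gcp.pubsub.topic.id": pubsub_topic,
        "gcp.pubsub.subscription.id": subscription,
        "tasks.max": "1",
    }


config = pubsub_source_config(
    name="bq-continuous-query-source",           # hypothetical connector name
    kafka_topic="bq_continuous_queries_test",     # Kafka topic from step 6
    api_key="<kafka-api-key>",                    # placeholder, never hard-code secrets
    api_secret="<kafka-api-secret>",              # placeholder
    project_id="my-project",                      # hypothetical project ID
    pubsub_topic="continuous_query_topic",        # Pub/Sub topic from step 3
    subscription="continuous_query_topic-sub",    # assumed default subscription name
    credentials_json="<service-account-key-json>",  # placeholder for the key contents
)
print(json.dumps(config, indent=2))
```

Keeping the configuration in one place like this makes it easy to see that the Pub/Sub topic and the subscription are two separate settings, even though the console walkthrough collects them on the same page.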

11) As soon as the connector is started, head back to the Google Cloud BigQuery Console and run the continuous query you drafted above.

12) Once the query is running, the Pub/Sub topic will receive the initial records in the table, as well as any rows subsequently added to it. Confluent’s managed Pub/Sub connector, which you configured above, will start receiving messages and make them available in the Kafka topic you created.

13) Once the topic starts receiving data, use ksqlDB, Flink, or Kafka Streams to process and enrich it, so you can complete your use case by streaming the events to any supported destination, including applications and services.
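To make the "process and enrich" step more concrete, here is a minimal sketch of the kind of per-record logic such a job might apply. It is written in plain Python purely to show the shape of the transformation; in practice this would be expressed in ksqlDB, Flink, or Kafka Streams as described above. It assumes each message payload is a JSON-encoded row. The field names, the CUSTOMER_TIERS lookup, and the 1000 threshold are hypothetical, and the exact record envelope written by the connector is not covered here.

```python
import json

# Hypothetical reference data; in a real pipeline this might come from
# another Kafka topic or a lookup table.
CUSTOMER_TIERS = {"c-1": "gold", "c-2": "standard"}


def enrich(payload: str, tiers: dict) -> dict:
    """Decode one JSON row and attach derived fields to it."""
    row = json.loads(payload)
    # Look up the customer's tier, falling back to "unknown" when absent.
    row["customer_tier"] = tiers.get(row["customer_id"], "unknown")
    # Flag large orders so downstream consumers can react immediately.
    row["high_value"] = row["amount"] > 1000
    return row


sample = '{"order_id": "o-42", "customer_id": "c-1", "amount": 1250.0}'
enriched = enrich(sample, CUSTOMER_TIERS)
print(enriched)
```

The same pattern, a decode step followed by a lookup and a derived flag, maps naturally onto a stream-table join plus a projection in any of the stream processing options mentioned in step 13.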

Conclusion

Now you can leverage BigQuery continuous queries to stream real-time data directly into Confluent Cloud, enabling you to create solutions that were not nearly as straightforward to make in the past. This opens up a lot of new opportunities for customers of Confluent and Google: By harnessing the power of the combined solution, organizations can unlock the full potential of their data, and gain a competitive edge with real-time, data-driven decision-making, streamlined data pipelines, and improved application performance.

To learn more about BigQuery continuous queries click here. Begin experimenting with Confluent Cloud on the Google Cloud Marketplace today!

  • Dustin Shammo is a Senior Solutions Engineer at Confluent where he focuses on the Google Cloud Partnership. Prior to joining Confluent, Dustin spent time at Google and IBM helping customers solve complex challenges.

  • Jobin George is a Staff Technical Architect at Google, transforming the way Google's key customers and partners work with data. His expertise in large-scale Data & Analytics solutions fuels his thought leadership and innovative technical guidance. He's known for his strategic & collaborative approach, working closely with Google's key customers and partners to understand their unique challenges and architect solutions that drive success.
