
Introducing the New Fully Managed BigQuery Sink V2 Connector for Confluent Cloud: Streamlined Data Ingestion and Cost-Efficiency

Written by Dustin Shammo and Jai Vignesh

We’re excited to announce the launch of the new fully managed BigQuery Sink V2 Connector, a significant upgrade to our existing BigQuery Sink (Legacy) Connector. The new connector leverages the Google-recommended Storage Write API, adds OAuth 2.0 support, streams records in parallel through an internal thread pool, and supports multiple data formats. If you’re an existing customer and would like to take advantage of the benefits of the new connector, we recommend reviewing our migration recommendations.

Confluent and Google Cloud have a strategic partnership that combines the leading streaming service based on Apache Kafka® with best-in-class Google Cloud solutions for big data and data in motion computing. With the enhancements in the BigQuery Sink V2 connector, we are aligning our offerings with Google's best practices and recommendations, ensuring our customers benefit from the latest and most performant technologies.

What is the BigQuery Sink V2 connector and the Storage Write API?

BigQuery is Google Cloud's fully managed, petabyte-scale analytics data warehouse that lets you run analytics in near-real time. With the new BigQuery Sink V2 connector, streaming data from Kafka topics to BigQuery becomes more streamlined and optimized for performance.

The Storage Write API is designed to provide a more powerful and cost-effective method for writing data into BigQuery. It uses gRPC, which offers faster data transfer and lower latency than REST over HTTP, offering up to a 50% reduction in the cost of streaming data ingestion. This makes it an ideal solution for use cases that involve large-scale data ingestion, such as real-time analytics, data archiving, and machine learning. By combining streaming ingestion and batch loading into a single high-performance API, this API not only simplifies the data ingestion process but also reduces the total cost of ownership (TCO) for customers. It's a testament to Google Cloud's commitment to making data management simplified, economical, and scalable.

Key features and enhancements

Confluent’s BigQuery Sink V2 connector comes with a host of enhancements designed to optimize your data ingestion process:

  • Improved efficiency: By utilizing Google’s Storage Write API instead of the legacy streaming API, the updated connector combines streaming ingestion and batch loading into a single high-performance API, simplifying the data ingestion process for users.

  • Enhanced security: OAuth 2.0 Authorization Code grant flow is supported when connecting to BigQuery, available when the connector is created via the Cloud Console.

  • Greater scalability: This connector maintains scalability even though it defaults to streaming mode (as opposed to batch mode) because of its internal thread pool that allows it to stream records in parallel. The internal thread pool defaults to 10 threads.

  • Ability to stream from topics: The connector can stream data from a list of topics into corresponding tables in BigQuery.

  • Added time-based table partitioning: The connector supports several time-based table partitioning strategies.

  • Invalid record routing to DLQ: Route invalid records to the DLQ including any records that have the gRPC status code INVALID_ARGUMENT from the BigQuery Storage Write API (KIP-610).

  • Additional support for multiple data formats: The connector supports Avro, JSON Schema, Protobuf, or JSON (schemaless) input data formats. Note that Schema Registry must be enabled to use a Schema Registry-based format.

  • Expanded availability: While the BigQuery Sink V2 connector is available by default only on Confluent Cloud clusters running on Google Cloud, it can also be enabled for clusters on AWS and Azure.
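Taken together, these features translate into a handful of connector settings. The sketch below shows roughly what such a configuration could look like as JSON; the property names and values are illustrative assumptions, not the connector's exact configuration keys, so check the Confluent Cloud connector documentation for the authoritative list:

```python
import json

# Illustrative sketch of a BigQuery Sink V2 connector configuration.
# Property names below are assumptions for demonstration only; consult the
# official Confluent Cloud connector docs for the exact configuration keys.
connector_config = {
    "connector.class": "BigQuerySinkV2",   # hypothetical class name
    "topics": "orders,customers",          # stream from a list of topics
    "input.data.format": "AVRO",           # AVRO, JSON_SR, PROTOBUF, or JSON
    "ingestion.mode": "STREAMING",         # STREAMING (default) or BATCH LOADING
    "partitioning.type": "DAY",            # one time-based partitioning strategy
}

print(json.dumps(connector_config, indent=2))
```

In Confluent Cloud you would supply these values through the Cloud Console or the Confluent CLI rather than writing raw JSON by hand, but the shape of the configuration is the same.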

Steps to migrate from legacy to the V2 connector

Customers using the legacy BigQuery Sink connector are advised not to move their existing pipelines directly to the new connector. Instead, create a test environment and run the new connector against a test pipeline. Once you confirm that there are no backward compatibility issues, move your existing pipelines to the new connector.

Migrating from the legacy BigQuery Sink connector to the new V2 connector involves creating a new topic and running the new connector against a test pipeline. Here's a step-by-step guide on how to do this:

  1. Create a new topic: In your Kafka cluster, create a new topic that will be used for the BigQuery Sink V2 connector. This ensures that there are no topic-related data type and schema issues.

  2. Configure the V2 connector: In the Confluent Cloud Console, configure the new BigQuery Sink V2 connector. Make sure to select the new topic you created in the previous step.

  3. Test the pipeline: Run the new connector against a test pipeline. Monitor the data ingestion into BigQuery to ensure that there are no backward compatibility issues.

  4. Migrate the pipeline: Once you've confirmed that the new connector works as expected, you can start migrating your existing pipelines to use the new connector.

Using a new topic ensures there are no topic-related data type or schema issues. This migration strategy also results in minimal data duplication in the destination table, because the V2 connector only picks up offsets written to the new topic.
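The four steps above can be sketched as a small plan-building function. This is a simplified, stdlib-only illustration; the topic name is a hypothetical example, and in practice you would create the topic and connector through the Cloud Console, CLI, or API:

```python
# Illustrative sketch of the migration flow described above.
# All names (topics, connector settings) are hypothetical examples.

def plan_migration(legacy_topic: str) -> dict:
    """Build a migration plan: a fresh topic keeps legacy offsets out of
    the V2 connector, so only records written to the new topic flow to
    BigQuery (minimal duplication in the destination table)."""
    new_topic = f"{legacy_topic}-v2"            # step 1: create a new topic
    return {
        "create_topic": new_topic,
        "configure_v2_connector": {             # step 2: point V2 at the new topic
            "topics": new_topic,
        },
        "test_pipeline_first": True,            # step 3: verify in a test env
        "cutover_when_verified": True,          # step 4: migrate production
    }

plan = plan_migration("orders")
print(plan["create_topic"])  # orders-v2
```

The key design point is step 1: because the V2 connector consumes only the new topic, its committed offsets never overlap with the legacy connector's, which is what keeps duplication in the destination table minimal.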

In the future, Confluent will provide more details on our migration of hundreds of tasks across production and non-production environments from the Legacy BigQuery Sink Connector to V2.

Demo: Setting up the BigQuery Sink V2 connector for Confluent Cloud

To help you understand how to set up the new V2 connector, we've put together a demo that walks you through the steps. 

Step 1: Launch your Confluent Cloud cluster.

See the Quick Start for Confluent Cloud for installation instructions.

Step 2: Add a connector.

In the left navigation menu, click Connectors. If you already have connectors in your cluster, click + Add connector.

Step 3: Select your connector.

Click the Google BigQuery Sink V2 connector card.

Step 4: Enter the connector details.

  • Be sure you have all the prerequisites completed.

  • An asterisk ( * ) in the Cloud Console designates a required entry.

At the Add Google BigQuery Sink V2 screen, complete the following:

If you’ve already added Kafka topics, select the topic(s) you want to connect from the Topics list.

To create a new topic, click + Add new topic.

Select the way you want to provide Kafka Cluster credentials. Click Continue.

Select the Authentication method.

  • Google Cloud service account (default): Upload your GCP credentials file. This is a Google Cloud service account JSON file with write permissions for BigQuery. For additional details, see Create a Service Account.

  • OAuth 2.0 (Bring your own app): Select this option to use Google OAuth 2.0.

  • For OAuth 2.0 configuration and other details, see Set up Google OAuth.

    • Client ID: Enter the OAuth client ID returned when setting up OAuth 2.0.

    • Client secret: Enter the client secret for the client ID.

In the Project ID field, enter the ID for the Google Cloud project where BigQuery is located.

Enter the name for the BigQuery dataset the connector writes to in the Dataset field.

If using a Service Account, click Continue. If using OAuth 2.0 (Bring your own app), click Authenticate, select your Google account, and choose whether to use your account information for the connector to authenticate with BigQuery.

Allow the OAuth authentication.

Select the Ingestion Mode: STREAMING or BATCH LOADING. Defaults to STREAMING. Review the following table for details about the differences between using BATCH LOADING versus STREAMING mode with the BigQuery API. For more information, see Introduction to loading data.

Note that BigQuery API quotas apply.

Select the Input Kafka record value format (data coming from the Kafka topic): AVRO, JSON_SR (JSON Schema), PROTOBUF, or JSON (schemaless). A valid schema must be available in Schema Registry to use a schema-based message format (for example, Avro, JSON_SR (JSON Schema), or Protobuf). See Schema Registry Enabled Environments for additional information.

Show advanced configurations.

Click Continue.

Step 5: Check the results in BigQuery.

  1. From the Google Cloud Console, go to your BigQuery project.

  2. Query your datasets and verify that new records are being added.
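If you want to script this verification rather than eyeball it, a simple freshness query works well. The sketch below only builds the SQL string; actually executing it requires the google-cloud-bigquery client and credentials, the project/dataset/table names are placeholders, and the `_PARTITIONTIME` pseudo-column assumes an ingestion-time partitioned table:

```python
# Build a BigQuery SQL query that checks for recently ingested rows in the
# destination table. Project, dataset, and table names are placeholders.
def freshness_query(project: str, dataset: str, table: str,
                    minutes: int = 10) -> str:
    """Return SQL counting rows ingested in the last `minutes` minutes.
    Assumes the destination table is ingestion-time partitioned."""
    return (
        f"SELECT COUNT(*) AS recent_rows "
        f"FROM `{project}.{dataset}.{table}` "
        f"WHERE _PARTITIONTIME >= "
        f"TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL {minutes} MINUTE)"
    )

sql = freshness_query("my-project", "my_dataset", "orders")
print(sql)
# With google-cloud-bigquery installed, you could then run:
#   from google.cloud import bigquery
#   rows = bigquery.Client().query(sql).result()
```

A non-zero `recent_rows` count confirms the connector is actively writing; a zero count after the pipeline has been running for a few minutes is a signal to check the connector's status and DLQ.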

Learn more and try out the new BigQuery Sink V2 connector

In conclusion, the launch of the new BigQuery Sink V2 connector is a major milestone in our partnership with Google Cloud. This upgraded connector offers improved efficiency, cost-effectiveness, and flexibility over its predecessor. Whether you're a current user of the legacy BigQuery Sink connector or new to Confluent, we believe the BigQuery Sink V2 connector will provide immense value to your data ingestion process.

We encourage you to try out the new connector and experience it for yourself. You can learn more on how to get started in the documentation and release notes. We are excited about the possibilities this new connector brings and look forward to seeing how our customers leverage it to streamline their data ingestion processes.

Want to see the V2 BigQuery sink connector in action? Join our upcoming demo webinar to learn how you can build a real-time customer 360 solution with Confluent and Google BigQuery.

  • Dustin Shammo is a Senior Solutions Engineer at Confluent where he focuses on the Google Cloud Partnership. Prior to joining Confluent, Dustin spent time at Google and IBM helping customers solve complex challenges.

  • Jai Vignesh began his career as a developer before transitioning to a product manager (PM). As a PM, he has led the roadmap for multiple products, delighting customers and driving product success. Currently, Jai is part of the Confluent Connect PM team. In this role, he focuses on ensuring seamless connectivity between Confluent and various databases, data warehouses, and SaaS applications.
