The divide between operational and analytical systems has long resulted in data inconsistencies, unreliability, and redundancies.
Without a single, unified source of truth, teams interpret information in their own ways—often after the fact. This can lead to downstream data discrepancies, quality issues, and distrust.
Meanwhile, changes to upstream data structures create ripple effects, breaking downstream systems and requiring manual intervention to fix issues. This unreliability prevents the use of an organization’s data for critical processes and applications, reducing its value.
Furthermore, when data is needed in multiple systems, both the problems and the work to resolve them are duplicated, further compounding inefficiencies.
If you’re going to meet the data needs of your organization, and fuel the latest advancements in ML and AI to drive revenue and operational efficiency, you’ll need to bridge this divide between operational and analytical systems.
You do this by shifting some of these problems left, where they can be more easily and more cheaply resolved. And, once resolved, the results are available to everyone downstream.
You use data contracts to enable this shift left.
Shift left is a concept borrowed from software development, where tasks traditionally performed later in the development lifecycle are moved earlier (or "leftward"), to improve efficiency and reduce risk. For example, testing is now done much earlier in the software development lifecycle by software engineers through unit testing, integration testing, and continuous integration, rather than handed off to a separate team.
In the context of data, shifting left means addressing data quality, structure, and governance at the point of creation, rather than reacting to issues after they propagate downstream.
By shifting left, you ensure that data is well structured, governed, and validated before it enters analytical and operational systems. This proactive approach minimizes downstream errors, reduces maintenance costs, and improves trust in your data.
To achieve this, you need to address people, processes, and technology, in order to change how data is treated across your organization.
Let’s look at each of these in turn.
When considering how you can enable a shift to the left, you need to look at the people and teams involved in the production of your data, and the changes those teams will need to make.
In many organizations, much of the most important data is produced by applications, owned by application engineering teams. Other application engineering teams then consume that data to build operational services. Data engineering teams also use it—typically combined with other datasets—to produce data for analytical services.
If the data from an application engineering team is of poor quality, each of its consumers needs to take action to resolve those problems. This adds cost through duplicated effort and inconsistencies, since the remediation logic differs slightly in each downstream consumer.
Furthermore, any unmanaged changes in the data from the application—for example, changes to the schema—impact each of the downstream consumers. Again, this leads to high costs, as each downstream team has to implement a fix in its applications.
If these issues are happening regularly, consumers will lose trust in the data, and avoid using it in critical processes and applications, diminishing its value to the organization.
This is why you need to shift the responsibility left, to avoid unnecessary downstream costs, and to unlock opportunities earlier in the data’s lifecycle. By doing so, you ensure that data can be used more widely throughout your organization, jump-starting valuable use cases—including those that utilize ML and AI.
If application engineers are going to take on this responsibility, they need to understand why it’s important to the organization. You can highlight its importance by calculating the ongoing costs. One example of this would be to record the number of incidents caused by poor quality data, and add up the costs each downstream team had to spend in time and effort to remediate each incident. You can also look at the business opportunities that depend on this data, and calculate the amount of revenue they are expected to drive for the organization.
Once you have determined these costs, you can use the metrics as justification for the necessary organizational and team changes to shift data quality left, addressing issues at the source to reduce costs and improve outcomes downstream (to the right).
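As a rough sketch of the incident-cost metric described above (all incident records and the hourly rate here are hypothetical, not from the source), the calculation might look like this:

```python
# Hypothetical incident log: hours each downstream team spent
# remediating incidents caused by poor-quality upstream data.
incidents = [
    {"teams_affected": 3, "hours_per_team": 16},
    {"teams_affected": 5, "hours_per_team": 8},
    {"teams_affected": 2, "hours_per_team": 24},
]

HOURLY_COST = 120  # assumed blended engineering rate, in dollars

def remediation_cost(incidents, hourly_cost):
    """Total cost of downstream fixes; the work is duplicated in
    every affected team, which is exactly what shifting left avoids."""
    return sum(
        i["teams_affected"] * i["hours_per_team"] * hourly_cost
        for i in incidents
    )

print(remediation_cost(incidents, HOURLY_COST))  # -> 16320
```

Pairing a figure like this with the expected revenue of blocked use cases gives you a concrete business case for the shift.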
Beyond the application engineering and data engineering teams, there’s another team that is key to enabling a shift-left approach: the data platform team. The primary role of the data platform team is to provide the technology that enables this shift, as we’ll discuss later in the “Technology” section of this blog. But they also play a crucial secondary role: facilitating discussions between application engineering and data engineering.
Data platform teams are generally made up of people who have a good understanding of both software engineering and data engineering. They leverage this expertise to build tools that enable both groups to move data around reliably, and at scale. But that expertise can also be used to help bridge the communication gap between application engineering and data engineering, bringing these teams closer together, despite their typically operating in different parts of the business.
Once you establish an agreement that shifting left is the best approach for your organization, you can start to think about the processes you need to enable that shift—data products and data contracts.
Let’s discuss these next.
Now that you understand the importance of shifting left and the role that different teams play in achieving this goal, you need to establish processes that make this approach the standard for sharing data.
These processes revolve around data products and data contracts.
A data product is a trustworthy, purpose-built dataset designed for sharing and reuse across teams. Data products are immediately useful to their intended consumers, and are built to follow a standard that ensures consistency. While valuable on their own, they can also be easily combined and enriched to create additional data products for multiple use cases.
High-quality data products encourage consistency, by providing a standardized dataset that all teams can work from, rather than pulling and processing data in silos. This reduces discrepancies, enhances trust in the data, and eliminates redundant efforts in cleaning and transformation.
To shift left, not only should the data engineers apply this product mindset to their data, but the application engineers should also apply the same mindset to the data they create. Application engineers too need to understand why their data is important, who will consume it, and how it meets those consumers’ needs. The data they create should be immediately useful, with its standardization ensuring that data engineers can seamlessly integrate upstream data products into those designed for analytical use.
Each data product is associated with a set of metadata that describes it and simplifies its use. At a minimum, this includes the owner and the schema, but it can also include additional details that ease management and consumption.
This metadata needs to be managed effectively and made available to other systems. That’s what the data contract does.
A data contract is a human- and machine-readable document that captures this metadata. It acts as a record of agreements between data producers and consumers, and includes any additional metadata supporting the creation, management, and use of the data, including:
Data owner
Schema definitions
Documentation
Data categorization
Service-level agreements (SLAs)
Versioning and change management policies
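To make this concrete, here is a sketch of what such a contract might contain. The field names and layout are illustrative only, not a Confluent-defined format:

```yaml
# Illustrative data contract for a hypothetical "orders" data product.
dataProduct: orders
owner: checkout-team@example.com         # data owner
description: "One event per completed customer order."
classification: internal                 # data categorization
schema:                                  # schema definition (simplified)
  type: record
  name: Order
  fields:
    - {name: order_id, type: string}
    - {name: amount,   type: double}
sla:
  freshness: "< 5 minutes"
  availability: "99.9%"
versioning:
  current: 2.1.0
  compatibility: BACKWARD                # change-management policy
```

Because the document is both human- and machine-readable, the same file can be reviewed by producers and consumers and consumed by tooling.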
Since the data contract defines the data product, it must be owned by the data product owner—whether in application engineering or data engineering. Only they have the full context needed to populate it, and the ability to meet the agreements it defines.
You can think of the data contract as “wrapping” the data product, capturing the metadata that describes it, and allowing that metadata to flow with the data product throughout its lifecycle, ensuring it is applied at each stage.
Being human readable, the data contract is the perfect place to capture agreements between data producers and consumers. For example, it can codify the SLAs for the data product, with any changes to those SLAs being tracked over time.
At the same time, its machine-readable format allows the data platform team to use the data contract to create tooling and configure services, including the implementation of interfaces, as we’ll see in the next section.
With people aligned and processes defined, we can now think about the technology needed to enable this shift to the left.
Imagine a system where well-structured data products exist in one place, readily accessible across various platforms. This data should be available when needed, and in the format required—be it as a stream or in a structured table. Different stakeholders, including application engineers, data engineers, and analytics and operations teams, should be able to access and use the data seamlessly.
This is the system a data platform team can build with the Confluent data streaming platform and data contracts.
You start by defining the human- and machine-readable data contract which, as mentioned earlier, is owned by the data product owner, whether that sits in application engineering or data engineering.
This data contract is then published to the Confluent Schema Registry, which is the interface from which operational services can consume the data stream. Some tooling may be needed here, such as a continuous deployment (CD) task that automatically publishes data contract updates to the Schema Registry. This tooling is provided by the data platform team, which is evolving from solely maintaining infrastructure like Apache Kafka® and data lakehouses to enabling data producers and influencing their behavior through effective tooling.
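As a sketch of what such a CD task might do, the following registers a schema with Schema Registry's REST API under the topic's value subject. The registry URL, topic name, and schema are placeholders:

```python
import json
import urllib.request

SCHEMA_REGISTRY_URL = "http://localhost:8081"  # placeholder endpoint

def subject_for(topic: str) -> str:
    # Default TopicNameStrategy: a topic's value schema is
    # registered under the subject "<topic>-value".
    return f"{topic}-value"

def register_schema(topic: str, schema_str: str, schema_type: str = "AVRO") -> int:
    """POST the schema to Schema Registry and return its global schema ID."""
    body = json.dumps({"schema": schema_str, "schemaType": schema_type}).encode()
    req = urllib.request.Request(
        f"{SCHEMA_REGISTRY_URL}/subjects/{subject_for(topic)}/versions",
        data=body,
        headers={"Content-Type": "application/vnd.schemaregistry.v1+json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["id"]
```

Running this on every merged change to the contract keeps the registry, and therefore every consumer-facing interface, in sync with the agreed contract.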
Once the data contract is in the Schema Registry, it can be applied to a Kafka topic. This Kafka topic is an interface through which data can be consumed. Because the interface is driven by the data contract, it matches the agreements made between data producers and consumers, and continues to reflect those agreements as they evolve.
Having data flow through Kafka is ideal for operational use cases that can be designed to work with event streams, but is less suited for batch-based, analytical services. For these use cases, you can make the same data available through a tabular interface, maintaining the schema and metadata captured in the data contract, by using Tableflow.
With data flowing through Kafka and readily available in tabular interfaces, teams can also shift some processing left to transform, enrich, or combine data products, by using stream processing tools like Confluent Cloud for Apache Flink®. This approach enables the creation of contextual, readily usable data products wherever needed.
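To illustrate the kind of enrichment that can be shifted left, here is a plain-Python toy (not actual Flink SQL) that joins an order stream with a customer data product as events arrive; all names and records are hypothetical:

```python
# The upstream "customers" data product, used here as a lookup table.
# In Confluent Cloud for Apache Flink this would be a streaming join.
customers = {
    "c1": {"segment": "enterprise"},
    "c2": {"segment": "self-serve"},
}

def enrich(order_stream):
    """Yield each order enriched with the customer's segment."""
    for order in order_stream:
        customer = customers.get(order["customer_id"], {})
        yield {**order, "segment": customer.get("segment", "unknown")}

orders = [{"order_id": "o1", "customer_id": "c1", "amount": 99.0}]
print(list(enrich(orders)))
# -> [{'order_id': 'o1', 'customer_id': 'c1', 'amount': 99.0, 'segment': 'enterprise'}]
```

Doing this once, close to the source, yields an enriched data product every downstream consumer can reuse instead of each re-implementing the join.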
This gives you a complete data streaming platform serving both the operational and analytical use cases, all powered by the same high-quality data products from the application engineers on the left, through interfaces managed by the data contract.
This is just scratching the surface of the data-contract capabilities offered by the Confluent Schema Registry. It can do much more, including implementing data-quality checks within a stream (and by extension, on a data lakehouse), and defining migration rules. Read Using Data Contracts with Confluent Schema Registry to learn more.
By shifting left with data contracts, you fundamentally change how data is managed—improving quality, reliability, and usability across your application engineering, data engineering, and data platform teams. Instead of reacting to data issues downstream, you prevent them at the source—where data is created—by ensuring clear ownership, standardized datasets, and effective change management.
With data contracts, application engineers can take responsibility for the quality of the data they produce, ensuring it is structured and governed before it enters operational and analytical systems. Data engineers, along with other application engineers, benefit from standardized, high-quality data products that reduce duplication and inconsistencies.
Meanwhile, data platform teams play a crucial role in enabling this shift by providing the infrastructure, tooling, and governance needed to support seamless data sharing and enforce contracts.
The Confluent data streaming platform and Schema Registry enable this approach, ensuring that data is not only reliable, but also easily discoverable and reusable across the organization.
By aligning the application engineering, data engineering, and data platform teams, organizations can eliminate inefficiencies, build trust in their data, and unlock new opportunities for AI, machine learning, and data-driven decision-making.
Apache®, Apache Kafka®, Kafka®, Apache Flink®, and Flink® are registered trademarks of the Apache Software Foundation in the United States and/or other countries. No endorsement by the Apache Software Foundation is implied by using these marks. All other trademarks are the property of their respective owners.