[Webinar] Build Your GenAI Stack with Confluent and AWS | Register Now
If you are getting started with Kafka one thing you’ll need to do is pick a data format. The most important thing to do is be consistent across your usage. Any format, be it XML, JSON, or ASN.1, provided it is used consistently across the board, is better than a mishmash of ad hoc choices.
But if you are starting fresh with Kafka, you’ll have the format of your choice. So which is best? There are many criteria here: efficiency, ease of use, support in different programming languages, and so on. In our own use we have found Apache Avro to be one of the better choices for stream data.
Avro is an open source data serialization system that helps with data exchange between systems, programming languages, and processing frameworks. Avro helps define a binary format for your data, as well as map it to the programming language of your choice.
Confluent Platform works with any data format you prefer, but we added some special facilities for Avro because of its popularity. In the rest of this document I’ll go through some of the reasons why.
Avro has a JSON like data model, but can be represented as either JSON or in a compact binary form. It comes with a very sophisticated schema description language that describes data.
We think Avro is the best choice for a number of reasons:
Though it may seem like a minor thing handling this kind of metadata turns out to be one of the most critical and least appreciated aspects in keeping data high quality and easily useable at organizational scale.
One of the critical features of Avro is the ability to define a schema for your data. For example an event that represents the sale of a product might look like this:
{ "time": 1424849130111, "customer_id": 1234, "product_id": 5678, "quantity":3, "payment_type": "mastercard" }
It might have a schema like this that defines these five fields:
{ "type": "record", "doc":"This event records the sale of a product", "name": "ProductSaleEvent", "fields" : [ {"name":"time", "type":"long", "doc":"The time of the purchase"}, {"name":"customer_id", "type":"long", "doc":"The customer"}, {"name":"product_id", "type":"long", "doc":"The product"}, {"name":"quantity", "type":"int"}, {"name":"payment", "type":{"type":"enum", "name":"payment_types", "symbols":["cash","mastercard","visa"]}, "doc":"The method of payment"} ] }
A real event, of course, would probably have more fields and hopefully better doc strings, but this gives their flavor.
Here is how these schemas will be put to use. You will associate a schema like this with each Kafka topic. You can think of the schema much like the schema of a relational database table, giving the requirements for data that is produced into the topic as well as giving instructions on how to interpret data read from the topic.
The schemas end up serving a number of critical purposes:
The value of schemas is something that doesn’t become obvious when there is only one topic of data and a single application doing reading and writing. However when critical data streams are flowing through the system and dozens or hundreds of systems depend on this, simple tools for reasoning about data have enormous impact.
But first, you may be asking why we need schemas at all? Isn’t the modern world of big data all about unstructured data, dumped in whatever form is convenient, and parsed later when it is queried?
I will argue that schemas—when done right—can be a huge boon, keep your data clean, and make everyone more agile. Much of the reaction to schemas comes from two factors—historical limitations in relational databases that make schema changes difficult, and the immaturity of much of the modern distributed infrastructure which simply hasn’t had the time yet to get to the semantic layer of modeling done.
Here is the case for schemas, point-by-point.
One of the primary advantages of this type of architecture where data is modeled as streams is that applications are decoupled. Applications produce a stream of events capturing what occurred without knowledge of which things subscribe to these streams.
But in such a world, how can you reason about the correctness of the data? It isn’t feasible to test each application that produces a type of data against each thing that uses that data, many of these things may be off in Hadoop or in other teams with little communication. Testing all combinations is infeasible. In the absence of any real schema, new producers to a data stream will do their best to imitate existing data but jarring inconsistencies arise—certain magical string constants aren’t copied consistently, important fields are omitted, and so on.
Worse, the actual meaning of the data becomes obscure and often misunderstood by different applications because there is no real canonical documentation for the meaning of the fields. One person interprets a field one way and populates it accordingly and another interprets it differently.
Invariably you end up with a sort of informal plain english “schema” passed around between users of the data via wiki or over email which is then promptly lost or obsoleted by changes that don’t update this informal definition. We found this lack of documentation lead to people guessing as to the meaning of fields, which inevitably leads to bugs and incorrect data analysis when these guesses are wrong.
Keeping an up-to-date doc string for each field means there is always a canonical definition of what that value means.
Schemas also help solve one of the hardest problems in organization-wide data flow: modeling and handling change in data format. Schema definitions just capture a point in time, but your data needs to evolve with your business and with your code. There will always be new fields, changes in how data is represented, or new data streams. This is a problem that databases mostly ignore. A database table has a single schema for all it’s rows. But this kind of rigid definition won’t work if you are writing many applications that all change at different times and evolve the schema of shared data streams. If you have dozens of applications all using a central data stream they simply cannot all update at once.
And managing these changes gets more complicated as more people use the data and the number of different data streams grows. Surely adding a new field is a safe change, but is removing a field? What about renaming an existing field? What about changing a field from a string to a number?
These problems become particularly serious because of Hadoop or any other system that stores the events. Hadoop has the ability to load data “as is” either with Avro or in a columnar file format like Parquet or ORC. Thus the loading of data from data streams can be made quite automatic, but what happens when there is a format change? Do you need to re-process all your historical data to convert it to the new format? That can be quite a large effort when hundreds of TBs of data are involved. How do you know if a given change will require this? Do you guess and wait to see what will break when the change goes to production?
Schemas make it possible for systems with flexible data format like Hadoop or Cassandra to track upstream data changes and simply propagate these changes into their own storage without expensive reprocessing. Schemas give a mechanism for reasoning about which format changes will be compatible and (hence won’t require reprocessing) and which won’t.
We’ve found most people who have implemented a large scale streaming platform without schemas controlling the correctness of data have lead to serious instability at scale. These compatibility breakages are often particularly painful when used with a system like Kafka because producers of events may not even know of all the consumers, so manually testing compatibility can quickly become impossible. We’ve seen a number of companies have gone back and attempted to retrofit some kind of schema and compatibility checking on top of Kafka as the management of untyped data got unmanageable.
I actually buy many arguments for flexible types. Dynamically typed languages have an important role to play. And arguably databases, when used by a single application in a service-oriented fashion, don’t need to enforce a schema, since, after all, the service that owns the data is the real “schema” enforcer to the rest of the organization.
However data streams are different; they are a broadcast channel. Unlike an application’s database, the writer of the data is, almost by definition, not the reader. And worse, there are many readers, often in different parts of the organization. These two groups of people, the writers and the readers, need a concrete way to describe the data that will be exchanged between them and schemas provide exactly this.
It is almost a truism that data science, which I am using as a short-hand here for “putting data to effective use”, is 80% parsing, validation, and low-level data munging. Data scientists complain that their training spent too much time on statistics and algorithms and too little on regular expressions, XML parsing, and practical data munging skills. This is quite true in most organizations, but it is somewhat disappointing that there are people with PhDs in Physics spending their time trying to regular-expression date fields out of mis-formatted CSV data (that inevitably has commas inside the fields themselves).
This problem is particularly silly because the nonsense data isn’t forced upon us by some law of physics, this data doesn’t just arise out of nature. Whenever you have one team whose job is to parse out garbage data formats and try to munge together inconsistent inputs into something that can be analyzed, there is another corresponding team whose job is to generate that garbage data. And once a few people have built complex processes to parse the garbage, that garbage format will be enshrined forever and never changed. Had these two teams talked about what data was needed for analysis and what data was available for capture the entire problem could have been prevented.
The advantage isn’t limited to parsing. Much of what is done in this kind of data wrangling is munging disparate representations of data from various systems to look the same. It will turn out that similar business activities are captured in dramatically different ways in different parts of the same business. Building post hoc transformations can attempt to coerce these to look similar enough to perform analysis. However the same thing is possible at data capture time by just defining an enterprise-wide schema for common activities. If sales occur in 14 different business units it is worth figuring out if there is some commonality among these that can be enforced so that analysis can be done over all sales without post-processing. Schemas won’t automatically enforce this kind of thoughtful data modeling but they do give a tool by which you can enforce a standard like this.
We put this idea of schemafied event data into practice at large scale at LinkedIn. User activity events, metrics data, stream processing output, data computed in Hadoop, and database changes were all represented as streams of Avro events.
These events were automatically loaded into Hadoop. When a new Kafka topic was added that data would automatically flow into Hadoop and a corresponding Hive table would be created using the event schema. When the schema evolved that metadata was propagated into Hadoop. When someone wanted to create a new data stream, or evolve the schema for an existing one, the schema for that stream would undergo a quick review by a group of people who cared about data quality. This review would ensure this stream didn’t duplicate an existing event and that things like dates and field names followed the same conventions, and so on. Once the schema change was reviewed it would automatically flow throughout the system. This leads to a much more consistent, structured representation of data throughout the organization.
Other companies we have worked with have largely come to the same conclusion. Many started with loosely structured JSON data streams with no schemas or contracts as these were the easiest to implement. But over time almost all have realized that this loose definition simply doesn’t scale beyond a dozen people and that some kind of stronger metadata is needed to preserve data quality.
Okay that concludes the case for schemas. We chose Avro as a schema representation language after evaluating all the common options—JSON, XML, Thrift, protocol buffers, etc. We recommend it because it is the best thought-out of these for this purpose. It has a pure JSON representation for readability but also a binary representation for efficient storage. It has an exact compatibility model that enables the kind of compatibility checks described above. It’s data model maps well to Hadoop data formats and Hive as well as to other data systems. It also has bindings to all the common programming languages which makes it convenient to use programmatically.
Good overviews of Avro can be found here and here.
Here are some recommendations specific to Avro:
We have built tools for implementing Avro with Kafka or other systems as part of Confluent Platform. Most of our tools will work with any data format, but we do include a schema registry that specifically supports Avro. This is a great tool for getting started with Avro and Kafka. And for the fastest way to run Apache Kafka, you can check out Confluent Cloud and use the code CL60BLOG for an additional $60 of free usage.*
To start putting Avro into practice, check out the following tutorials:
We covered so much at Current 2024, from the 138 breakout sessions, lightning talks, and meetups on the expo floor to what happened on the main stage. If you heard any snippets or saw quotes from the Day 2 keynote, then you already know what I told the room: We are all data streaming engineers now.
We’re excited to announce Early Access for Confluent for VS Code. This Visual Studio integration streamlines workflows, accelerates development, and enhances real-time data processing, all in a unified environment. This post shows how to get started, and also lists opportunities to get involved.