Data contracts have been the focus of a fair amount of hype in the data space for some time now, but nailing down what data contracts actually are in the real world hasn’t always been so clear. On the software engineering side of the fence there is a long and growing history of APIs as the contracts that enable different systems to communicate reliably: API specifications such as OpenAPI, data serialization formats such as Avro and Protobuf, schema registries for event streaming systems, and established practices for schema evolution. So the question is: why do data teams need data contracts when APIs seem to suffice for everyone else?
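To make that concrete, here is a minimal sketch of a schema acting as the contract between a producer and a consumer. The Order record, its fields, and the use of the fastavro library are illustrative assumptions, not something prescribed by any particular team:

```python
# Illustrative only: a hypothetical Order schema shared by producer and consumer.
from io import BytesIO
from fastavro import parse_schema, schemaless_writer, schemaless_reader

order_v1 = parse_schema({
    "type": "record",
    "name": "Order",
    "namespace": "com.example.orders",
    "fields": [
        {"name": "order_id", "type": "string"},
        {"name": "amount", "type": "double"},
    ],
})

# The producer serializes against the agreed schema...
buf = BytesIO()
schemaless_writer(buf, order_v1, {"order_id": "o-123", "amount": 42.5})

# ...and the consumer deserializes against that same schema.
buf.seek(0)
print(schemaless_reader(buf, order_v1))  # {'order_id': 'o-123', 'amount': 42.5}
```

Both sides depend on the schema rather than on each other’s internals, and that shared dependency is the essence of the contract.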
Chad Sanderson wrote an excellent article about the rise of data contracts and had some great insights that might help us answer this question. For my part, having worked in both software and data teams, but more in software than data, some of the things that Chad highlighted really struck me.
Chad points out that the intent of data contracts is more than just the provision of an API – it’s a contract between the software engineers who are the data producers and the data team who are the consumers. It’s an agreement between humans that aims to prevent upstream changes from breaking downstream dependencies.
But data teams are not the only ones who build against other people’s systems – this is the bread and butter of software engineering in the 21st century. The microservices and API revolutions have already been and gone, fundamentally changing the nature of software development. Software engineers don’t talk about the need for human-to-human contracts – instead they talk about APIs, with strategies for permitting change without breaking dependent systems: versioning, forward and backward schema compatibility, permissive consumers… the list goes on. We’re not without our disagreements, of course; put any two engineers in a room and ask them to agree on how to correctly version REST APIs and you’ll see.
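As a sketch of what backward and forward compatibility look like in practice (again using the hypothetical Order schema and fastavro purely for illustration), adding a field with a default value is a change that breaks neither old consumers nor old producers:

```python
# Illustrative schema evolution: v2 adds a field with a default, so neither
# producers nor consumers on the other version break.
from io import BytesIO
from fastavro import parse_schema, schemaless_writer, schemaless_reader

order_v1 = parse_schema({
    "type": "record", "name": "Order", "namespace": "com.example.orders",
    "fields": [
        {"name": "order_id", "type": "string"},
        {"name": "amount", "type": "double"},
    ],
})

order_v2 = parse_schema({
    "type": "record", "name": "Order", "namespace": "com.example.orders",
    "fields": [
        {"name": "order_id", "type": "string"},
        {"name": "amount", "type": "double"},
        {"name": "currency", "type": "string", "default": "USD"},  # new field, with default
    ],
})

# Backward compatibility: a v2 consumer reads a record written with v1;
# the missing field is filled in from the default.
buf = BytesIO()
schemaless_writer(buf, order_v1, {"order_id": "o-1", "amount": 10.0})
buf.seek(0)
print(schemaless_reader(buf, order_v1, reader_schema=order_v2))
# {'order_id': 'o-1', 'amount': 10.0, 'currency': 'USD'}

# Forward compatibility: a consumer still on v1 reads a record written with v2;
# the unknown field is simply skipped during schema resolution.
buf = BytesIO()
schemaless_writer(buf, order_v2, {"order_id": "o-2", "amount": 20.0, "currency": "EUR"})
buf.seek(0)
print(schemaless_reader(buf, order_v2, reader_schema=order_v1))
# {'order_id': 'o-2', 'amount': 20.0}
```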
The point is that this is a solved problem, so why are data engineers going on about the need for contracts that go beyond the API itself? The reality is that software teams require more than just APIs too; it’s just that software teams are expected to collaborate with each other in order to deliver working software to the business. Software teams need each other in order to meet department goals, and so they coordinate.
Not all software teams collaborate; sometimes one team depends on a third-party system, and in those cases the API is an actual product of the third party. APIs that are literally commercial products come with all kinds of additional built-in assumptions beyond the schema of the API itself: things like SLAs, good documentation, and general API design stability. An API that is constantly introducing breaking changes that require versioning makes for a poor-quality product.
Historically, most software engineers haven’t been aware of the data team. The data team exists in a different plane of existence and software developers don’t need them to get their job done. Their services don’t depend on data team pipelines to function. Data teams are simply out of the loop and software teams have no reason to build APIs or event streams for them.
So what do data teams do in the age of Fivetran and CDC? They simply negotiate access to the database and suck data straight from there.
While connecting ELT/CDC tools to a production database seems like a great way to load important data easily and quickly, it inevitably treats the database schema as a non-consensual API. The engineers often never agreed (or even wanted) to provide this data to consumers. Without that agreement in place, the data becomes untrustworthy. Engineers want to be free to change their data to serve whatever operational use cases they need. No warning is given because the engineer doesn’t know that a warning should be given or why.
This is what is driving the rise of data contracts. Breaking encapsulation was never the goal, and it certainly wasn’t desirable; it was simply a by-product of the need to build pipelines quickly without fighting battles with software teams and their managers to get suitable APIs built. The trouble is, the house of cards only stands up for so long, and before you know it your data team is spending most of its time fighting the fires of bad data and broken pipelines rather than actually delivering insights.
The concept of loose coupling through abstraction has been with us for decades and is part of the fabric of software engineering culture. An API or an event stream with a negotiated schema is that abstraction. Today we have strategies for enforcing schemas as well as evolving them without breaking data producers or data consumers. Data contracts are about introducing this loose coupling through abstraction to data architectures. But more than that, they are about getting software and data teams talking, collaborating, and achieving business goals together.
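Enforcing that negotiated schema is typically where a schema registry comes in. The sketch below is illustrative only: it assumes a Confluent Schema Registry reachable at localhost:8081 and a hypothetical subject named orders-value, and it uses the registry’s REST endpoints to require backward compatibility and to test a proposed change before anything is deployed:

```python
# Illustrative only: assumes a Schema Registry at localhost:8081 and a
# hypothetical subject "orders-value"; paths follow the Schema Registry REST API.
import json
import requests

REGISTRY = "http://localhost:8081"
SUBJECT = "orders-value"
HEADERS = {"Content-Type": "application/vnd.schemaregistry.v1+json"}

# Require that any new schema for this subject can read data written with the
# previous version (BACKWARD compatibility).
requests.put(
    f"{REGISTRY}/config/{SUBJECT}",
    json={"compatibility": "BACKWARD"},
    headers=HEADERS,
).raise_for_status()

proposed_schema = {
    "type": "record", "name": "Order",
    "fields": [
        {"name": "order_id", "type": "string"},
        {"name": "amount", "type": "double"},
        {"name": "currency", "type": "string", "default": "USD"},
    ],
}

# Ask the registry whether the proposed change is compatible with the latest
# registered version before anything ships.
resp = requests.post(
    f"{REGISTRY}/compatibility/subjects/{SUBJECT}/versions/latest",
    json={"schema": json.dumps(proposed_schema)},
    headers=HEADERS,
)
resp.raise_for_status()
print(resp.json())  # e.g. {"is_compatible": true}
```

Wired into CI, a check like this turns the contract from a verbal agreement into something that fails the build before it can be broken.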
The trouble is, software teams don’t want to build APIs or event streams for the sole use of the data team—they already have a backlog as long as your arm. How do you compel a team to listen? Why would a software team put aside their own backlog of tasks to offer such an API?
The answer is not to treat the data team as an isolated group of specialists but instead one more team among many that needs to collaborate in order to achieve business goals. Data teams and software teams need to collaborate because it is valuable for the business for them to do so.
If software team A needs data from software team B’s service, what team A doesn’t do is just read from team B’s database directly. As discussed already, this is just not done. Instead people talk, things get prioritized, and some kind of API or event stream is set up. The same must be true for data teams. There may be organizational headwinds caused by data and software teams coming under different leadership, but the effective organizations will be those that can make this collaboration happen.
So yes, APIs are needed but the most challenging and equally important puzzle piece is getting software and data teams aligned and working together – without that nothing will change.