Ask any data engineer about their biggest pain point, and the answer is almost universally the same: Upstream Schema Drift.
Your pipeline is running perfectly, your dbt tests are all green, and then, at 3 AM, your production dashboard starts showing weird results. It turns out the microservice team responsible for the events stream decided to rename the column `user_id` to `customer_uid` and deprecate the old field, without telling you.
You were the last to know.
The Contract is a Promise
This failure is an organizational issue, not a technical one. The source application team made a breaking change without adhering to a Data Contract.
A Data Contract is a formal, machine-readable agreement between the data producer (the application team) and the data consumer (the data platform team). It defines four things:
- Schema: The exact field names, types, and nullability (e.g., `user_id` must be a non-null UUID).
- Quality: Semantic rules (e.g., `revenue` must be greater than zero).
- SLA: Latency and completeness promises.
- Ownership: Who is responsible for maintaining the contract.
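To make the four parts concrete, here is a minimal sketch of a contract as a machine-readable object, with a validator a consumer could run against incoming messages. The dict layout, field names, and `validate` helper are illustrative assumptions, not any particular contract specification:

```python
import uuid

# Hypothetical contract for the "events" stream: schema (names, types,
# nullability), a semantic quality rule, an SLA figure, and an owner.
EVENTS_CONTRACT_V1 = {
    "schema": {
        "user_id": {"type": "uuid", "nullable": False},
        "revenue": {"type": "float", "nullable": False},
    },
    "quality": {
        "revenue": lambda v: v > 0,  # revenue must be greater than zero
    },
    "sla": {"max_latency_seconds": 300},
    "owner": "events-service-team",
}

def validate(message: dict, contract: dict) -> list[str]:
    """Return a list of contract violations (empty list means valid)."""
    errors = []
    for field, spec in contract["schema"].items():
        value = message.get(field)
        if value is None:
            if not spec["nullable"]:
                errors.append(f"{field}: missing or null")
            continue
        if spec["type"] == "uuid":
            try:
                uuid.UUID(str(value))
            except ValueError:
                errors.append(f"{field}: not a valid UUID")
        elif spec["type"] == "float" and not isinstance(value, (int, float)):
            errors.append(f"{field}: expected a number")
    for field, rule in contract["quality"].items():
        if message.get(field) is not None and not rule(message[field]):
            errors.append(f"{field}: failed quality rule")
    return errors
```

A message that drops `user_id` or carries a non-positive `revenue` yields a non-empty error list, which is exactly the signal an enforcement layer needs.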
Shifting Quality Left
The data platform team should not be responsible for cleaning up the garbage generated by upstream services. We need to push the quality check left, to the source.
We enforce our contracts using a Schema Registry (like Confluent or an open-source alternative) combined with our ingestion pipeline (Kafka).
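Conceptually, the registry's deploy-time job is a compatibility check between the currently registered schema and the proposed one. Below is a deliberately simplified sketch (not Confluent's actual compatibility modes): an existing field may not be removed, renamed, or change type, and any new field must be nullable so existing producers keep working:

```python
def is_compatible(old: dict, new: dict) -> bool:
    """Simplified schema-compatibility rule for dict-based schemas."""
    for field, spec in old.items():
        if field not in new:
            return False  # field removed or renamed -> breaking change
        if new[field]["type"] != spec["type"]:
            return False  # type changed -> breaking change
    for field, spec in new.items():
        if field not in old and not spec["nullable"]:
            return False  # new required field breaks existing producers
    return True
```

Under this rule, renaming `user_id` to `customer_uid` fails the gate, while adding an optional field passes, which is precisely the behavior that would have prevented the 3 AM incident.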
Here is the operational workflow:
- Contract Definition: The application team defines their `events` contract in a versioned YAML file (v1.0.0).
- Validation Gate: Before the application's service can deploy, the contract must pass validation against the Schema Registry.
- Real-Time Enforcement: At runtime, the application attempts to publish data to a Kafka topic. The Kafka broker (via the Schema Registry) checks every incoming message against the current contract version.
- Fail Fast: If the service publishes a message that violates the contract (e.g., the `user_id` field is missing), the message is rejected at the broker level. The application throws an error immediately, forcing the service team to fix their code before the invalid data pollutes the data warehouse.
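The fail-fast step can be sketched as a `publish()` wrapper that raises on any violation. In a real deployment the Kafka serializer and the Schema Registry do this at the broker boundary; this in-process stand-in (the `ContractViolation` exception and `publish` function are hypothetical names) just shows the producing service erroring out instead of shipping bad data:

```python
import uuid

class ContractViolation(Exception):
    """Raised when a message fails its contract, mimicking broker rejection."""

# Illustrative contract for the "events" topic.
CONTRACT = {
    "user_id": {"type": "uuid", "nullable": False},
    "revenue": {"type": "float", "nullable": False},
}

def publish(topic: str, message: dict, contract: dict = CONTRACT) -> None:
    for field, spec in contract.items():
        value = message.get(field)
        if value is None:
            if not spec["nullable"]:
                raise ContractViolation(f"{topic}: required field '{field}' missing")
            continue
        if spec["type"] == "uuid":
            try:
                uuid.UUID(str(value))
            except ValueError:
                raise ContractViolation(f"{topic}: '{field}' is not a UUID")
        elif spec["type"] == "float" and not isinstance(value, (int, float)):
            raise ContractViolation(f"{topic}: '{field}' must be a number")
    # Only contract-clean messages reach the real producer from here.
    print(f"accepted on {topic}")
```

Because the exception surfaces in the producing service's own process, the service team sees the failure the moment they ship the breaking change, not after the warehouse is already polluted.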
The Engineering Dividend
This system creates a clear boundary of ownership:
- Application Team: Responsible for ensuring the data they produce adheres to the contract.
- Data Team: Responsible for transforming and governing the data after it has been successfully admitted into the lake.
