Data Quality & Validation › Data Contracts & SLAs (Hard, ~3 min)

Failure Modes and Edge Cases

Silent Semantic Corruption: The most dangerous failure is when schemas remain technically compatible but semantics change. A payment system adds refund support and starts sending negative amount values. The contract validated only the type (integer), not the allowed range. Dashboards misinterpret refunds as revenue drops, triggering false alerts and bad business decisions.
❗ Remember: Schema validation alone is insufficient. Add semantic checks (value ranges, enum constraints) and distribution-level monitoring (percentile shifts, cardinality changes).
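As a concrete illustration, here is a minimal Python sketch of such semantic checks layered on top of a type check. The `validate_payment` helper and its field names (`type`, `amount`) are hypothetical, chosen to match the refund scenario above, not part of any real contract tooling:

```python
def validate_payment(event: dict) -> list[str]:
    """Return a list of contract violations; an empty list means the event passes."""
    violations = []
    # Schema-level check: type only (all the original contract verified).
    if not isinstance(event.get("amount"), int):
        return ["amount must be an integer"]
    # Semantic checks the contract was missing: enum constraint and value range.
    if event.get("type") not in {"charge", "refund"}:
        violations.append("type must be 'charge' or 'refund'")
    elif event["type"] == "charge" and event["amount"] < 0:
        violations.append("charge amount must be non-negative")
    return violations

# A refund encoded as a negative charge now fails loudly instead of
# silently dragging revenue dashboards down.
print(validate_payment({"type": "charge", "amount": -500}))
# → ['charge amount must be non-negative']
```

Distribution-level monitoring (percentile shifts, cardinality changes) would sit alongside this as a separate batch job, since it needs a window of events rather than a single one.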
Versioning Complexity in Practice

A producer rolls out a new version that adds optional fields and changes default values. It is technically backward compatible, but a downstream job that assumed non-null fields now crashes. Worse, during migration the producer writes both old and new schema versions simultaneously for two weeks, so downstream systems receive mixed events in the same time window and must handle both formats. Real example: an event stream added an optional session_id field. Old consumers ignored it; new consumers relied on it for deduplication. During the transition, 30 percent of events had nulls, causing new consumers to create duplicate records. The fix required a backfill and dual-read logic.

SLA Gaps: Regional and Disaster Recovery

A contract promises p95 freshness of 5 minutes for a user_actions topic, but only within a single region. Global aggregations that rely on cross-region replication see 30-minute p95 lag. Consumers in Europe expect 5-minute freshness based on the contract but experience latency 6x worse.
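The dual-read logic from the session_id example above might be sketched in Python as follows; the composite fallback key (`user_id`, `action`, `ts`) is an assumption for illustration, not the actual fix used:

```python
def dedup_key(event: dict) -> tuple:
    """Key on session_id when present (new schema); otherwise fall back to a
    composite key so old-schema events with null session_id do not all pass
    through as 'unique' and create duplicates."""
    if event.get("session_id") is not None:
        return ("session", event["session_id"])
    # Hypothetical fallback fields for events written under the old schema.
    return ("composite", event["user_id"], event["action"], event["ts"])

def deduplicate(events: list[dict]) -> list[dict]:
    seen, unique = set(), []
    for event in events:
        key = dedup_key(event)
        if key not in seen:
            seen.add(key)
            unique.append(event)
    return unique
```

The design point is that a consumer must keep both key strategies alive for the entire mixed-format window, not just until its own deploy finishes.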
Regional SLA Reality: same region, 5-minute p95 freshness; cross region, 30 minutes.
Disaster recovery scenarios worsen this. During regional failover, do SLAs relax? By how much? Most contracts fail to specify this. A system promising 99.9 percent availability might actually deliver 95 percent during failover, which happens twice per year.

Systematic Failures Hidden by Averages

A pipeline delivering daily data by 03:00 UTC at 99.9 percent success looks healthy in dashboards, but it fails systematically every month end, when transaction volume spikes 3x. Financial close reports are late 12 times per year, yet the annual SLA shows green. The fix requires percentile-based metrics and time-bucketed error budgets: track not just "99.9 percent success over a quarter" but "99.9 percent success in each week." This surfaces recurring issues that quarterly averages hide.

Orphaned Datasets

A critical dataset loses its active owner during a team reorganization. Contracts become stale, and SLAs are not updated as usage patterns change. New consumers build critical dependencies on top. Six months later an incident occurs and no one is accountable; the producing service may have been deprecated while data kept flowing out of sheer momentum.

Handling Failure Modes

First, extend validation beyond schema: check value ranges, distributions, and cardinality. Second, specify SLAs with regional scoping and disaster recovery clauses. Third, use time-bucketed error budgets to catch systematic issues. Fourth, implement ownership verification: quarterly audits ensuring each dataset has an active, accountable owner.
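Time-bucketed error budgets can be sketched in Python as below. The `weekly_violations` helper and the daily-run input format are assumptions for illustration; the exact target (the 99.9 percent in the text) would be set per contract and per run frequency. With daily runs, a looser target is used here so that the quarterly aggregate stays "green" while the failing month-end weeks are still flagged:

```python
from collections import defaultdict
from datetime import date, timedelta

def weekly_violations(runs, target):
    """runs: iterable of (date, succeeded) pairs.
    Return the ISO (year, week) buckets whose success rate misses target."""
    buckets = defaultdict(lambda: [0, 0])  # (year, week) -> [ok_count, total]
    for day, ok in runs:
        week = tuple(day.isocalendar())[:2]  # (ISO year, ISO week)
        buckets[week][0] += int(ok)
        buckets[week][1] += 1
    return sorted(w for w, (ok, total) in buckets.items() if ok / total < target)

# 90 daily runs in Q1 2024 that fail only on the last day of each month:
runs = []
for i in range(90):
    day = date(2024, 1, 1) + timedelta(days=i)
    runs.append((day, (day + timedelta(days=1)).day != 1))  # False on month end

overall = sum(ok for _, ok in runs) / len(runs)   # ~0.978: green at a 95% target
bad_weeks = weekly_violations(runs, target=0.95)  # but month-end weeks are flagged
```

The quarterly number passes while the weekly buckets containing January 31 and February 29 fail, which is exactly the recurring-failure signal the averaged SLA hides.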
💡 Key Takeaways
Silent semantic corruption is most dangerous: schemas stay compatible but semantics change (negative amounts for refunds), corrupting dashboards without alerts
Versioning transitions create mixed event streams: 30% nulls in new optional fields broke deduplication logic for consumers expecting complete data
Regional SLA gaps: contract promises 5 minute p95 freshness in region, but cross region replication shows 30 minute lag (6x worse than expected)
Systematic failures hide in averages: 99.9% quarterly success masks month end failures when volume spikes 3x; use weekly error budgets instead
Orphaned datasets lose owners during reorganizations; contracts become stale, and critical incidents have no accountable party unless quarterly ownership audits are in place
📌 Examples
1. Payment system added refund support with negative <code>amount</code> values; the contract validated type but not range, causing dashboards to misinterpret refunds as revenue drops
2. Event stream added optional <code>session_id</code>; during a 2-week migration, 30% of events had nulls, causing new consumers to create duplicate records and requiring a backfill
3. Pipeline with a 99.9% annual SLA failed every month end at 3x volume spike, missing financial close deadlines 12 times per year despite green dashboards