Failure Modes and Production Edge Cases
The Worst Case: Inaccurate Metadata:
Stale or wrong metadata is worse than no metadata at all. If your catalog claims a table is certified and fresh, but it's actually deprecated and hasn't been updated in 6 months, analysts will lose trust. They'll revert to asking colleagues or verifying manually, defeating the catalog's purpose.
Common causes include failed connectors that stop emitting events, schema change events that get dropped in message queues, and lineage extraction bugs that misparse queries. At scale, with dozens of integrations, you must assume some fraction of events will be delayed or lost.
Companies mitigate this by parsing query logs to discover hidden dependencies, encouraging all production workloads to run through managed orchestration where lineage is captured automatically, and using query fingerprinting to identify tables that are accessed frequently even when no tracked job reads them. Some add manual declaration: before deploying a model, engineers must register its input dependencies in the catalog.
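As a rough sketch of the query-log side of this, assuming raw SQL text is available from the warehouse's query history: the regex-based table extraction below is a stand-in for a real SQL parser, and tracked_tables stands in for whatever set of inputs the orchestrator already reports. None of these names come from a specific catalog product.

```python
import hashlib
import re
from collections import Counter

# Illustrative only: a real system would use a proper SQL parser; this
# regex handles simple FROM/JOIN clauses.
TABLE_REF = re.compile(r"\b(?:from|join)\s+([a-zA-Z_][\w.]*)", re.IGNORECASE)

def extract_tables(sql: str) -> set:
    """Tables referenced in FROM/JOIN clauses of a raw query."""
    return {name.lower() for name in TABLE_REF.findall(sql)}

def fingerprint(sql: str) -> str:
    """Normalize literals and whitespace so repeated ad hoc queries that
    differ only in parameter values collapse to a single fingerprint."""
    normalized = re.sub(r"'[^']*'", "?", sql)           # string literals
    normalized = re.sub(r"\b\d+\b", "?", normalized)    # numeric literals
    normalized = re.sub(r"\s+", " ", normalized).strip().lower()
    return hashlib.sha256(normalized.encode()).hexdigest()[:16]

def hidden_dependencies(query_log, tracked_tables):
    """Count accesses to tables that no tracked job declares as an input."""
    hits = Counter()
    for sql in query_log:
        for table in extract_tables(sql) - set(tracked_tables):
            hits[table] += 1
    return hits
```

Tables that surface here with high access counts are natural candidates for manual dependency registration before anyone changes their schema.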
Even with these measures, you'll miss some dependencies. The best remaining defense is to make schema changes defensively: deprecate columns with long warning periods, prefer evolution-friendly changes such as adding nullable columns over changing types, and monitor error rates after each change to catch unexpected breakage.
Permission Sync Failures:
The catalog might think a user has access to a dataset when the warehouse denies them, or vice versa. This creates confusing UX: search shows a table, but clicking through to query it returns a permission-denied error.
Root causes include eventual consistency in permission propagation, the catalog and warehouse using different identity providers, and manual permission changes in the warehouse that bypass catalog APIs. These mismatches can also create security risks if the catalog incorrectly grants visibility to sensitive data.
Production solutions make the catalog a read-only mirror of warehouse permissions, with periodic reconciliation every 5 to 15 minutes to sync state. Some companies go further and make the catalog the source of truth, with warehouse permissions driven by catalog policy, but this requires tighter integration.
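A minimal sketch of the read-only-mirror approach, assuming permissions on both sides can be flattened into (principal, dataset, privilege) grants. The Grant shape is an assumption, and the calls that fetch grants from the warehouse and write them back to the catalog are left out.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Grant:
    principal: str   # user or group identity
    dataset: str     # fully qualified table or view name
    privilege: str   # e.g. "SELECT"

def reconcile_permissions(warehouse_grants: set, catalog_grants: set) -> dict:
    """Diff the two views; the warehouse is the source of truth, so the
    catalog is updated to match, never the other way around."""
    return {
        # visible in the catalog but already revoked in the warehouse
        "revoke_in_catalog": catalog_grants - warehouse_grants,
        # granted in the warehouse but not yet reflected in the catalog
        "grant_in_catalog": warehouse_grants - catalog_grants,
    }

# Run on a 5-15 minute schedule. Anything in revoke_in_catalog is also worth
# logging as a potential security finding, since the catalog was exposing
# visibility the user no longer has.
```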
Graph Query Explosions:
Even with limits, some graph topologies cause performance problems. A highly connected hub node (like a central user dimension table) might connect to thousands of downstream tables. Traversing from that hub can still time out.
Production systems detect hub nodes during indexing and apply special handling: precompute immediate neighbors, cache results aggressively, and show a warning in the UI that the node has over 500 connections and only a subset is displayed. Users can click to load more, but the default view stays fast.
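A sketch of that special handling, assuming the lineage graph fits in memory as a simple edge list; the 500-connection threshold and 100-item page size echo the numbers above but are otherwise arbitrary choices for illustration.

```python
from collections import defaultdict

HUB_THRESHOLD = 500   # flag nodes with more downstream edges than this
PAGE_SIZE = 100       # default number of neighbors returned to the UI

def index_lineage(edges):
    """Build adjacency lists once at indexing time and mark hub nodes."""
    downstream = defaultdict(list)
    for src, dst in edges:
        downstream[src].append(dst)
    hubs = {node for node, outs in downstream.items() if len(outs) > HUB_THRESHOLD}
    return downstream, hubs

def lineage_page(node, downstream, hubs, offset=0):
    """Bounded view for the UI: full count, first page of neighbors, and a
    hub flag so the front end can show the 'subset displayed' warning."""
    children = downstream.get(node, [])
    return {
        "node": node,
        "total_downstream": len(children),
        "is_hub": node in hubs,
        "children": children[offset:offset + PAGE_SIZE],
    }
```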
Another edge case is cycles in the lineage graph. If table A feeds job B which writes to table C which somehow feeds back to table A, naive traversal can loop forever. Graph queries must track visited nodes and detect cycles, showing them explicitly in the UI as "circular dependency detected."
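A minimal cycle-aware traversal over the same kind of adjacency map, shown as a sketch rather than how any particular catalog implements it: the DFS keeps the current path on a stack, so an edge back into that path is recorded as a cycle instead of recursed into.

```python
def traverse_downstream(start, downstream):
    """DFS that records reachable nodes and reports genuine cycles instead
    of looping forever."""
    reached, cycles = [], []
    on_path = []        # current DFS path, in order
    done = set()        # fully explored nodes

    def visit(node):
        on_path.append(node)
        for child in downstream.get(node, []):
            if child in on_path:
                # edge back into the current path => circular dependency
                cycles.append(on_path[on_path.index(child):] + [child])
            elif child not in done:
                reached.append(child)
                visit(child)
        on_path.pop()
        done.add(node)

    visit(start)
    return reached, cycles

# Example: A -> B -> C -> A, plus C -> D
# traverse_downstream("A", {"A": ["B"], "B": ["C"], "C": ["A", "D"], "D": []})
# returns (["B", "C", "D"], [["A", "B", "C", "A"]]); the UI can render the
# second element as "circular dependency detected".
```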
❗ Remember: Design reconciliation from day one. Periodic full rescans of sources catch missed events and correct drift between catalog state and reality.
Production systems run reconciliation jobs that rescan warehouse schemas every 6 to 24 hours, comparing catalog state with source truth. When discrepancies are found (a table exists in the warehouse but not the catalog, or a schema differs), automated alerts fire and self-healing logic attempts to correct the catalog. This two-layer approach (real-time events plus periodic reconciliation) keeps accuracy above 99 percent even when individual events fail.
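The diff at the heart of such a job might look like the sketch below, assuming both sides can be snapshotted as a mapping from table name to column types; connector fetches, alert routing, and the self-healing writes themselves are omitted.

```python
def reconcile_schemas(warehouse, catalog):
    """Compare a warehouse schema snapshot against the catalog's view and
    bucket discrepancies by how they should be handled."""
    missing_in_catalog = warehouse.keys() - catalog.keys()   # safe to auto-register
    stale_in_catalog = catalog.keys() - warehouse.keys()     # safe to auto-deprecate
    drifted = {
        table for table in warehouse.keys() & catalog.keys()
        if warehouse[table] != catalog[table]                # needs alert or review
    }
    return {
        "auto_register": sorted(missing_in_catalog),
        "auto_deprecate": sorted(stale_in_catalog),
        "alert_on_drift": sorted(drifted),
    }

# Both arguments are snapshots shaped like:
#   {"analytics.orders": {"order_id": "bigint", "amount": "numeric"}, ...}
```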
The Hidden Consumer Problem:
Lineage systems capture dependencies through orchestrated jobs and query logs. But what about ad hoc queries run directly on the warehouse, outside managed pipelines? A data scientist might run a one-off analysis that reads from customer_profiles, builds a machine learning model, then deploys that model to production. The catalog never sees this dependency.
Months later, an engineer drops a column from customer_profiles. The catalog shows no downstream consumers, so it looks safe. The deployed model breaks, causing production incidents.
Incomplete Lineage Timeline: Week 1, ad hoc query → Week 8, schema change → Result: model breaks.
💡 Key Takeaways
✓Inaccurate metadata is worse than no metadata; periodic reconciliation jobs that rescan sources every 6 to 24 hours catch missed events and maintain over 99 percent accuracy
✓Ad hoc queries outside managed orchestration create hidden dependencies; companies parse query logs and use fingerprinting but still miss some, requiring defensive schema change practices
✓Permission mismatches between catalog and warehouse cause confusing UX and security risks; solutions mirror permissions with 5 to 15 minute reconciliation or make catalog the source of truth
✓Hub nodes with thousands of connections can cause graph query timeouts; precompute neighbors, cache aggressively, and show subset warnings in UI
✓Cycles in lineage graphs can cause infinite loops; graph traversal must track visited nodes and explicitly detect and display circular dependencies
📌 Examples
1. A connector to Redshift fails silently for 8 hours. The catalog shows tables as up to date when they're actually stale. The nightly reconciliation job at 2 AM detects 500 tables with schema drift, auto-corrects most, and pages the on-call engineer for 20 that require manual review.
2. A data scientist runs a query on user_events to build a churn model and deploys it to production. Two months later an engineer renames user_events to events_v2. The catalog shows zero dependencies. The model breaks, causing 4 hours of downtime before the connection is discovered.
3. The dim_users table connects to 2,000 downstream tables. When someone views its lineage, the UI loads the first 100 in 180 ms, shows a badge that says "2,000 total downstream dependencies", and offers async loading for the rest to prevent timeouts.