Data Governance & Lineage • Data Lineage Tracking & Visualization
Lineage Trade-Offs: When to Go Deep
The Granularity Decision:
Choosing lineage granularity is not about picking the best option. It's about matching capability to actual need while controlling cost and complexity.
Table-level lineage works when you need coarse dependency tracking. If you're doing pre-deployment impact analysis (will changing this table break downstream jobs?), table-level suffices. Your team can review affected jobs manually. The metadata volume is manageable: 10,000 tables might have 50,000 dependency links.
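As a concrete sketch (table names are hypothetical, and networkx stands in for whatever graph store you use), impact analysis at table granularity is just a reachability query over the dependency graph:

```python
import networkx as nx

# Table-level lineage as a directed graph: an edge u -> v means v is built from u.
lineage = nx.DiGraph([
    ("raw.orders", "staging.orders_clean"),
    ("raw.users", "staging.orders_clean"),
    ("staging.orders_clean", "analytics.daily_revenue"),
    ("staging.orders_clean", "analytics.customer_ltv"),
])

def impacted(graph: nx.DiGraph, table: str) -> set[str]:
    """Everything downstream of `table`: the jobs a schema change could break."""
    return nx.descendants(graph, table)

# Pre-deployment check: will changing raw.orders break downstream jobs?
print(impacted(lineage, "raw.orders"))
# {'staging.orders_clean', 'analytics.daily_revenue', 'analytics.customer_ltv'}
```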
Table Level: cheap, fast, good for impact analysis. Covers 80 to 90% of use cases.
vs
Column Level: precise for governance, but 100x more links and parsing overhead.

Column-level lineage becomes necessary when you must answer precise questions. Compliance asks: does user.ssn ever flow to the marketing database? Table-level lineage says the users table feeds marketing.campaigns, but not whether that specific column does. Column-level lineage gives a definitive answer.
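A minimal sketch of that definitive check, under the same hypothetical naming: model edges between individual columns and test whether the SSN reaches anything in the marketing schema.

```python
import networkx as nx

# Column-level lineage: edges connect individual columns, not whole tables.
col_lineage = nx.DiGraph([
    ("users.ssn", "staging.users_masked.ssn_hash"),  # hashed, stays internal
    ("users.email", "marketing.campaigns.email"),
    ("users.id", "marketing.campaigns.user_id"),
])

def flows_to(graph: nx.DiGraph, source_col: str, target_prefix: str) -> bool:
    """Does source_col reach any column under the given schema prefix?"""
    return any(c.startswith(target_prefix) for c in nx.descendants(graph, source_col))

# Table-level lineage only says users feeds marketing.campaigns. Column level
# shows the SSN itself never lands there:
print(flows_to(col_lineage, "users.ssn", "marketing."))    # False
print(flows_to(col_lineage, "users.email", "marketing."))  # True
```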
The cost is real. Those same 10,000 tables, with an average of 50 columns each, create 5 million+ potential column links. Query parsing or dataflow instrumentation adds compute overhead. Warehouses like BigQuery cap column-level links at 1,500 per job specifically to avoid metadata explosions.
Automation vs Manual Curation:
Automatic lineage extraction from query logs covers 80 to 90% of flows with zero developer effort. Cloud warehouses do this natively. Spark and Flink have integrations via OpenLineage or proprietary APIs. High adoption, low friction.
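Under the hood, extraction usually means parsing statements from the query log. A hedged sketch using sqlglot (table names are made up, and treating the first table in a CREATE ... AS or INSERT statement as the write target is a simplification of what production parsers do):

```python
import sqlglot
from sqlglot import exp

def table_edges(sql: str, dialect: str = "bigquery") -> list[tuple[str, str]]:
    """Extract (source -> target) table-level edges from one logged statement.

    Simplification: the first table in a CREATE/INSERT statement is assumed
    to be the write target; every other table referenced is a read.
    """
    tree = sqlglot.parse_one(sql, dialect=dialect)
    tables = [exp.table_name(t) for t in tree.find_all(exp.Table)]
    target, sources = tables[0], dict.fromkeys(tables[1:])  # dedupe, keep order
    return [(src, target) for src in sources if src != target]

sql = """
CREATE OR REPLACE TABLE analytics.daily_revenue AS
SELECT o.order_date, SUM(o.amount * fx.rate) AS revenue
FROM staging.orders_clean AS o
JOIN staging.fx_rates AS fx ON o.currency = fx.currency
GROUP BY o.order_date
"""
print(table_edges(sql))
# [('staging.orders_clean', 'analytics.daily_revenue'),
#  ('staging.fx_rates', 'analytics.daily_revenue')]
```

The point is that parsing yields edges with zero changes to the pipelines themselves, which is why adoption is so cheap.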
The downside: automatic lineage misses context. It knows table_a feeds table_b, but not that table_b is the golden source for revenue reporting. It doesn't capture manual data imports, spreadsheet uploads, or business semantics.
Manual annotations via data catalogs fix this. Teams tag critical datasets, document business owners, and link to runbooks. This increases accuracy and usefulness for non-technical stakeholders. The trade-off is ongoing process: someone must maintain annotations as pipelines evolve.
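What that curation layer holds is simple. A hypothetical annotation record (field names invented for illustration) captures exactly the context no query-log parser can infer:

```python
# Manual context layered on top of automatic lineage, keyed by dataset name.
annotations = {
    "analytics.daily_revenue": {
        "tier": "critical",  # golden source for revenue reporting
        "owner": "finance-data@example.com",
        "runbook": "https://wiki.example.com/runbooks/daily-revenue",
        "note": "Also fed by a monthly FX spreadsheet upload, invisible to query logs.",
    },
}
```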
In practice, start with automatic lineage to get broad coverage fast. Layer manual curation on the 20% of datasets that matter most for compliance, revenue, or critical operations.
Vendor Integration vs Open Standards:
Vendor solutions like Google Dataplex or AWS Glue give deep integration and polished UIs. BigQuery lineage appears automatically. The visualization is production-ready. No infrastructure to manage.
The fragmentation risk is real. If your stack spans BigQuery, Snowflake, Databricks, and Kafka, vendor lineage gives you a separate graph per platform. Cross-system flows have gaps. During an incident, you're mentally stitching together multiple tools.
OpenLineage offers a unified schema. Spark, Flink, Airflow, and warehouses can all emit to a central lineage backend you control. One graph, one query API, full visibility across your heterogeneous stack.
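A sketch of hand-instrumenting a custom job with the OpenLineage Python client (the backend URL, namespaces, and job names are placeholders, and module paths can differ between client versions); the Airflow, Spark, and Flink integrations emit equivalent events automatically:

```python
from datetime import datetime, timezone
from uuid import uuid4

from openlineage.client import OpenLineageClient
from openlineage.client.run import Dataset, Job, Run, RunEvent, RunState

client = OpenLineageClient(url="http://lineage.internal:5000")  # your central backend

# One event describes one run of one job, with its input and output datasets.
client.emit(RunEvent(
    eventType=RunState.COMPLETE,
    eventTime=datetime.now(timezone.utc).isoformat(),
    run=Run(runId=str(uuid4())),
    job=Job(namespace="etl", name="daily_revenue_build"),
    producer="https://example.com/jobs/daily_revenue_build",
    inputs=[Dataset(namespace="bigquery", name="staging.orders_clean")],
    outputs=[Dataset(namespace="bigquery", name="analytics.daily_revenue")],
))
```

Because every producer emits the same event schema, the backend can stitch Spark, Airflow, and warehouse activity into one graph.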
The trade-off: you own the reliability and scalability of that central service. You must handle ingestion spikes, query optimization, and retention. For smaller teams, vendor solutions are faster. For large multi-cloud platforms, open standards avoid fragmentation.
"The question isn't whether to implement lineage. It's what granularity you need and whether you can afford the operational overhead of doing it right."
💡 Key Takeaways
✓ Table-level lineage handles 80 to 90% of use cases at roughly 100x lower cost than column-level, ideal for impact analysis and schema-change planning
✓ Column-level lineage is mandatory for compliance and governance questions that require proof of specific data flows, but it generates millions of links for large schemas
✓ Automatic lineage covers most flows with zero friction but misses the business context, manual imports, and semantic meaning that manual curation provides
✓ Vendor lineage solutions offer zero ops overhead and polished UIs but fragment visibility across multi-cloud stacks, while OpenLineage unifies at the cost of owning infrastructure
✓ The right strategy starts with automatic table-level lineage for broad coverage, then adds column-level tracking and manual curation to the 20% of datasets that are compliance-critical
📌 Examples
1. A financial services company uses column-level lineage to prove to auditors that customer.ssn never flows to any third-party analytics system, satisfying regulatory requirements.
2. A startup with 500 tables and a single cloud uses BigQuery native lineage with zero configuration. As they grow to 5,000 tables across AWS and GCP, they migrate to OpenLineage to unify the graph.
3. An e-commerce platform uses table-level lineage for daily operations (impact analysis, debugging). They add column-level tracking only for PII fields (email, address, payment info) to satisfy GDPR data-mapping requirements.
4. A media company annotates their top 100 datasets manually in their catalog with business owners and SLAs. Automatic lineage handles the other 9,900 tables, giving 90% coverage at 10% of the curation effort.