Lineage Granularity Levels: Table vs Column

The Granularity Spectrum:

Data lineage operates at different levels of detail, each with distinct trade offs. The two critical levels are table level and column level lineage, though cross system lineage adds another dimension.

Table Level Lineage:

This tracks dependencies between entire datasets. If table B depends on table A, you capture that relationship. Simple and cheap to compute. Your lineage system records that the job read from orders and customers tables, then wrote to order_summary. You know if orders changes, order_summary might be affected.

This works well for coarse impact analysis. If you're changing a table schema, table level lineage tells you which downstream jobs to check. At scale, a platform with 10,000 datasets might have 50,000 table level links, which is manageable.

Column Level Lineage:

This tracks that order_summary.total_price was derived specifically from orders.unit_price and orders.tax_rate. Much more powerful for governance and debugging. If a regulator asks where the customer_email field originates and where it flows, column level lineage gives precise answers.

The catch is complexity. To derive column level dependencies, you must parse query text, build abstract syntax trees, or instrument dataflow plans. For very wide tables with hundreds of columns, or heavily denormalized schemas, the number of column links explodes. BigQuery limits this to roughly 1,500 column level links per job to keep metadata volume manageable.

Storage & Compute Cost
50K
TABLE LINKS
5M+
COLUMN LINKS
Cross System Lineage:

This connects objects across platforms. A Kafka topic feeds a Spark job that writes to a warehouse that powers a BI dashboard. Each system uses different naming conventions and metadata formats. Cross system lineage stitches these together into a unified graph, letting you trace from business report all the way back to raw event logs.

At production scale with multi cloud deployments, this becomes essential. Your data might flow from AWS S3 to Google BigQuery to Tableau. Without cross system lineage, the graph has breaks, and you lose end to end visibility.

⚠️ Common Pitfall: Column level lineage requires parsing complex queries or instrumenting dataflow engines. User defined functions and ML transforms may not yield precise column mappings, leading to conservative over linking where many columns appear dependent even if they're not.

💡 Key Takeaways

✓Table level lineage is cheaper and sufficient for coarse impact analysis, typically generating 50,000 links for 10,000 datasets

✓Column level lineage provides precision for governance and compliance but can generate millions of links and requires query parsing or dataflow instrumentation

✓BigQuery and similar warehouses limit column level links to roughly 1,500 per job to control metadata volume and computation overhead

✓Cross system lineage connects multiple platforms (Kafka, Spark, warehouses, BI tools) into a unified graph, critical for multi cloud environments

✓Column level tracking becomes expensive for wide tables with hundreds of columns or denormalized schemas, forcing teams to choose which fields to track

📌 Interview Tips

1Table level: Job reads <code>orders</code> and <code>customers</code>, writes <code>order_summary</code>. If <code>orders</code> schema changes, you know to check <code>order_summary</code>.

2Column level: <code>order_summary.total_price</code> = <code>orders.unit_price</code> * <code>orders.quantity</code> + <code>orders.tax</code>. If you remove <code>orders.tax</code>, lineage shows exactly which downstream columns break.

3Cross system: Event originates in Kafka topic <code>user.clicks</code>, flows to Spark job <code>sessionize_clicks</code>, lands in BigQuery table <code>analytics.sessions</code>, feeds Tableau dashboard <code>daily_engagement</code>. One unbroken lineage chain.

4A compliance audit asks where <code>user.ssn</code> flows. Column level lineage traces it through 23 transformations across 5 systems, proving it never enters the marketing database.

← Back to Data Lineage Tracking & Visualization Overview