Lineage Granularity: Table vs Column vs Row Level Trade-offs
Table Level Lineage
Lineage granularity determines what you can trace and at what cost. Table level lineage tracks dependencies between datasets as whole units, answering which tables feed into which downstream tables. This is cheap to collect and query, scaling to millions of tables with minimal overhead, but it provides only coarse impact analysis: if an upstream table's schema changes, you know that all downstream tables are affected, but not which specific columns or rows break.
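A table level graph is just an adjacency structure over table names, and impact analysis is a graph traversal. The sketch below uses hypothetical table names and a breadth-first search; it is illustrative, not a particular tool's API.

```python
from collections import deque

# Hypothetical table-level lineage graph: each key feeds the tables in its list.
lineage = {
    "raw_orders": ["stg_orders"],
    "raw_customers": ["stg_customers"],
    "stg_orders": ["fct_revenue"],
    "stg_customers": ["fct_revenue"],
    "fct_revenue": ["dashboard_sales"],
}

def downstream(table, graph):
    """Return every table transitively affected by a change to `table`."""
    seen, queue = set(), deque(graph.get(table, []))
    while queue:
        t = queue.popleft()
        if t not in seen:
            seen.add(t)
            queue.extend(graph.get(t, []))
    return seen

print(sorted(downstream("raw_orders", lineage)))
# → ['dashboard_sales', 'fct_revenue', 'stg_orders']
```

Because edges connect whole tables, the traversal tells you the blast radius of a change but nothing about which columns inside those tables are actually affected.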
Column Level Lineage
Column level lineage maps individual columns through transformations, tracking, for example, that an output column revenue depends on input columns price and quantity through a multiplication. This enables precise schema evolution safety: when renaming or dropping a column, you see exactly which downstream columns lose their inputs. The overhead is moderate. Metadata size grows with column count times transformation fan-out, and collection requires parsing query plans or instrumenting data frames to track column projections. At enterprise scale, with thousands of tables averaging 50 columns each, this remains tractable at tens of millions of edges.
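At this granularity, edges connect (table, column) pairs instead of tables. A minimal sketch, using the revenue example above with hypothetical table names, showing how a drop or rename check reduces to a reverse lookup over those edges:

```python
# Hypothetical column-level edges:
# (output table, output column) -> list of input (table, column) pairs it reads.
column_lineage = {
    ("fct_revenue", "revenue"): [("stg_orders", "price"), ("stg_orders", "quantity")],
    ("fct_revenue", "order_day"): [("stg_orders", "order_ts")],
}

def affected_columns(table, column, edges):
    """Downstream columns that lose an input if (table, column) is dropped or renamed."""
    return [out for out, inputs in edges.items() if (table, column) in inputs]

print(affected_columns("stg_orders", "price", column_lineage))
# → [('fct_revenue', 'revenue')]
```

This lookup is one hop; a full safety check would repeat it transitively, the same traversal as at table level but over the larger column-pair edge set.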
Row Level Lineage
Row level lineage traces individual output rows back to the specific input rows that produced them, which is essential for auditing PII flows or right-to-be-forgotten compliance. The cost is severe: storage inflates 5 to 20 times because you maintain per-row mappings, and compute overhead roughly doubles because joins must carry provenance vectors.
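The provenance vector is the source of that cost: every output row carries the identifiers of the input rows it came from. A minimal sketch of a join that propagates provenance, with made-up row data:

```python
# Hypothetical input rows, each with a stable row ID.
orders = [{"id": "o1", "cust": "c1", "amount": 10},
          {"id": "o2", "cust": "c2", "amount": 25}]
customers = [{"id": "c1", "region": "EU"},
             {"id": "c2", "region": "US"}]

def join_with_provenance(orders, customers):
    """Inner join that attaches a provenance vector to every output row."""
    by_id = {c["id"]: c for c in customers}
    out = []
    for o in orders:
        c = by_id.get(o["cust"])
        if c is not None:
            out.append({
                "amount": o["amount"],
                "region": c["region"],
                # Per-row mapping back to both inputs -- this extra column
                # is where the 5-20x storage inflation comes from.
                "provenance": [("orders", o["id"]), ("customers", c["id"])],
            })
    return out

rows = join_with_provenance(orders, customers)
print(rows[0]["provenance"])
# → [('orders', 'o1'), ('customers', 'c1')]
```

A right-to-be-forgotten request then becomes a scan for output rows whose provenance mentions the deleted input row.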
Practical Strategy
Reserve row level lineage for subsets flagged by policy, such as financial transactions or medical records, or use sampling to trace a representative 1 to 10 percent of rows for debugging rather than full coverage.
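One way to implement that policy is a per-row gate that traces flagged records plus a deterministic hash-based sample, so the same row is always either traced or not across reruns. A sketch under assumed names (the 5 percent rate and the flagged-ID set are illustrative):

```python
import hashlib

SAMPLE_RATE = 0.05  # assumed: trace ~5% of unflagged rows for debugging

def should_trace(row_id, flagged_ids, rate=SAMPLE_RATE):
    """Trace policy-flagged rows always; sample the rest deterministically by hash."""
    if row_id in flagged_ids:
        return True
    # Hashing the row ID gives a stable pseudo-random draw per row,
    # unlike random.random(), which would vary between pipeline runs.
    h = int(hashlib.sha256(row_id.encode()).hexdigest(), 16)
    return (h % 10_000) < rate * 10_000

flagged = {"txn_financial_001"}  # hypothetical policy-flagged transaction
print(should_trace("txn_financial_001", flagged))
# → True (flagged by policy, always traced)
```

Deterministic sampling matters for debugging: a problematic row that fell inside the sample on one run will still be traced when the pipeline is rerun.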