What is a Data Governance Framework?
Definition
A data governance framework is a set of policies, roles, and technical controls that define how data is created, classified, accessed, transformed, and retired across an organization. It ensures consistent, high-quality, compliant data usage at scale.
1. Data discovery and metadata: A catalog of datasets, schemas, owners, Service Level Agreements (SLAs), classifications, and lineage.
2. Data quality and observability: Automated checks on freshness, completeness, accuracy, and distribution shifts.
3. Security, privacy, and compliance: Classification of PII, access control, encryption, retention, and auditability.
4. Lineage and change management: End-to-end tracking of how data flows through pipelines and who changed what.
5. Lifecycle and ownership: Clear roles for data owners, with defined retention, deprecation, and documentation responsibilities.
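To make the second pillar concrete, here is a minimal Python sketch of two automated checks, freshness and completeness. The names `QualityResult`, `check_freshness`, and `check_completeness`, along with the thresholds used, are hypothetical and chosen for illustration; they are not taken from any particular observability tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class QualityResult:
    check: str
    passed: bool
    detail: str


def check_freshness(last_updated: datetime, max_age: timedelta, now: datetime) -> QualityResult:
    """Pass if the dataset was last updated no longer than max_age before now."""
    age = now - last_updated
    return QualityResult("freshness", age <= max_age, f"age={age}")


def check_completeness(rows: list, required: list, min_ratio: float) -> QualityResult:
    """Pass if the share of rows with every required field non-null is at least min_ratio."""
    if not rows:
        return QualityResult("completeness", False, "no rows")
    complete = sum(all(r.get(f) is not None for f in required) for r in rows)
    ratio = complete / len(rows)
    return QualityResult("completeness", ratio >= min_ratio, f"ratio={ratio:.2f}")


# Hypothetical sample: a table updated 3 minutes ago against a 5-minute limit,
# and two rows of which one is missing a required field.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
fresh = check_freshness(now - timedelta(minutes=3), timedelta(minutes=5), now)
rows = [{"id": 1, "amount": 10}, {"id": 2, "amount": None}]
complete = check_completeness(rows, ["id", "amount"], min_ratio=0.9)
```

In this sample the freshness check passes and the completeness check fails, because only half the rows carry every required field. A real system would run such checks on a schedule and publish the results to the catalog.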
💡 Key Takeaways
✓ Data governance is not a compliance checkbox but a set of machine-enforceable policies embedded in data infrastructure.
✓ Without governance, large organizations face incidents such as incorrect billing, privacy violations, and broken ML models caused by inconsistent data usage.
✓ The five pillars are metadata/discovery, quality/observability, security/privacy, lineage/change tracking, and lifecycle/ownership.
✓ At scale (100K+ datasets), manual governance breaks down, so metadata collection and policy enforcement must be automated.
✓ Companies like LinkedIn use DataHub as a central metadata system, with p99 latency targets under 200 ms for catalog queries.
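As a rough illustration of what "machine-enforceable" can mean, the sketch below turns a classification label into an access decision. The three-level `Classification` scale, the `Principal` type, and the `can_read` rule are assumptions made for this example; real systems usually combine classification with roles, purpose, and audit logging.

```python
from dataclasses import dataclass
from enum import IntEnum


class Classification(IntEnum):
    # Hypothetical sensitivity scale; higher means more sensitive.
    PUBLIC = 0
    INTERNAL = 1
    RESTRICTED = 2  # for example, data that contains PII


@dataclass(frozen=True)
class Dataset:
    name: str
    owner: str
    classification: Classification


@dataclass(frozen=True)
class Principal:
    name: str
    clearance: Classification


def can_read(principal: Principal, dataset: Dataset) -> bool:
    """Allow a read only when the principal's clearance covers the dataset's classification."""
    return principal.clearance >= dataset.classification


# Hypothetical dataset and users.
payments = Dataset("payment_events", "payments team", Classification.RESTRICTED)
analyst = Principal("analyst", Classification.INTERNAL)
auditor = Principal("auditor", Classification.RESTRICTED)
```

Here `can_read(analyst, payments)` is denied and `can_read(auditor, payments)` is allowed. Because the rule is code, it can run on every query instead of relying on manual review.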
📌 Examples
1. A streaming system ingesting 500,000 events per second at 1 KB each produces roughly 500 MB per second (about 1.8 TB per hour uncompressed), so governance must track data classification, owners, and quality metrics across all downstream consumers.
2. When onboarding a new payment events table, engineers register it with an owner (payments team), a classification (restricted, contains PII), a freshness SLA (updated every 5 minutes), and a retention policy (raw data for 30 days, aggregates for 2 years).
3. LinkedIn's DataHub manages lineage graphs with millions of nodes and edges, tracking relationships like "dataset B was produced from A using job J version 3".
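The registration and lineage examples can be sketched in Python as follows. This is a toy in-memory model, not DataHub's API: `CatalogEntry`, `LineageGraph`, the job `K`, and the dataset `C` are hypothetical names, and the 2-year retention is approximated as 730 days.

```python
from collections import defaultdict, deque
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class CatalogEntry:
    name: str
    owner: str
    classification: str
    freshness_sla: timedelta
    retention: dict  # data layer -> how long to keep it


class LineageGraph:
    """Directed graph: an edge source -> target means target was produced from source."""

    def __init__(self):
        self._downstream = defaultdict(list)  # source -> [(target, job, version)]

    def add_edge(self, source, target, job, version):
        self._downstream[source].append((target, job, version))

    def downstream(self, source):
        """Return every dataset derived from source, in breadth-first order."""
        seen, order, queue = {source}, [], deque([source])
        while queue:
            node = queue.popleft()
            for target, _job, _version in self._downstream.get(node, ()):
                if target not in seen:
                    seen.add(target)
                    order.append(target)
                    queue.append(target)
        return order


# Register the payment events table with the attributes from example 2.
payment_events = CatalogEntry(
    name="payment_events",
    owner="payments team",
    classification="restricted",
    freshness_sla=timedelta(minutes=5),
    retention={"raw": timedelta(days=30), "aggregates": timedelta(days=730)},
)

# Record "dataset B was produced from A using job J version 3", plus one
# hypothetical further hop from B to C.
graph = LineageGraph()
graph.add_edge("A", "B", job="J", version=3)
graph.add_edge("B", "C", job="K", version=1)
impacted = graph.downstream("A")  # ['B', 'C']
```

Walking the graph downstream from `A` answers the impact question "what breaks if A changes?". Production lineage systems answer the same question over a much larger graph, stored persistently.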