What is a Data Catalog System?
Definition
A data catalog system is a centralized metadata repository that helps engineers and analysts discover, understand, and trust data across their organization by tracking what data exists, where it lives, how it's produced, and who uses it.
💡 Key Takeaways
✓ Solves the discovery, trust, and impact-analysis problems that emerge once an organization has hundreds of data stores and thousands of tables
✓ Stores metadata only (schema, lineage, ownership, quality), not the actual data, acting as a search and knowledge layer on top of your data platform
✓ Integrates with other systems through event ingestion and periodic crawling, typically becoming consistent within 30 to 120 seconds
✓ Does not sit on the query path, so it can be eventually consistent without affecting data processing performance
✓ At large companies, may cover tens of thousands of tables describing petabytes of data, and serve thousands of daily users at 99.9 percent availability
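To make the "metadata only" and "event ingestion" points concrete, here is a minimal in-memory sketch in Python. It is an illustration under stated assumptions, not how any particular catalog product works: the `TableMetadata` and `Catalog` names, the event payload fields (`table`, `location`, `schema`, `owner`, `upstream`), and the `warehouse://` location are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class TableMetadata:
    """A catalog entry: it describes a table and holds none of its rows."""
    name: str
    location: str                       # where the data lives (hypothetical URI)
    schema: dict[str, str]              # column name -> column type
    owner: str
    upstream: list[str] = field(default_factory=list)  # lineage: source tables
    updated_at: Optional[datetime] = None


class Catalog:
    """Toy metadata store; a real catalog would persist and index entries."""

    def __init__(self) -> None:
        self._entries: dict[str, TableMetadata] = {}

    def ingest_event(self, event: dict) -> TableMetadata:
        """Upsert an entry from a metadata event (payload shape is assumed)."""
        entry = TableMetadata(
            name=event["table"],
            location=event["location"],
            schema=dict(event["schema"]),
            owner=event.get("owner", "unknown"),
            upstream=list(event.get("upstream", [])),
            updated_at=datetime.now(timezone.utc),
        )
        self._entries[entry.name] = entry
        return entry

    def search(self, term: str) -> list[str]:
        """Naive case-insensitive substring match on table and column names."""
        term = term.lower()
        return sorted(
            e.name
            for e in self._entries.values()
            if term in e.name.lower() or any(term in col.lower() for col in e.schema)
        )


catalog = Catalog()
catalog.ingest_event({
    "table": "analytics.monthly_active_users",
    "location": "warehouse://analytics/monthly_active_users",
    "schema": {"month": "DATE", "user_count": "BIGINT"},
    "owner": "growth-team",
    "upstream": ["raw.events"],
})
print(catalog.search("monthly"))
```

Because the catalog only receives events and never serves queries against the data itself, a delay between the table being written and `ingest_event` running affects discovery, not data processing.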
📌 Interview Tips
1. When an analyst searches for "monthly active users", the catalog returns ranked results in under 300 ms, showing which tables are blessed for that metric, who owns them, when they were last updated, and which dashboards sit downstream
2. When an engineer wants to deprecate a legacy table, the catalog queries its lineage graph to show all downstream jobs and dashboards that would break, preventing accidental outages
3. A data pipeline writes a new table to the warehouse at 2:00 PM. The warehouse emits a metadata event. By 2:02 PM the catalog has discovered the table, inferred its schema, tagged the owner, and made it searchable
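The deprecation scenario in tip 2 amounts to a graph traversal. The sketch below assumes lineage is stored as an adjacency map from each asset to its direct consumers; the `downstream_of` function and all asset names are hypothetical, and a production catalog would typically run this against a graph store rather than a Python dict.

```python
from collections import deque


def downstream_of(edges: dict[str, list[str]], start: str) -> list[str]:
    """Return every asset reachable from `start`, in breadth-first order."""
    seen = {start}
    queue = deque([start])
    impacted: list[str] = []
    while queue:
        node = queue.popleft()
        for child in edges.get(node, []):
            if child not in seen:       # guard against revisiting shared or cyclic paths
                seen.add(child)
                impacted.append(child)
                queue.append(child)
    return impacted


# Hypothetical lineage: asset -> direct consumers (jobs, tables, dashboards)
lineage = {
    "legacy.users": ["job.build_mau", "job.export_users"],
    "job.build_mau": ["analytics.monthly_active_users"],
    "analytics.monthly_active_users": ["dashboard.growth"],
}

print(downstream_of(lineage, "legacy.users"))
# ['job.build_mau', 'job.export_users', 'analytics.monthly_active_users', 'dashboard.growth']
```

Breadth-first order lists direct consumers first, which is usually the order in which an engineer would want to contact owners before retiring the table.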