Centralized vs Federated: Catalog Architecture Trade-offs
The Governance Tension:
Designing a discovery system forces a fundamental choice: a centralized catalog with global control, or federated catalogs with domain autonomy. This is not just an implementation detail. It shapes how fast teams can move, how consistent governance is, and where bottlenecks appear.
Centralized: Control and Consistency:
A centralized catalog means one metadata store, one schema, one set of policies. All teams publish metadata to the same system. This simplifies search: users query one place and get globally consistent results. Governance is easier because you can enforce naming conventions, require ownership tags, and apply classification rules uniformly.
The downside is scalability of operations, not technology. The central team becomes a bottleneck for onboarding new data sources, approving schema changes, and handling support requests. If 200 teams want to publish datasets, the central team reviews 200 integration requests. Latency to add a new source can stretch to weeks.
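To make uniform enforcement concrete, here is a minimal Python sketch of a publish-time check that a central catalog might run on every submitted entry. The naming pattern, the allowed classification values, and the names `DatasetMetadata` and `validate` are assumptions made for illustration; they are not the API of any particular catalog product.

```python
import re
from dataclasses import dataclass
from typing import List

# Assumed policy: names look like "<domain>.<dataset>" in snake_case.
NAME_PATTERN = re.compile(r"^[a-z][a-z0-9_]*\.[a-z][a-z0-9_]*$")
# Assumed classification vocabulary; a real one comes from the governance team.
ALLOWED_CLASSIFICATIONS = {"public", "internal", "confidential", "pii"}


@dataclass
class DatasetMetadata:
    name: str
    owner: str = ""
    classification: str = ""


def validate(meta: DatasetMetadata) -> List[str]:
    """Return the list of policy violations; an empty list means accepted."""
    violations = []
    if not NAME_PATTERN.match(meta.name):
        violations.append("name must match <domain>.<dataset> in snake_case")
    if not meta.owner:
        violations.append("owner tag is required")
    if meta.classification not in ALLOWED_CLASSIFICATIONS:
        violations.append("classification must be one of the allowed values")
    return violations
```

Because every team publishes through the same check, the rules apply uniformly. The same single entry point is also where the operational bottleneck tends to form when each new source needs manual review.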
Federated: Autonomy and Scale:
Federated catalogs follow the Data Mesh philosophy: domains own their catalogs and a global search layer aggregates them. Each domain team manages their own metadata, schema evolution, and documentation. The central platform provides standards and APIs, but domains have autonomy.
This scales operationally: the payments domain can add 50 new datasets without asking a central team for permission. But it makes global consistency harder. Domains might use different naming conventions or classification schemes, and cross-domain lineage becomes complex when you need to trace data through multiple federated catalogs.
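The aggregation role of the global search layer can be sketched as a simple fan-out and merge. This is an illustrative Python sketch only: `DomainCatalog`, `SearchHit`, and `global_search` are hypothetical names, and the in-memory dictionaries stand in for whatever store each domain actually runs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SearchHit:
    domain: str
    dataset: str
    description: str


class DomainCatalog:
    """Hypothetical in-memory stand-in for a catalog owned by one domain."""

    def __init__(self, domain, datasets):
        self.domain = domain
        self._datasets = datasets  # dataset name -> description

    def search(self, term):
        term = term.lower()
        return [
            SearchHit(self.domain, name, desc)
            for name, desc in self._datasets.items()
            if term in name.lower() or term in desc.lower()
        ]


def global_search(catalogs, term):
    """Fan the query out to every domain catalog and merge the hits."""
    hits = []
    for catalog in catalogs:
        hits.extend(catalog.search(term))
    # Sort by domain, then dataset, so results are stable across runs.
    return sorted(hits, key=lambda h: (h.domain, h.dataset))
```

Note what the sketch leaves out: nothing here reconciles differing naming conventions or classification schemes between domains, which is exactly the consistency cost described above.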
When to Choose Each:
The decision depends on organizational structure and data maturity.
Choose centralized when you have fewer than roughly 20 to 30 data-producing teams, strong regulatory requirements demanding uniform governance, or a centralized data platform team with the capacity to scale operations. Financial services and healthcare companies often go this route because consistent compliance enforcement is critical.
Choose federated when you have 50 or more autonomous product teams, domain teams with strong data engineering capability, or a culture that prioritizes speed over consistency. Tech companies with mature Data Mesh implementations, where domains already own production services end-to-end, often extend this to metadata ownership.
Hybrid Reality:
Many large organizations adopt a hybrid: a strong central catalog with domain-level curation rights. The central platform owns the infrastructure, schema, and core policies. Domains have delegated authority to document, classify, and deprecate their own datasets without central approval. Global policies like PII detection run centrally, but domains can add custom business metadata.
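One way to picture the hybrid split is as a field-level ownership rule on each catalog entry. The Python sketch below is a rough illustration under stated assumptions: the field names in `CENTRAL_FIELDS` and `DOMAIN_FIELDS` and the function `apply_update` are hypothetical, and where the line falls in practice is an organizational decision.

```python
# Assumed split of fields between the central platform and the domains.
CENTRAL_FIELDS = {"schema", "pii_detected", "access_policy"}
DOMAIN_FIELDS = {"description", "classification", "deprecated", "business_metadata"}


def apply_update(entry, actor_domain, updates):
    """Apply a domain's edit to one of its own entries.

    Rejects edits to other domains' datasets and to centrally owned fields,
    and returns a new dict rather than mutating the original entry.
    """
    if entry["domain"] != actor_domain:
        raise PermissionError("domains may only curate their own datasets")
    blocked = set(updates) & CENTRAL_FIELDS
    if blocked:
        raise PermissionError(f"centrally owned fields: {sorted(blocked)}")
    unknown = set(updates) - DOMAIN_FIELDS
    if unknown:
        raise ValueError(f"unknown fields: {sorted(unknown)}")
    return {**entry, **updates}
```

Under this rule the payments domain can update its own description or mark a dataset deprecated without a central review, while an attempt to change `pii_detected` is rejected because that value is produced by the centrally run policy.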
Centralized Catalog: single source of truth, consistent policies, potential bottleneck
vs
Federated Catalogs: domain autonomy, harder global consistency, scalable ownership
"The right answer is not purely centralized or purely federated. It is where you place the control points: centralize what must be consistent (security, lineage), federate what benefits from local knowledge (documentation, business context)."
💡 Key Takeaways
✓ Centralized catalogs ensure consistent governance and simplify search, but the central team can become an operational bottleneck once 50 or more data-producing teams depend on it
✓ Federated catalogs following Data Mesh principles scale operationally by giving domains autonomy, but make global consistency and cross-domain lineage harder
✓ Choose centralized for fewer than roughly 20 to 30 teams with strong compliance needs; choose federated for 50 or more autonomous teams with mature data engineering
✓ Hybrid models centralize what must be consistent, like security policies and lineage tracking, while federating local knowledge like documentation and business context
📌 Examples
1. A financial services company with 25 data teams uses a centralized catalog to enforce the uniform PII classification required by regulators, with a central team approving all new data source integrations.
2. A tech company with 80 product teams uses federated catalogs where each domain manages its own metadata, while a global search layer aggregates all domains and enforces only critical security policies centrally.