What is Data Federation?

Definition
Data Federation is a virtual data layer that allows you to query multiple physically separate data sources as if they were one unified database, without copying or moving the data first.
The Problem: Large organizations have data scattered across dozens or hundreds of systems. You might have customer orders in a relational database, product catalogs in MongoDB, support tickets in Salesforce, clickstream events in S3, and financial data in an on-premises warehouse. Business analysts want to join this data for reports, but building Extract, Transform, Load (ETL) pipelines for every combination is expensive and slow.

How Federation Solves This: Instead of physically moving data into a central warehouse first, federation provides a query interface that looks like one database but actually reaches out to multiple sources at query time. When you run a query, the federation engine breaks it into pieces, sends subqueries to each relevant system, and stitches the results back together.
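The decompose, dispatch, and stitch flow described above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular engine is implemented: two in-memory SQLite databases stand in for physically separate systems, and the table layouts, sample rows, and the function name federated_customer_orders are all assumptions made for the example.

```python
import sqlite3

# Two independent in-memory databases stand in for separate source systems.
# Schemas and rows are hypothetical sample data.
crm = sqlite3.connect(":memory:")
crm.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
crm.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Ada"), (2, "Grace")])

sales = sqlite3.connect(":memory:")
sales.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)"
)
sales.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(10, 1, 25.0), (11, 1, 40.0), (12, 2, 15.0)],
)


def federated_customer_orders():
    """Answer a customers-to-orders join without copying either table."""
    # Step 1: send one subquery to each source system.
    customers = crm.execute("SELECT id, name FROM customers").fetchall()
    orders = sales.execute(
        "SELECT customer_id, total FROM orders ORDER BY id"
    ).fetchall()

    # Step 2: stitch the partial results together with an in-memory hash join.
    name_by_id = dict(customers)
    return [(name_by_id[cid], total) for cid, total in orders if cid in name_by_id]


print(federated_customer_orders())
```

Neither database knows the other exists. The join happens only inside the function, which plays the role of the federation engine here.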

Think of it like a universal remote control. Instead of having separate remotes for your TV, sound system, and streaming device, you have one interface that coordinates all of them. The devices stay separate, but you control them through a unified interface.

Key Components: A federation system needs several parts. First, a federation engine that accepts queries and coordinates execution. Second, connectors for each data source that know how to communicate with different systems (SQL databases, REST APIs, cloud storage). Third, a metadata catalog that maps the unified schema to actual source schemas. Fourth, a query optimizer that decides how to execute efficiently. Finally, a security layer that enforces access controls.
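To make the relationship between the catalog and the connectors concrete, here is a small Python sketch under stated assumptions. The connector names (crm_api, sales_db), the source object names (Account, public.orders), and the column mappings are hypothetical, and the query optimizer and security layer are deliberately left out.

```python
from dataclasses import dataclass
from typing import Dict, List

Row = Dict[str, object]


@dataclass
class TableMapping:
    connector: str           # which connector serves this logical table
    source_table: str        # object name inside the source system
    columns: Dict[str, str]  # logical column name -> source column name


# Hypothetical source data, keyed by connector and then by source object name.
SOURCES = {
    "crm_api": {"Account": [{"Id": 1, "Name": "Ada"}, {"Id": 2, "Name": "Grace"}]},
    "sales_db": {"public.orders": [{"order_id": 10, "cust": 1, "amount": 25.0}]},
}


def make_connector(name: str):
    """Build a connector that hides how one source is reached."""
    def fetch(source_table: str, source_columns: List[str]) -> List[Row]:
        records = SOURCES[name][source_table]
        return [{col: rec[col] for col in source_columns} for rec in records]
    return fetch


CONNECTORS = {name: make_connector(name) for name in SOURCES}

# Metadata catalog: maps the unified schema onto each source's own schema.
CATALOG = {
    "customers": TableMapping("crm_api", "Account", {"id": "Id", "name": "Name"}),
    "orders": TableMapping(
        "sales_db",
        "public.orders",
        {"id": "order_id", "customer_id": "cust", "total": "amount"},
    ),
}


def scan(logical_table: str, logical_columns: List[str]) -> List[Row]:
    """Engine entry point: resolve logical names, fetch, rename back."""
    mapping = CATALOG[logical_table]
    source_columns = [mapping.columns[col] for col in logical_columns]
    rows = CONNECTORS[mapping.connector](mapping.source_table, source_columns)
    return [
        {logical: row[source] for logical, source in zip(logical_columns, source_columns)}
        for row in rows
    ]


print(scan("orders", ["customer_id", "total"]))
```

Callers only ever see the logical names (customers, orders, customer_id). Renaming a column in a source system would then require a catalog change rather than a change to every query.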

✓ In Practice: When an analyst writes SELECT * FROM customers JOIN orders, the federation engine might route the customers query to Salesforce via REST API and the orders query to PostgreSQL via SQL, then join the results locally before returning them.

The value proposition is simple: access fresh data from multiple sources without building and maintaining complex ETL pipelines. Results are as current as the sources themselves, because every query reads each system's state at execution time rather than a copy loaded by an earlier batch job.

💡 Key Takeaways

✓ Federation provides a virtual unified view over physically separate data sources without copying data

✓ Queries are decomposed into subqueries that execute against each source system at runtime

✓ Components include a federation engine, source connectors, a metadata catalog, a query optimizer, and a security layer

✓ Data stays fresh because you always query the current state of each system, avoiding ETL latency

✓ The trade-off: you gain operational simplicity (no pipelines to maintain) but take on runtime dependencies on multiple upstream systems

📌 Interview Tips

1. Amazon Athena federated queries allow joining S3 data with RDS databases and SaaS systems through a single SQL interface

2. Presto, created at Meta, and its fork Trino provide connectors for HDFS, object storage, and operational databases, enabling cross-system analytics

3. Google BigQuery Omni lets you query data held in other clouds, such as Amazon S3 and Azure Blob Storage, without moving it first
