Data Federation Patterns

Federation at Production Scale

Real-World Architecture: Consider a global retailer with transactional orders in sharded MySQL databases (1,200 shards across 3 regions), a product catalog in MongoDB (40 TB), customer data in Salesforce (150 million records), clickstream events in S3 (50 PB), and finance data in an on-premises Teradata warehouse (500 TB). Business users need unified Customer 360 views and cross-domain analytics. A federation layer provides a single SQL interface over all these sources. When an analyst queries active customers with recent purchases and support tickets, the engine identifies that customers map to Salesforce, orders to MySQL, and tickets to a support system. It generates optimized subplans with filters pushed down to each source, executes them in parallel, and joins the results locally.

Scale Characteristics: At FAANG scale, you might support 50 to 200 concurrent Business Intelligence (BI) users, each triggering 1 to 5 queries per minute during business hours. Peak load reaches 15 to 20 queries per second. Most queries touch 2 to 4 source systems. Federation remains viable when most subqueries complete in 200 to 500 milliseconds and the network is optimized (same region or fast cross-region links).
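The plan-then-join flow described above can be sketched in a few lines. This is an illustrative toy, not a real federation engine: the source map, the stand-in connectors, and the `federated_join` helper are all assumptions made for the example; a real engine would ship the pushed-down subqueries to actual connectors.

```python
from concurrent.futures import ThreadPoolExecutor

# Which backing system owns each logical table (hypothetical mapping).
SOURCE_MAP = {"customers": "salesforce", "orders": "mysql", "tickets": "support_db"}

# Stand-in connectors returning rows as if each source had already
# applied the pushed-down predicate (e.g. status = 'active').
FAKE_SOURCES = {
    "customers": [{"customer_id": 1, "status": "active"},
                  {"customer_id": 2, "status": "active"}],
    "orders":    [{"customer_id": 1, "total": 120.0}],
    "tickets":   [{"customer_id": 1, "open": True},
                  {"customer_id": 2, "open": True}],
}

def subplan(table, predicate):
    """Generate the subquery shipped to the owning source (filter pushdown)."""
    return f"SELECT * FROM {table} WHERE {predicate}  -- runs on {SOURCE_MAP[table]}"

def fetch(table):
    return table, FAKE_SOURCES[table]

def federated_join(tables, key):
    """Execute subplans in parallel, then hash-join the results locally."""
    with ThreadPoolExecutor() as pool:
        results = dict(pool.map(fetch, tables))
    # One hash table per source, keyed on the join column; intersect the keys.
    keyed = [{row[key]: row for row in results[t]} for t in tables]
    common = set(keyed[0]).intersection(*keyed[1:])
    return [{col: v for t in keyed for col, v in t[k].items()} for k in sorted(common)]

rows = federated_join(["customers", "orders", "tickets"], "customer_id")
# Only customer 1 appears in all three sources, so the join returns one row.
```

The key point the sketch shows is that filtering happens at each source (cheap, parallel) while only the join of the already-reduced result sets happens in the federation layer.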
Production Query Workload: 15-20 QPS peak load · 2-4 sources per query · 200-500 ms subquery time
Hybrid Pattern in Practice: Pure federation for everything rarely works at scale. The winning pattern is hybrid: hot aggregates and high-traffic queries run against a central warehouse, while federation handles long-tail and exploratory cross-system queries that are not worth building pipelines for. Amazon Athena federated queries demonstrate this. Core analytics run on data in S3 and Redshift (the warehouse), but when analysts need to join S3 data with live RDS records or Salesforce data, federation fills the gap without a full ETL pipeline. This covers maybe 5 to 15 percent of query volume but 40 to 60 percent of distinct query patterns.

Cost and Operational Impact: Federation cuts integration costs by 40 to 80 percent for long-tail use cases. Onboarding a new source takes hours instead of weeks of ETL development. You avoid copying terabytes daily and reduce load on operational systems that would otherwise serve as ETL sources. However, heavy analytical workloads (daily aggregates on 50 to 100 TB) are cheaper and faster in a warehouse with columnar storage, clustering, and dedicated compute. A 500 GB aggregate that scans 50 TB in a warehouse might cost $15 and finish in 3 minutes; the same query via federation, pulling data from 8 sources, might take 20 to 40 minutes and cost $45 in compute and network egress.

Governance and Protection: Federation respects source-level permissions, applying row filters and column masking, and logs all queries for compliance. To protect operational systems, federation engines limit per-source concurrency (perhaps 5 to 10 concurrent queries to a production database) and use caching or materialized views for expensive patterns. Without these guards, analysts could overwhelm the homepage database with ad hoc joins.
💡 Key Takeaways
Hybrid pattern wins at scale: warehouse for hot aggregates, federation for long tail cross system queries (5 to 15 percent of volume, 40 to 60 percent of patterns)
Federation cuts integration costs by 40 to 80 percent for exploratory queries by avoiding ETL pipeline development
Per source concurrency limits (5 to 10 queries) and caching protect operational systems from analytical load
Heavy analytical workloads (50 to 100 TB scans) remain cheaper and faster in a warehouse: 3 minutes and $15 vs 30 minutes and $45 via federation
Production scale: 15 to 20 queries per second peak, touching 2 to 4 sources per query, targeting p95 under 3 to 5 seconds
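The warehouse-versus-federation trade-off in the takeaways can be made concrete with back-of-envelope arithmetic from the numbers in this section (these are the section's illustrative figures, not measured benchmarks):

```python
# 50 TB aggregate scan: columnar warehouse vs federation across 8 sources.
warehouse = {"cost_usd": 15, "minutes": 3}
federation = {"cost_usd": 45, "minutes": 30}  # midpoint of the 20-40 min range

cost_ratio = federation["cost_usd"] / warehouse["cost_usd"]
time_ratio = federation["minutes"] / warehouse["minutes"]
print(f"federation is {cost_ratio:.0f}x the cost and {time_ratio:.0f}x the latency")
# -> federation is 3x the cost and 10x the latency
```

This is why the hybrid pattern routes heavy, repeated aggregates to the warehouse and reserves federation for the long tail.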
📌 Examples
1. Amazon Athena uses federation to join S3 data with RDS and Salesforce, handling 5 to 15 percent of query volume but 40 to 60 percent of distinct patterns
2. A global retailer federates across 1,200 MySQL shards, a 40 TB MongoDB catalog, 150 million Salesforce records, 50 PB of S3 events, and a 500 TB Teradata warehouse
3. Meta uses Presto/Trino to federate HDFS, object storage, and operational stores for cross-system analytics not served by the central warehouse