How Federation Engines Execute Queries
The Execution Pipeline: When a federation engine receives a query, it goes through several sophisticated steps that determine whether your query returns in 500 milliseconds or times out after 30 seconds.
1. Parse and Validate: The engine parses your SQL against a unified schema stored in its metadata catalog. It validates that tables and columns exist and that you have permission to access them.
2. Generate Logical Plan: Creates an abstract representation of operations (scan, filter, join, aggregate) without considering physical sources yet.
3. Map to Sources: Determines which parts of the query target which physical systems. A join between customers and orders might map to Salesforce and PostgreSQL respectively.
4. Optimize with Pushdown: Applies predicate pushdown to send filters to source systems. Instead of pulling 10 million customer rows then filtering, send WHERE region='US' to Salesforce so only 2 million rows return.
5. Execute in Parallel: Issues subqueries concurrently, streams partial results back, performs joins and aggregations locally, then returns the final result.
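To make steps 3 through 5 concrete, here is a minimal Python sketch of the tail of the pipeline: plan fragments mapped to sources, each carrying its pushed-down filter, executed concurrently and joined locally. The source names, queries, and the run_subquery connector stub are illustrative assumptions, not any particular engine's API.

```python
from concurrent.futures import ThreadPoolExecutor

# Plan fragments after optimization: the SQL the engine will push to each
# source system. Source names and queries follow the running example.
plan_fragments = {
    "salesforce": "SELECT id, region FROM customers WHERE region = 'US'",
    "postgres": (
        "SELECT customer_id, total FROM orders "
        "WHERE order_date >= current_date - interval '30 days' AND total > 1000"
    ),
}

def run_subquery(source: str, sql: str) -> list[dict]:
    # Stand-in for the per-source connector (JDBC driver, REST API, etc.).
    # Returns canned rows here so the sketch runs end to end.
    canned = {
        "salesforce": [{"id": 1, "region": "US"}, {"id": 2, "region": "US"}],
        "postgres": [{"customer_id": 1, "total": 1500.0}],
    }
    return canned[source]

def execute_federated(fragments: dict[str, str]) -> list[dict]:
    # Step 5: issue the subqueries concurrently rather than one after another.
    with ThreadPoolExecutor(max_workers=len(fragments)) as pool:
        futures = {src: pool.submit(run_subquery, src, sql)
                   for src, sql in fragments.items()}
        partial = {src: fut.result() for src, fut in futures.items()}

    # Join and aggregate locally on the already-filtered partial results.
    orders_by_customer: dict[int, list[dict]] = {}
    for row in partial["postgres"]:
        orders_by_customer.setdefault(row["customer_id"], []).append(row)
    return [
        {"customer": cust, "orders": orders_by_customer.get(cust["id"], [])}
        for cust in partial["salesforce"]
    ]

print(execute_federated(plan_fragments))
```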
The Critical Optimization: Predicate Pushdown
This technique makes or breaks federation performance. Consider querying customer orders from the last 30 days where total > 1000. Without pushdown, you pull all 50 million orders across the network (maybe 500 GB), then filter locally. With pushdown, you send the filter criteria to the source database, which applies its indexes and returns only the 200,000 matching rows (about 2 GB).
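In SQL terms, pushdown simply changes what the engine asks the source for. A short sketch of the two subqueries, using the running example's table and column names (the exact syntax will vary by source):

```python
# The subquery sent to the orders database, with and without pushdown.
# Row and byte counts are the article's estimates, not measurements.

without_pushdown = "SELECT customer_id, total, order_date FROM orders"
# ~50 million rows (~500 GB) cross the network and are filtered locally.

with_pushdown = (
    "SELECT customer_id, total, order_date FROM orders "
    "WHERE order_date >= current_date - interval '30 days' AND total > 1000"
)
# The source applies its own indexes and returns ~200,000 rows (~2 GB).
```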
Network Transfer Impact: without pushdown, about 500 GB crosses the network; with pushdown, about 2 GB.
Join Strategies Matter: When joining data from two sources, the engine must choose a strategy. In a broadcast join, if one side is small (say, under 10,000 rows), the engine broadcasts that table to where the large table lives. For large-to-large joins across systems, performance degrades sharply because massive amounts of data must move. Smart engines detect this pattern and either refuse the query, require hints, or suggest materializing one side in a shared store first.
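A broadcast hash join is easy to sketch: pull the small side in full, hash it, and stream the large side past it locally, so the large table never crosses systems in bulk. The function and toy data below are illustrative, not a specific engine's implementation.

```python
from typing import Iterable, Iterator

def broadcast_hash_join(
    small_rows: list[dict],      # e.g. a 5,000-row product catalog, fetched in full
    large_rows: Iterable[dict],  # e.g. 2 million order lines, streamed from the source
    small_key: str,
    large_key: str,
) -> Iterator[dict]:
    """Broadcast join sketch: hash the small side, stream the large side past it."""
    # Build an in-memory hash table of the small (broadcast) side.
    by_key: dict = {}
    for row in small_rows:
        by_key.setdefault(row[small_key], []).append(row)

    # Stream the large side; each row is matched locally, so the large table
    # is never shipped across systems in bulk.
    for big in large_rows:
        for small in by_key.get(big[large_key], []):
            yield {**small, **big}

# Usage with toy data standing in for the catalog/order-lines example:
catalog = [{"product_id": 1, "name": "widget"}, {"product_id": 2, "name": "gadget"}]
order_lines = [{"product_id": 1, "qty": 3}, {"product_id": 2, "qty": 1}]
print(list(broadcast_hash_join(catalog, order_lines, "product_id", "product_id")))
```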
Real-World Performance: At moderate scale with 50 to 200 concurrent users, federation systems target p50 latency under 1 second and p95 under 3 to 5 seconds for interactive analytics. This requires most subqueries to finish in 200 to 500 milliseconds and network round trips to stay within the same region. Cross-region federation adds 100 to 200 milliseconds of latency per hop.
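Those targets compose into a rough per-query budget. A back-of-the-envelope check using the article's figures (the local merge cost is an assumed value for illustration):

```python
# Rough p95 budget for a federated query, using the article's numbers.
slowest_subquery_s = 0.5   # subqueries expected to finish in 200-500 ms
local_merge_s = 0.3        # assumed cost of the local join/aggregation (illustrative)
cross_region_hops = 2      # number of cross-region hops in the plan (illustrative)
per_hop_s = 0.15           # cross-region federation adds 100-200 ms per hop

estimated_p95 = slowest_subquery_s + local_merge_s + cross_region_hops * per_hop_s
print(f"estimated p95 ~ {estimated_p95:.2f} s (interactive target: 3-5 s)")
```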
💡 Key Takeaways
✓ Predicate pushdown reduces network transfer by 100x to 250x by filtering at the source instead of pulling all data first
✓ Parallel subquery execution is critical: queries to 3 sources run concurrently, not sequentially
✓ Join strategies differ: broadcast small tables (under 10,000 rows), avoid large-to-large cross-system joins
✓ Target p50 under 1 second and p95 under 3 to 5 seconds for interactive analytics with 50 to 200 concurrent users
✓ The query optimizer decides which operations execute where, aiming to minimize data movement between systems
📌 Examples
1. A query joining 10 million CRM customers with 50 million database orders: pushdown filters reduce the CRM transfer from 800 MB to 60 MB and the database transfer from 5 GB to 200 MB
2. Broadcast join: a 5,000-row product catalog is broadcast to where 2 million order lines live, avoiding moving the orders across the network
3. Cross-region federation between US East and EU West adds 150 ms of latency per hop, making multi-hop queries exceed 5-second p95 targets