
How Data Virtualization Works: Query Execution

The Core Mechanism: When a user runs a query against a virtual table, the virtualization engine performs a complex orchestration. It starts by consulting its metadata catalog, which maps virtual schemas to physical sources. This catalog records where each attribute lives, the data types in each source, and statistics such as row counts and query latency. The engine then generates an execution plan using cost-based optimization: it analyzes the query to determine which parts can be pushed down to source systems and which must be computed in the virtualization layer. For example, a filter such as customer_id = 12345 is pushed to each source to minimize data transfer, while an aggregation like SUM(order_total) might be pushed to a warehouse but computed locally for a slow API.
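The pushdown decision described above can be sketched in a few lines. This is a minimal illustration, not any vendor's optimizer: the statistics fields and thresholds (supports_sql, p95_latency_ms, the 1000 ms cutoff) are invented stand-ins for what a metadata catalog might hold.

```python
# Sketch of a cost-based pushdown decision using hypothetical
# per-source statistics from a metadata catalog.

def should_push_down(op: str, source: dict) -> bool:
    """Decide whether to push an operation to the source or compute locally."""
    if op == "filter":
        return True  # filters almost always reduce transferred rows
    if op == "aggregate":
        # Push aggregation only to fast, SQL-capable sources (e.g. a warehouse);
        # compute locally for slow APIs that return raw records.
        return source["supports_sql"] and source["p95_latency_ms"] < 1000
    return False

warehouse = {"name": "snowflake", "supports_sql": True, "p95_latency_ms": 400}
crm_api = {"name": "crm", "supports_sql": False, "p95_latency_ms": 600}

print(should_push_down("aggregate", warehouse))  # True  -> SUM runs in the warehouse
print(should_push_down("aggregate", crm_api))    # False -> SUM computed in the engine
```

A real planner would weigh estimated row counts and network cost as well, but the shape of the decision is the same: push work to the source whenever it shrinks what crosses the network.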
1. Parse and Plan: The engine parses the SQL query, consults metadata to understand which sources can serve each attribute, and generates an optimized execution plan with parallel subqueries.
2. Predicate Pushdown: Filters, projections, and aggregations are pushed to source systems to reduce data transfer. A filter on a high-cardinality column can shrink result sets by 10x to 100x before network transmission.
3. Parallel Execution: Subqueries execute concurrently across sources. Platforms may issue 10 to 50 parallel requests per large source to maximize throughput while respecting rate limits.
4. Join and Finalize: Partial results stream back to the engine, which performs in-memory joins, applies transformations, and enforces security policies before returning the final result set.
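Steps 3 and 4 can be sketched with a thread pool standing in for concurrent subqueries. The source names and hard-coded rows below are invented for illustration; in a real engine, fetch_orders would issue a pushed-down subquery over the network.

```python
# Minimal sketch of steps 3-4: parallel subqueries per source,
# then an in-memory aggregation in the virtualization engine.
from concurrent.futures import ThreadPoolExecutor

def fetch_orders(source: str) -> list[dict]:
    # Stand-in for a pushed-down subquery against one source.
    fake = {
        "warehouse": [{"region": "EU", "revenue": 120}],
        "pg_east": [{"region": "US", "revenue": 45}],
        "pg_west": [{"region": "US", "revenue": 30}],
    }
    return fake[source]

sources = ["warehouse", "pg_east", "pg_west"]
with ThreadPoolExecutor(max_workers=len(sources)) as pool:
    partials = list(pool.map(fetch_orders, sources))  # step 3: concurrent fetch

# Step 4: combine partial results in memory before returning them.
totals: dict[str, int] = {}
for rows in partials:
    for row in rows:
        totals[row["region"]] = totals.get(row["region"], 0) + row["revenue"]

print(totals)  # {'EU': 120, 'US': 75}
```

Because the fetches run concurrently, total wait time is governed by the slowest source rather than the sum of all source latencies.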
Real-World Performance: Consider a dashboard querying last week's revenue by region and marketing channel. The query touches a Snowflake warehouse with 500 TB of historical events, a CRM for campaign metadata, and three regional PostgreSQL databases holding recent orders not yet loaded to the warehouse.
Typical Query Execution:
Per-source p95: 300-800 ms
End-to-end p95: under 2 sec
The engine issues parallel subqueries with pushed-down filters. Snowflake returns aggregated revenue in 400ms. The CRM API responds with campaign data in 600ms. The PostgreSQL instances each return recent orders in 200ms to 300ms. The engine joins the results and finalizes in under 100ms, achieving sub-2-second total latency for the dashboard.
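A quick back-of-the-envelope check of those numbers: because the subqueries run in parallel, end-to-end latency is roughly the slowest source plus local join time, not the sum of all sources. The latency figures below are the ones quoted above.

```python
# With parallel subqueries, end-to-end latency ~= max(source latencies) + join time.
source_latency_ms = {
    "snowflake": 400,   # aggregated revenue
    "crm_api": 600,     # campaign metadata
    "pg_region_1": 300, # recent orders
    "pg_region_2": 250,
    "pg_region_3": 200,
}
join_ms = 100  # in-memory join and finalization

end_to_end_ms = max(source_latency_ms.values()) + join_ms
print(end_to_end_ms)  # 700 -> comfortably under the 2-second target
```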
💡 Key Takeaways
Execution involves cost-based optimization that decides which operations to push to sources versus compute locally, based on statistics like cardinality and latency
Predicate pushdown is critical for performance, reducing data transfer by 10x to 100x by filtering at the source before network transmission
Parallel execution across sources is essential, with platforms issuing 10 to 50 concurrent subqueries per large source to maximize throughput
Typical production systems target per-source p95 latencies of 300 to 800ms and end-to-end query latencies under 2 seconds for dashboard use cases
📌 Examples
1. A query for customer lifetime value joins a data warehouse (historical purchases), Salesforce (customer segments), and PostgreSQL (recent orders). The engine pushes date filters to all sources, reducing the warehouse scan from 500 TB to 5 GB, fetches 10,000 customer records from Salesforce, and retrieves 50,000 recent orders from PostgreSQL, joining everything in memory in under 1.5 seconds.
2. Platforms like Denodo use query caching to improve repeat dashboard performance. After a first execution taking 2 seconds, subsequent identical queries within a 5-minute time-to-live window return results in under 100ms from cache, reducing load on backend systems.
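The TTL caching pattern from the second example can be sketched as follows. This is a generic illustration, not Denodo's actual caching API; the cache key (raw SQL text) and the 5-minute TTL mirror the example above but are assumptions.

```python
# Sketch of a time-to-live (TTL) query cache for repeat dashboard queries.
import time

class QueryCache:
    def __init__(self, ttl_seconds: float = 300):  # 5-minute TTL
        self.ttl = ttl_seconds
        self._store: dict[str, tuple[float, object]] = {}

    def get(self, sql: str):
        entry = self._store.get(sql)
        if entry and time.monotonic() - entry[0] < self.ttl:
            return entry[1]  # fresh hit: skip the backend sources entirely
        return None          # miss or expired: caller must run the federated query

    def put(self, sql: str, result: object) -> None:
        self._store[sql] = (time.monotonic(), result)

cache = QueryCache(ttl_seconds=300)
sql = "SELECT region, SUM(revenue) FROM orders GROUP BY region"
if cache.get(sql) is None:
    result = {"EU": 120, "US": 75}  # stand-in for the 2-second federated query
    cache.put(sql, result)
print(cache.get(sql))  # cached result returned without touching the sources
```

Keying on the exact SQL text means only identical repeat queries hit the cache, which matches the "subsequent identical queries" behavior described in the example.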