Data Processing Patterns • Data Warehousing Architecture
What is Data Warehouse Architecture?
A data warehouse is an analytical data store built specifically for high-throughput scans, wide joins, and aggregations over large volumes of historical data. Unlike transactional databases, which are optimized for single-row lookups and writes, warehouses are designed to answer business questions that scan millions or billions of rows in seconds.
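To make the contrast concrete, the two query shapes below illustrate the point lookup a transactional database is built for versus the wide aggregation a warehouse is built for. The table and column names are hypothetical, and the SQL dialect is generic:

```python
# Hypothetical queries illustrating the workload split (table and column names are made up).

# OLTP: fetch one order by primary key -- touches a handful of rows and every column of them.
oltp_lookup = """
SELECT * FROM orders WHERE order_id = 48291;
"""

# OLAP: revenue by month across years of history -- scans millions of rows,
# but reads only the two columns the aggregation actually needs.
olap_aggregation = """
SELECT DATE_TRUNC('month', order_ts) AS month,
       SUM(order_total)              AS revenue
FROM   orders
WHERE  order_ts >= DATE '2022-01-01'
GROUP  BY 1
ORDER  BY 1;
"""
```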
The canonical three-tier architecture separates concerns cleanly. The bottom tier handles storage and ingestion, where Extract Transform Load (ETL) or Extract Load Transform (ELT) pipelines bring data in. The middle tier provides the analytical serving layer, typically a Massively Parallel Processing (MPP) engine or Online Analytical Processing (OLAP) cube that executes queries across many nodes simultaneously. The top tier is where business users consume data through Business Intelligence (BI) tools, dashboards, and data science notebooks.
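As a rough sketch of how the bottom tier's ELT flow hangs together, the stub below stands in for the three steps. Every function, table, and field name is an illustrative placeholder, not any vendor's API:

```python
# Minimal ELT sketch for the bottom tier; all names are hypothetical placeholders.

def extract(source_rows):
    """Extract: pull raw records from an operational source (here, an in-memory stand-in)."""
    return list(source_rows)

def load(raw_rows, staging):
    """Load: land raw records untouched in warehouse staging, keeping full fidelity."""
    staging.extend(raw_rows)

def transform(staging):
    """Transform: set-based logic that would normally run as SQL inside the warehouse engine.
    Aggregating in Python here just shows where the step sits in the flow."""
    totals = {}
    for row in staging:
        totals[row["region"]] = totals.get(row["region"], 0.0) + float(row["amount"])
    return totals

source = [{"region": "EU", "amount": "10.5"}, {"region": "US", "amount": "4.0"},
          {"region": "EU", "amount": "2.5"}]
staging_table = []
load(extract(source), staging_table)   # E and L happen before T: that ordering is what makes it ELT
print(transform(staging_table))        # {'EU': 13.0, 'US': 4.0}
```

The key design point is that in ELT the transform runs inside the warehouse engine after loading, which lets the MPP middle tier do the heavy lifting.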
Modern warehouses often use medallion layering to organize data quality and transformation stages. The bronze layer holds raw, immutable data exactly as ingested, with full provenance metadata. The silver layer contains cleaned, conformed data with Slowly Changing Dimensions (SCDs) and standardized business keys. The gold layer provides business-ready marts with pre-aggregated metrics and optimized layouts. For example, Google BigQuery commonly scans 50 to 200 GB per dashboard refresh, completing in 2 to 15 seconds when tables are well partitioned and clustered.
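The medallion idea is easiest to see in a toy pipeline. The sketch below, with made-up column names and rules, pushes raw bronze records through silver cleaning and deduplication into a gold aggregate:

```python
# Toy medallion flow; schema and cleaning rules are illustrative assumptions.

bronze = [  # raw, immutable, exactly as ingested (note the duplicate and the bad record)
    {"txn_id": "t1", "customer": " alice ", "amount": "19.99", "ingested_at": "2024-05-01T08:00:00Z"},
    {"txn_id": "t1", "customer": " alice ", "amount": "19.99", "ingested_at": "2024-05-01T08:00:05Z"},
    {"txn_id": "t2", "customer": "BOB",     "amount": "bad",   "ingested_at": "2024-05-01T08:01:00Z"},
    {"txn_id": "t3", "customer": "BOB",     "amount": "5.00",  "ingested_at": "2024-05-01T08:02:00Z"},
]

def to_silver(rows):
    """Clean and conform: validate types, standardize keys, deduplicate on the business key."""
    seen, silver = set(), []
    for r in rows:
        try:
            amount = float(r["amount"])
        except ValueError:
            continue                      # a real pipeline would quarantine bad records
        if r["txn_id"] in seen:
            continue                      # deduplicate by source transaction ID
        seen.add(r["txn_id"])
        silver.append({"txn_id": r["txn_id"],
                       "customer": r["customer"].strip().lower(),
                       "amount": amount})
    return silver

def to_gold(rows):
    """Pre-aggregate into a business-ready metric: revenue per customer."""
    revenue = {}
    for r in rows:
        revenue[r["customer"]] = revenue.get(r["customer"], 0.0) + r["amount"]
    return revenue

print(to_gold(to_silver(bronze)))  # {'alice': 19.99, 'bob': 5.0}
```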
Performance comes from columnar storage that reads only the needed columns, aggressive compression often yielding 5 to 10 times size reduction, partitioning that prunes irrelevant data blocks, and vectorized execution that processes batches of thousands of values at a time using SIMD instructions. Amazon Redshift clusters scan data at aggregate rates of tens to hundreds of GB per second when sort keys and distribution strategies align with query patterns.
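A back-of-envelope estimator helps show how these effects compound. All of the numbers below are illustrative assumptions, not measured benchmarks:

```python
# Back-of-envelope scan estimate; every figure here is an assumption for illustration only.

raw_table_tb        = 100          # logical table size, TB
columns_read        = 4 / 50       # query touches 4 of 50 columns (columnar projection)
partition_fraction  = 30 / 365     # date partitioning prunes to the last 30 of 365 daily partitions
compression_ratio   = 7            # within the typical 5-10x range

bytes_scanned_tb = raw_table_tb * columns_read * partition_fraction / compression_ratio
cluster_scan_rate_gb_s = 100       # assumed aggregate scan rate of a well-configured MPP cluster

seconds = bytes_scanned_tb * 1000 / cluster_scan_rate_gb_s
print(f"~{bytes_scanned_tb * 1000:.0f} GB scanned, ~{seconds:.1f} s at {cluster_scan_rate_gb_s} GB/s")
# ~94 GB scanned, ~0.9 s at 100 GB/s
```

The point is that projection, pruning, and compression multiply: each alone gives a modest win, but together they shrink a 100 TB table to under 100 GB actually scanned.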
💡 Key Takeaways
• Three-tier architecture separates ingestion (bottom), analytical execution (middle), and consumption (top) for clear separation of concerns and independent scaling
• Columnar storage with compression reduces 1 petabyte of raw data to 100 to 300 terabytes stored, enabling faster scans by reading only required columns
• Massively Parallel Processing (MPP) distributes query execution across nodes, achieving scan rates of tens to hundreds of GB per second on well-configured clusters
• Medallion layers (bronze raw, silver conformed, gold business-ready) establish clear data quality gates and transformation responsibilities between pipeline stages
• Google BigQuery typically completes dashboard queries scanning 50 to 200 GB in 2 to 15 seconds with thousands of concurrent workers when partitioning is optimized
• Enterprise warehouses commonly handle 10,000 to 100,000 daily queries across 50 to 500 active users with tens to hundreds of terabytes in managed storage
📌 Examples
Amazon Redshift cluster with 10 nodes, each scanning 10 GB per second, completes a 500 GB analytical query in approximately 5 seconds with proper sort keys and distribution
Google BigQuery project storing 300 TB after 5x compression from 1.5 PB raw data, with tables partitioned by event date and clustered by customer ID for fast user cohort analysis
Bronze to silver transformation applying schema validation, deduplication by source transaction ID, and SCD Type 2 tracking for customer dimension changes with effective date ranges
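A minimal sketch of the SCD Type 2 pattern from the last example, assuming a simple customer dimension with hypothetical column names:

```python
# Minimal SCD Type 2 sketch for a customer dimension; column names are illustrative assumptions.
from datetime import date

dim_customer = [  # current dimension state: one open row (end_date=None) per business key
    {"customer_id": "c1", "city": "Berlin", "start_date": date(2023, 1, 1),
     "end_date": None, "is_current": True},
]

def apply_scd2(dim, business_key, changes, effective):
    """Close the current row for the key (if attributes changed) and open a new versioned row."""
    for row in dim:
        if row["customer_id"] == business_key and row["is_current"]:
            if all(row.get(k) == v for k, v in changes.items()):
                return dim                                            # no change, nothing to version
            row["end_date"], row["is_current"] = effective, False     # close the old version
            break
    new_row = {"customer_id": business_key, "start_date": effective,
               "end_date": None, "is_current": True}
    new_row.update(changes)                                           # apply the changed attributes
    dim.append(new_row)
    return dim

apply_scd2(dim_customer, "c1", {"city": "Munich"}, date(2024, 6, 1))
for row in dim_customer:
    print(row)
# Two rows for c1: Berlin valid 2023-01-01 to 2024-06-01, Munich current from 2024-06-01
```

Keeping both rows with effective date ranges is what lets downstream queries reconstruct the dimension as of any point in time.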