How Modern Warehouse Architecture Works at Scale
The End to End Flow:
At FAANG scale, you have hundreds of microservices each generating thousands of writes per second to their own OLTP databases, plus event streams emitting 100k to 1M events per second. The warehouse needs to ingest this data, transform it, and make it queryable within minutes, all without impacting production systems.
Change Data Capture (CDC) for OLTP Sources:
Running SELECT * FROM users every hour on a 100 million row table would crush your production database. Instead, CDC pipelines monitor the database transaction log and capture only inserts, updates, and deletes as they happen.
The pipeline publishes these changes to a message queue, then writes them to the data lake as partitioned files (typically partitioned by date and hour). End to end lag from database commit to data lake landing is typically under 1 minute. This gives you four nines durability in the lake without putting read load on OLTP systems.
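To make this concrete, here is a minimal sketch of the consumer side of such a pipeline: it drains Debezium-style change events from a Kafka topic and lands them in the lake as date/hour partitioned files. The topic name, paths, batch interval, and event format are illustrative assumptions, not a specific production setup.

```python
# Minimal sketch: drain CDC change events from a Kafka topic and land them in
# the lake as date/hour partitioned JSON files. Topic, paths, and the 30 second
# flush interval are illustrative assumptions.
import json
import time
from pathlib import Path

from kafka import KafkaConsumer  # pip install kafka-python

consumer = KafkaConsumer(
    "cdc.users",                           # hypothetical CDC topic for the users table
    bootstrap_servers="kafka:9092",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
    enable_auto_commit=False,
)

batch, last_flush = [], time.time()
for message in consumer:
    batch.append(message.value)            # each value is one insert/update/delete
    # Flush a small file roughly every 30 seconds to keep CDC lag under a minute.
    if batch and time.time() - last_flush > 30:
        now = time.gmtime()
        part_dir = Path(f"/lake/bronze/users/dt={time.strftime('%Y-%m-%d', now)}/hr={time.strftime('%H', now)}")
        part_dir.mkdir(parents=True, exist_ok=True)
        with open(part_dir / f"changes_{int(time.time())}.json", "w") as f:
            for change in batch:
                f.write(json.dumps(change) + "\n")
        consumer.commit()                   # commit offsets only after the file lands
        batch, last_flush = [], time.time()
```

Committing offsets only after the file is written is what keeps the pipeline from losing changes if a worker dies mid-flush.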
Medallion Transformation Pattern:
Raw data lands as bronze tables: immutable, append only files with exactly the schema from the source. A transformation layer reads bronze, applies cleaning and normalization, and writes silver tables with conformed schemas. Finally, it computes business metrics and aggregates into gold tables optimized for specific use cases.
At Uber scale, daily transformation jobs scan tens of petabytes yet still meet 1 to 3 hour Service Level Agreements (SLAs) by distributing work across thousands of compute cores. The key is that compute can scale horizontally: add more workers to process more data in the same time.
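A minimal PySpark sketch of that bronze to silver to gold flow is below. The paths, column names, and cleaning and aggregation rules are illustrative assumptions, not any specific company's pipeline.

```python
# Bronze -> silver -> gold in PySpark. Paths, columns, and rules are illustrative.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("medallion_demo").getOrCreate()

# Bronze: immutable, append-only files exactly as they arrived from CDC.
bronze = spark.read.json("/lake/bronze/users/")

# Silver: clean and conform -- cast types, drop duplicates, filter bad rows so
# downstream consumers see a consistent schema.
silver = (
    bronze
    .withColumn("updated_at", F.to_timestamp("updated_at"))
    .dropDuplicates(["user_id", "updated_at"])
    .filter(F.col("user_id").isNotNull())
)
silver.write.mode("overwrite").partitionBy("country").parquet("/lake/silver/users/")

# Gold: business-level aggregates optimized for a specific use case,
# e.g. daily signups per country for a dashboard.
gold = (
    silver
    .groupBy(F.to_date("created_at").alias("signup_date"), "country")
    .agg(F.count("user_id").alias("signups"))
)
gold.write.mode("overwrite").parquet("/lake/gold/daily_signups/")
```

Because each stage writes to its own storage location, bronze, silver, and gold can be scaled and reprocessed independently.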
Serving with Elastic Compute:
The serving layer runs a distributed SQL engine that reads from curated gold tables. When 500 analysts arrive at 9am and start querying, the system automatically spins up more compute nodes. When load drops in the evening, it scales back down. This elasticity is the breakthrough: you pay for compute only when you use it, and you can handle spikes without manual intervention.
Production dashboards run queries with 10 to 30 second timeouts. Interactive exploration targets latency from sub second to a few seconds by leveraging result caching, materialized views of common aggregates, and columnar storage that reads only the columns you need.
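The sketch below illustrates two of those dashboard-side behaviors, a hard query timeout and a short-lived result cache, in plain Python. The execute_on_warehouse function is a stand-in for whatever SQL client your warehouse exposes; the timeout, cache lifetime, and query are assumptions for illustration.

```python
# Dashboard-side sketch: a 30 second query timeout plus a TTL result cache so
# repeated interactive queries return without rescanning gold tables.
import time
from concurrent.futures import ThreadPoolExecutor

_pool = ThreadPoolExecutor(max_workers=4)     # shared pool for dashboard queries
_cache: dict[str, tuple[float, list]] = {}    # sql text -> (cached_at, rows)
CACHE_TTL_S = 300                             # serve cached aggregates for 5 minutes
QUERY_TIMEOUT_S = 30                          # production dashboards give up after 30 seconds

def execute_on_warehouse(sql: str) -> list:
    # Placeholder for a real warehouse client call (Trino, BigQuery, Snowflake, ...).
    time.sleep(0.1)
    return [("2024-01-01", "US", 1234)]

def dashboard_query(sql: str) -> list:
    cached = _cache.get(sql)
    if cached and time.time() - cached[0] < CACHE_TTL_S:
        return cached[1]                      # cache hit: no warehouse scan at all
    future = _pool.submit(execute_on_warehouse, sql)
    rows = future.result(timeout=QUERY_TIMEOUT_S)   # raises TimeoutError if the query overruns
    _cache[sql] = (time.time(), rows)
    return rows

print(dashboard_query("SELECT signup_date, country, signups FROM gold.daily_signups"))
```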
Streaming Ingestion for Events:
User clicks, application logs, and IoT telemetry arrive as event streams. A streaming ingestion layer consumes from message queues like Kafka, batches events into files every few minutes, and writes them to the lake partitioned by event time. At companies like Netflix, this might mean ingesting 50 TB daily from click streams alone.
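As a sketch, the same pattern in Spark Structured Streaming might look like the following, assuming the Kafka connector package is available. The topic name, event schema, and paths are hypothetical.

```python
# Consume a click-stream topic, micro-batch every few minutes, and land files in
# the lake partitioned by event time. Topic, schema, and paths are assumptions.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StringType, StructField, StructType, TimestampType

spark = SparkSession.builder.appName("clickstream_ingest").getOrCreate()

event_schema = StructType([
    StructField("user_id", StringType()),
    StructField("page", StringType()),
    StructField("event_time", TimestampType()),
])

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")
    .option("subscribe", "clickstream")                      # hypothetical topic name
    .load()
    .select(F.from_json(F.col("value").cast("string"), event_schema).alias("e"))
    .select("e.*")
    .withColumn("dt", F.to_date("event_time"))
    .withColumn("hr", F.hour("event_time"))
)

query = (
    events.writeStream
    .format("parquet")
    .option("path", "/lake/bronze/clickstream/")
    .option("checkpointLocation", "/lake/_checkpoints/clickstream/")
    .partitionBy("dt", "hr")                                 # partition by event time
    .trigger(processingTime="3 minutes")                     # batch every few minutes
    .start()
)
query.awaitTermination()
```

The checkpoint location is what lets the job restart after a failure without duplicating or dropping micro-batches.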
Ingestion Latency Targets:
CDC lag: under 1 minute. Event streams: 2 to 5 minutes end to end.
💡 Key Takeaways
✓ CDC pipelines capture only changed rows from OLTP databases by reading transaction logs, achieving under 1 minute lag without impacting production read/write performance.
✓ Medallion pattern organizes data as bronze (raw, immutable), silver (cleaned, conformed), and gold (business metrics), with each stage independently scalable.
✓ Streaming ingestion batches events every few minutes into partitioned files. At Netflix scale, this means 50 TB daily from click streams with 2 to 5 minute end to end latency.
✓ Elastic compute in the serving layer scales from 10 to 100+ nodes in minutes to handle query spikes, then scales down to save costs during low usage periods.
📌 Examples
1. A CDC pipeline on a 100 million row users table monitors the transaction log and publishes 5,000 changes per second to the data lake with 30 second lag, versus full table scans that would take 10+ minutes and lock production.
2. Daily transformation jobs at Uber scan tens of petabytes across thousands of cores and complete within 1 to 3 hours, meeting SLAs for morning dashboard refresh.