
Layered Architecture and Medallion Pattern

Why Layered Processing Matters

At scale, you cannot apply all transformations in a single pass. Raw events are messy: duplicates from retries, schema variations, late arrivals, malformed JSON. Business logic is complex: multi-way joins, slowly changing dimensions, aggregations across time windows. Running everything in one pipeline creates fragile code that is impossible to debug or evolve. The solution is layered incremental processing, often called the medallion architecture: bronze, silver, and gold layers. Each layer incrementally consumes from the previous one, applies specific transformations, and serves different use cases.

Bronze Layer: Raw Ingestion

Bronze captures the immutable truth: every event exactly as it arrived. A rideshare platform writes ride events, driver pings, and payment records to bronze with p99 latency under 1 second. Data is partitioned by arrival date and source system. This layer applies minimal transformation: validate JSON structure, add ingestion metadata such as ingestion_timestamp and source_offset, then write to cheap object storage in formats like Parquet or Avro. Retain everything for compliance and disaster recovery, even duplicates and malformed records. Typical scale: 10 billion events per day is roughly 50 TB daily in compressed Parquet. Bronze holds 90 to 365 days of history, totaling 4.5 PB to 18 PB.

Silver Layer: Cleaned and Joined

Silver applies business logic incrementally. Read new bronze partitions, deduplicate events using business keys (ride ID, transaction ID), join with dimension tables (user profile, driver status), and compute derived fields like trip distance or surge multiplier. This layer enforces schema, handles late-arriving data with time-based windows, and resolves conflicts. For CDC streams, silver merges insert, update, and delete events into slowly changing dimension tables using techniques like Type 2 dimensions with valid_from and valid_to timestamps.
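The deduplication step described above can be sketched in plain Python. This is a minimal last-writer-wins sketch with illustrative field names (ride_id, ingestion_timestamp); production pipelines typically express the same logic in Spark or SQL:

```python
def deduplicate(events, key="ride_id", ts="ingestion_timestamp"):
    """Keep only the latest record per business key (last-writer-wins)."""
    latest = {}
    for event in events:
        k = event[key]
        if k not in latest or event[ts] > latest[k][ts]:
            latest[k] = event
    return sorted(latest.values(), key=lambda e: e[ts])

bronze = [
    {"ride_id": "r1", "ingestion_timestamp": 100, "status": "completed"},
    {"ride_id": "r1", "ingestion_timestamp": 105, "status": "completed"},  # retry duplicate
    {"ride_id": "r2", "ingestion_timestamp": 101, "status": "completed"},
]
silver = deduplicate(bronze)  # 2 unique rides; r1 keeps its latest copy
```

Choosing the maximum ingestion timestamp as the winner is one common policy; pipelines that must be deterministic across replays often break ties on a source offset instead.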
Silver updates run every 1 to 5 minutes for near-real-time use cases, or every 15 to 60 minutes for batch-oriented workloads. The output is clean, queryable tables used by downstream analytics.
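The Type 2 dimension merge mentioned above can be sketched as follows. This is a toy in-memory version with hypothetical driver fields; warehouses usually express it as a MERGE statement:

```python
OPEN_END = float("inf")  # sentinel valid_to for the currently open row version

def scd2_upsert(dim, change, event_time, key="driver_id"):
    """Close the open row for this key (if any), then append the
    new version with its valid_from/valid_to interval."""
    for row in dim:
        if row[key] == change[key] and row["valid_to"] == OPEN_END:
            row["valid_to"] = event_time  # old version is valid until now
    dim.append({**change, "valid_from": event_time, "valid_to": OPEN_END})
    return dim

dim = []
scd2_upsert(dim, {"driver_id": "d1", "city": "SF"}, event_time=10)
scd2_upsert(dim, {"driver_id": "d1", "city": "LA"}, event_time=20)  # driver moved
# dim now holds two versions: SF valid [10, 20), LA valid [20, open)
```

Keeping the closed rows, rather than overwriting them, is what lets downstream joins reconstruct the dimension as of any past event time.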
Silver Layer Processing SLA: 5 min max latency, 99.9% availability, 10M rows per run.
Gold Layer: Aggregated Metrics

Gold serves analysts and dashboards with pre-aggregated, highly optimized tables. Incremental jobs read silver, compute hourly or daily rollups (city-level demand, driver earnings, conversion funnels), and update gold tables. Gold tables are often materialized views with business-specific schemas, and they use aggressive partitioning and clustering for query performance. Analysts expect sub-5-second query latency at p95 for common dashboard queries, which is only possible because gold pre-computes expensive aggregations. The trade-off: gold lags silver by one additional processing window (5 to 15 minutes is typical). For use cases requiring real-time metrics, applications query silver directly despite the higher query cost.
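The hourly rollup described above amounts to a grouped count. A minimal sketch with illustrative field names (city, event_time in epoch seconds):

```python
from collections import defaultdict

def hourly_city_demand(silver_rides):
    """Roll silver ride rows up into (city, hour) ride counts for a gold table."""
    counts = defaultdict(int)
    for ride in silver_rides:
        hour = ride["event_time"] // 3600  # truncate epoch seconds to the hour
        counts[(ride["city"], hour)] += 1
    return dict(counts)

rides = [
    {"city": "SF", "event_time": 3600},
    {"city": "SF", "event_time": 3700},
    {"city": "NYC", "event_time": 7200},
]
gold = hourly_city_demand(rides)  # {("SF", 1): 2, ("NYC", 2): 1}
```

An incremental gold job would run this only over silver rows that arrived since the last run, then merge the partial counts into the existing gold partition rather than recomputing history.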
"Bronze holds the truth, silver enforces the rules, and gold serves the business. Each layer is incrementally updated and independently debuggable."
💡 Key Takeaways
Bronze layer stores raw, immutable events with minimal transformation: 50 TB daily at 1-second p99 latency, partitioned by date and source for compliance and replay
Silver layer applies incremental business logic (deduplication, joins, schema enforcement), running every 1 to 5 minutes with a 99.9% availability SLA
Gold layer pre-aggregates metrics for fast queries: sub-5-second p95 latency for dashboards, updated incrementally from silver with 5 to 15 minutes of additional lag
Each layer is independently testable and debuggable: failures in gold do not corrupt bronze, and you can replay transformations layer by layer
📌 Examples
1. Uber processes 10 billion ride events daily through a medallion architecture: bronze captures raw events within 1 second, silver deduplicates and joins driver and rider data within 3 minutes, and gold computes hourly city demand metrics within 10 minutes.
2. Databricks Delta Lake enables concurrent reads and incremental writes: 100 analysts query gold tables with 99.9% availability while ETL jobs continuously update the same tables using ACID transactions.