ETL/ELT Patterns • Transformation Layers (Bronze/Silver/Gold)
Bronze Layer: The Raw Data Foundation
The Ingestion Challenge: At scale, your Bronze layer handles tens of thousands of events per second from hundreds of heterogeneous sources. A single microservice might emit clickstream events, a database might send Change Data Capture (CDC) records, and SaaS tools push API dumps. Each has different formats, schemas, and reliability characteristics. Bronze must ingest everything without becoming a bottleneck.
The design philosophy is simple: capture first, ask questions later. Bronze prioritizes throughput and durability over schema validation or usability. If a source sends malformed JSON, you store it anyway with metadata flagging the issue. If records arrive out of order or duplicated, you keep all copies. This seems wasteful, but it solves a critical problem: once you drop or transform data at ingestion, you can never recover it.
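The capture-first philosophy can be sketched in a few lines. This is a minimal illustration, not a production ingester; the envelope field names (`raw_payload`, `validation_status`, `error_message`) are hypothetical but mirror the metadata described in this section. Note that validation only sets a flag; the payload is never altered or dropped:

```python
import json
from datetime import datetime, timezone

def ingest_raw(payload: bytes, source_system_id: str) -> dict:
    """Wrap an incoming payload in a Bronze envelope without altering it."""
    record = {
        "raw_payload": payload.decode("utf-8", errors="replace"),
        "source_system_id": source_system_id,
        "ingestion_timestamp": datetime.now(timezone.utc).isoformat(),
        "validation_status": "passed",
        "error_message": None,
    }
    try:
        json.loads(record["raw_payload"])  # validate only -- never transform
    except json.JSONDecodeError as exc:
        record["validation_status"] = "failed"
        record["error_message"] = f"json_parse_error: {exc.msg}"
    return record

ok = ingest_raw(b'{"user": 1}', "clickstream")    # stored with status 'passed'
bad = ingest_raw(b'{"user": ', "clickstream")     # stored anyway, flagged 'failed'
```

Even the malformed event lands in Bronze byte-for-byte, so it can be inspected or replayed later.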
Storage Strategy: Bronze typically uses append-only, partitioned storage in a data lake or lakehouse format like Delta Lake or Apache Iceberg. Partitions are usually by ingestion date or event timestamp, sometimes with a secondary partition by source system. For example, <code>/bronze/clickstream/date=2024-01-15/hour=14/</code> contains all clickstream events ingested during that hour.
Typical Bronze ingestion performance: roughly 50K events/sec throughput, with 30-90 seconds of end-to-end latency.
Each record includes critical metadata: source_system_id, ingestion_timestamp, schema_version, and sometimes a validation_status flag. This metadata becomes invaluable when debugging data quality issues months later. You can query: "Show me all records from source X that failed validation between midnight and 1am on January 15th."
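That forensic query might look like the following sketch. It filters an in-memory list for simplicity; in practice this would be a SQL query over the Bronze table, and the source name `source_x` is hypothetical:

```python
from datetime import datetime, timezone

def failed_records(records, source_system_id, start, end):
    """Return failed-validation records from one source within [start, end)."""
    return [
        r for r in records
        if r["source_system_id"] == source_system_id
        and r["validation_status"] == "failed"
        and start <= datetime.fromisoformat(r["ingestion_timestamp"]) < end
    ]

bronze = [
    {"source_system_id": "source_x", "validation_status": "failed",
     "ingestion_timestamp": "2024-01-15T00:12:03+00:00"},
    {"source_system_id": "source_x", "validation_status": "passed",
     "ingestion_timestamp": "2024-01-15T00:30:00+00:00"},
]
window_start = datetime(2024, 1, 15, 0, 0, tzinfo=timezone.utc)
window_end = datetime(2024, 1, 15, 1, 0, tzinfo=timezone.utc)
hits = failed_records(bronze, "source_x", window_start, window_end)
```

Because the validation status lives in per-record metadata, this kind of audit needs no access to the original source systems.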
Why This Matters: Regulatory and audit requirements often demand complete historical records. If your financial reporting has an error, auditors may require proof of what the source system actually said six months ago. Without Bronze, you have no ground truth. Additionally, when you discover a bug in Silver or Gold transformations, you can replay from Bronze with corrected logic rather than asking source systems to resend historical data (which they often cannot do).
⚠️ Common Pitfall: Teams sometimes try to "clean" data during Bronze ingestion to save storage costs. This breaks the entire model. If ingestion code drops malformed records silently, you lose forensic capability. If it transforms fields, you cannot replay with different logic. Bronze must be a faithful, immutable copy of source reality, warts and all.
The Storage Cost Trade-Off: At petabyte scale, storing everything raw is expensive. If you ingest 100 TB daily and retain Bronze for 2 years, that's 73 PB of storage at steady state. At cloud storage rates of roughly $20 per TB per month, you're paying about $1.5 million per month, roughly $17.5 million annually, just for Bronze. This is why the medallion model requires organizational commitment: you trade cost for governance, lineage, and replay capability. Cheaper alternatives, such as keeping only Silver, exist, but they sacrifice these benefits.
💡 Key Takeaways
✓Bronze stores append-only, lossless replicas with latency typically 30 to 90 seconds from source emission to storage availability
✓Each record includes metadata like source ID, ingestion timestamp, and schema version for forensic debugging and lineage tracking
✓Partitioning by date and source system enables efficient incremental consumption and targeted replay when bugs are discovered
✓Storage costs are substantial: 100 TB daily ingestion with 2-year retention reaches 73 PB, costing roughly $1.5 million per month for Bronze alone
✓Never drop or transform data in Bronze; validation failures should be flagged in metadata but the raw record must be preserved
📌 Examples
1. A clickstream event arrives with a malformed timestamp field. Bronze stores it with <code>validation_status='failed'</code> and <code>error_message='timestamp_parse_error'</code>. Six months later, when a dashboard shows anomalies, engineers can query Bronze to find all failed events from that period and determine if the issue was systemic.
2. When a CDC stream from a database sends duplicate records due to a connector bug, Bronze keeps all copies with their original <code>ingestion_timestamp</code> values. Silver deduplication logic later removes duplicates based on business keys, but Bronze preserves evidence of the duplication for root cause analysis.
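The Silver-side deduplication in the second example can be sketched as follows. This is a simplified in-memory version; the business key `order_id` is hypothetical, and a real pipeline would do this with a window function or MERGE over the lakehouse table. Bronze itself is never modified:

```python
def deduplicate(bronze_records, business_key):
    """Silver-layer sketch: keep the earliest-ingested copy per business key."""
    earliest = {}
    # Sort by ingestion time so the first copy seen per key is the earliest one.
    for rec in sorted(bronze_records, key=lambda r: r["ingestion_timestamp"]):
        earliest.setdefault(rec[business_key], rec)
    return list(earliest.values())

cdc = [
    {"order_id": "A1", "ingestion_timestamp": "2024-01-15T14:00:05+00:00"},
    {"order_id": "A1", "ingestion_timestamp": "2024-01-15T14:00:01+00:00"},  # connector duplicate
    {"order_id": "B2", "ingestion_timestamp": "2024-01-15T14:00:02+00:00"},
]
silver = deduplicate(cdc, "order_id")  # one record per order_id
```

Because both copies remain in Bronze, the duplication rate per connector can still be measured there for root cause analysis.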