
What Are Data Lakes and How Do They Work?

Data lakes are storage architectures that decouple storage from compute, keeping raw, heterogeneous data (structured, semi-structured, and unstructured formats) in object storage. Unlike traditional data warehouses, which require upfront schema design (schema on write), data lakes use schema on read: ingest everything first, then interpret and validate it later at processing time. This flexibility lets you store JSON logs, CSV files, Parquet tables, images, and videos all in the same system.

The typical organization follows a three-layer pattern. The raw or bronze layer contains immutable source copies exactly as ingested. The refined or silver layer holds cleaned and conformed data with quality checks applied. The curated or gold layer provides analytics-ready datasets optimized for specific use cases. This progression minimizes upfront modeling effort while supporting diverse workloads such as Machine Learning (ML) feature extraction, offline analytics, and long-term archival.

Storage costs are remarkably low, around $20 to $30 per terabyte per month for standard object storage tiers, with durability of 99.999999999% (eleven nines) and availability near 99.99% (four nines). This economic model enables companies like Meta to operate exabyte-scale data lakes with thousands of daily pipelines ingesting petabytes per day. However, without proper governance, lakes can become data swamps: undocumented datasets, conflicting schemas, and runaway storage costs.

The tradeoff is clear. You get massive scale and flexibility at minimal storage cost, but SQL query performance suffers compared to warehouses. Typical freshness is minutes to hours rather than seconds, and you sacrifice governance unless you add dedicated tooling layers.
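To make schema on read and the bronze/silver progression concrete, here is a minimal Python sketch (pandas, with pyarrow needed for the Parquet write). The paths, field names, and sample events are hypothetical, and a local directory stands in for an object store bucket.

```python
# Minimal schema-on-read sketch. Paths and field names are illustrative;
# a real lake would use object-store URIs (e.g. s3://...) instead of local dirs.
import json
import pathlib
import pandas as pd

BRONZE = pathlib.Path("lake/bronze/app_logs/dt=2024-01-01")
SILVER = pathlib.Path("lake/silver/app_logs/dt=2024-01-01")
BRONZE.mkdir(parents=True, exist_ok=True)
SILVER.mkdir(parents=True, exist_ok=True)

# Bronze: land raw events exactly as received -- no schema enforced at write time.
raw_events = [
    '{"user_id": 1, "event": "click", "ts": "2024-01-01T00:00:01Z"}',
    '{"user_id": "2", "event": "view", "ts": "2024-01-01T00:00:02Z", "extra": "ignored"}',
]
(BRONZE / "events.jsonl").write_text("\n".join(raw_events))

# Silver: the schema is applied on read -- parse, coerce types, drop bad rows.
records = [json.loads(line) for line in (BRONZE / "events.jsonl").read_text().splitlines()]
df = pd.DataFrame.from_records(records)
df["user_id"] = pd.to_numeric(df["user_id"], errors="coerce")  # coerce mixed types
df["ts"] = pd.to_datetime(df["ts"], errors="coerce", utc=True)
df = df.dropna(subset=["user_id", "ts"])[["user_id", "event", "ts"]]

# Write the conformed table as Parquet, ready for analytics (gold would build on this).
df.to_parquet(SILVER / "events.parquet", index=False)
```

The point is that the bronze write accepts anything, and all type decisions happen when the silver job reads the raw files back.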
💡 Key Takeaways
Schema on read allows ingesting data first without upfront modeling, supporting JSON, CSV, Parquet, images, and video in one system
Three-layer architecture: bronze (raw immutable copies), silver (cleaned data), gold (analytics-ready), minimizing upfront design
Object storage costs $20 to $30 per TB per month with 99.999999999% durability, enabling exabyte-scale lakes such as Meta's, which ingests petabytes per day (see the cost sketch after this list)
Freshness typically ranges from minutes to hours, acceptable for ML training and offline analytics but slower than the subsecond queries a warehouse delivers
Data swamp risk: without catalogs and governance, lakes accumulate undocumented datasets, conflicting schemas, duplicate tables, and runaway costs
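As referenced in the cost takeaway above, here is a back-of-envelope sketch of the storage economics, assuming the ~$23/TB/month figure; the per-layer volumes are made up purely to illustrate the arithmetic.

```python
# Back-of-envelope storage cost at ~$23 per TB per month (figure from above).
# The data volumes below are hypothetical, chosen only to show the arithmetic.
COST_PER_TB_MONTH = 23.0

layers_tb = {"bronze": 500, "silver": 120, "gold": 15}  # TB per layer (illustrative)

for layer, tb in layers_tb.items():
    print(f"{layer:>6}: {tb:>5} TB -> ${tb * COST_PER_TB_MONTH:,.0f}/month")

total_tb = sum(layers_tb.values())
print(f" total: {total_tb:>5} TB -> ${total_tb * COST_PER_TB_MONTH:,.0f}/month")
```

Even a multi-hundred-terabyte lake stays in the low tens of thousands of dollars per month at these rates, which is why keeping raw bronze copies indefinitely is usually affordable.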
📌 Examples
Meta operates an exabyte-scale data lake with thousands of daily pipelines ingesting petabytes per day, querying terabytes to petabytes per interactive session
Typical cloud setup: S3 or Azure Data Lake Storage at $23/TB/month, storing raw application logs (JSON), database exports (Parquet), and ML training images (PNG/JPEG) in the same bucket hierarchy (sketched below with boto3)
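A rough sketch of that bucket hierarchy using boto3. The bucket name, prefixes, and payload bytes are hypothetical placeholders, and running it requires AWS credentials and an existing bucket.

```python
# Sketch of the kind of bucket layout described above, using boto3.
# Bucket name and keys are hypothetical; real layouts vary by team convention.
import boto3

s3 = boto3.client("s3")
bucket = "acme-data-lake"  # hypothetical bucket

# Heterogeneous raw data lands side by side under a common prefix scheme:
uploads = {
    "bronze/app_logs/dt=2024-01-01/events.jsonl": b'{"event": "click"}\n',
    "bronze/db_exports/orders/dt=2024-01-01/part-0000.parquet": b"...",  # placeholder bytes
    "bronze/ml_images/cats/img_0001.png": b"...",                        # placeholder bytes
}
for key, body in uploads.items():
    s3.put_object(Bucket=bucket, Key=key, Body=body)

# Downstream jobs list a prefix and apply schema on read:
resp = s3.list_objects_v2(Bucket=bucket, Prefix="bronze/app_logs/dt=2024-01-01/")
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])
```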