Data Lakes & Lakehouses • Data Lake Architecture Patterns
What is Data Lake Architecture?
Definition
Data Lake Architecture is a storage and processing pattern that uses cheap object storage as the foundation, holding raw and processed data in open formats (like Parquet or ORC), then layers compute engines, catalogs, and governance on top to make that data queryable and manageable at scale.
The architecture has three layers. First, the storage layer holds all data as files in cheap object storage (Amazon S3, Azure Data Lake Storage, or Google Cloud Storage), organized into partitioned paths like /raw/clickstream/2024-01-15/.
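As a rough sketch of what landing data in the storage layer looks like, the snippet below writes a tiny batch of events as date-partitioned Parquet with pyarrow. The column names and the dt partition key are illustrative assumptions, not anything prescribed above.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# A tiny batch of raw clickstream events (hypothetical schema).
events = pa.table({
    "user_id": ["u1", "u2", "u1"],
    "event":   ["page_view", "click", "page_view"],
    "ts":      ["2024-01-15T14:02:11Z", "2024-01-15T14:02:15Z", "2024-01-15T14:03:40Z"],
    "dt":      ["2024-01-15", "2024-01-15", "2024-01-15"],
})

# Write open-format Parquet files into date-partitioned directories,
# e.g. raw/clickstream/dt=2024-01-15/<part>.parquet. With an s3:// URI
# (and s3fs installed), the same call targets object storage directly.
pq.write_to_dataset(events, root_path="raw/clickstream", partition_cols=["dt"])
```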
Second, the compute layer runs processing jobs. These are temporary clusters (Spark, Presto, or similar) that spin up, read files from the lake, process them, write results back, then shut down. You pay only for compute time.
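A minimal PySpark sketch of such a job is below; the s3://example-lake bucket and the raw/curated paths are hypothetical placeholders, and the cluster itself would be provisioned and torn down around this script.

```python
from pyspark.sql import SparkSession, functions as F

# Ephemeral batch job: the cluster exists only for the lifetime of this script.
spark = SparkSession.builder.appName("daily-clickstream-rollup").getOrCreate()

# Read one day of raw Parquet from the lake (path and columns are illustrative).
raw = spark.read.parquet("s3://example-lake/raw/clickstream/dt=2024-01-15/")

# Aggregate, then write the result back to the lake as a curated dataset.
daily_counts = raw.groupBy("user_id").agg(F.count("*").alias("events"))
daily_counts.write.mode("overwrite").parquet(
    "s3://example-lake/curated/daily_counts/dt=2024-01-15/"
)

# Shut the session down; with ephemeral clusters, compute billing stops here.
spark.stop()
```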
Third, the catalog layer tracks metadata. Services like AWS Glue, Databricks Unity Catalog, or Google Data Catalog maintain a registry of what datasets exist, their schemas, locations, and owners. This makes the lake discoverable instead of a chaotic dump.
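For a feel of what discovery against a catalog looks like, here is a sketch that lists datasets registered in AWS Glue via boto3; the clickstream database name and the region are made-up placeholders.

```python
import boto3

# Ask the catalog what datasets exist, where they live, and what their schemas are.
glue = boto3.client("glue", region_name="us-east-1")

for table in glue.get_tables(DatabaseName="clickstream")["TableList"]:
    location = table["StorageDescriptor"]["Location"]
    columns = [(c["Name"], c["Type"]) for c in table["StorageDescriptor"]["Columns"]]
    print(table["Name"], location, columns)
```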
✓ In Practice: Netflix stores raw viewing events, transformed session summaries, and curated recommendation training datasets all in the same lake. Different teams query at different layers using different engines, all reading from the same underlying storage.
The Key Innovation:
The pattern decouples who produces data from who consumes it. A mobile team can drop raw logs without knowing what analytics queries will run. An ML team can train models on years of historical data without copying it to a specialized store. All because storage is cheap and shared, while compute scales independently.
💡 Key Takeaways
✓ Storage and compute separation means you can store petabytes cheaply (roughly 20 to 30 dollars per TB per month in standard object storage tiers, less in colder tiers) while spinning compute up and down as needed, paying only for actual processing time
✓ Open file formats like Parquet or ORC enable any compute engine to read the data without vendor lock-in, so you can switch from Spark to Presto without migrating storage (see the interoperability sketch after this list)
✓ The catalog layer provides governance and discoverability, turning what could be a data swamp into a managed platform where teams can find and trust datasets
✓ Raw ingestion happens independently from consumption, allowing producers to dump data without blocking on consumer requirements or schema design decisions
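To illustrate the open-format takeaway above, the sketch below reads the same Parquet files with two different engines, pyarrow and DuckDB, with no copy or export step in between. The path and the event column are carried over from the earlier illustrative snippet, not from the article itself.

```python
import duckdb
import pyarrow.parquet as pq

path = "raw/clickstream/dt=2024-01-15/*.parquet"  # illustrative path

# Engine 1: pyarrow reads the files directly into an in-memory Arrow table.
table = pq.ParquetDataset("raw/clickstream/dt=2024-01-15/").read()
print(table.num_rows)

# Engine 2: DuckDB queries the very same files with SQL.
print(
    duckdb.sql(
        f"SELECT event, count(*) FROM read_parquet('{path}') GROUP BY event"
    ).fetchall()
)
```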
📌 Examples
1. A streaming pipeline writes 10 million clickstream events per hour as compressed Parquet files partitioned by hour into /raw/events/dt=2024-01-15/hour=14/, costing roughly 150 dollars per TB per year in S3 Standard-Infrequent Access
2. An analytics team runs a daily Spark job that reads a week of raw events (roughly 1.7 billion events, 400 GB compressed), aggregates user sessions, and writes curated tables to /curated/sessions/, taking 20 minutes on a 50-node cluster (sketched after these examples)
3. A data scientist queries three years of historical session data (5 PB uncompressed, 800 TB compressed) for ML model training using a temporary 200-node cluster that runs for 4 hours and then shuts down
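A sketch of what the session job in example 2 might look like in PySpark, assuming hypothetical paths and a precomputed session_id column; the filter on the dt partition column is what lets Spark prune the scan down to the seven matching directories instead of reading the whole dataset.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("weekly-sessions").getOrCreate()

# Read the raw dataset root, but filter on the partition column so Spark
# only touches the seven dt= directories for the target week.
events = (
    spark.read.parquet("s3://example-lake/raw/events/")
         .where((F.col("dt") >= "2024-01-09") & (F.col("dt") <= "2024-01-15"))
)

# Roll events up into per-user sessions (session_id assumed to exist upstream).
sessions = (
    events.groupBy("user_id", "session_id")
          .agg(F.min("ts").alias("started_at"),
               F.max("ts").alias("ended_at"),
               F.count("*").alias("event_count"))
)

# Write the curated table back to the lake for downstream consumers.
sessions.write.mode("overwrite").parquet("s3://example-lake/curated/sessions/")
spark.stop()
```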