Data Lakes & Lakehouses • Data Lake Architecture Patterns
What is Data Lake Architecture?
Definition
Data Lake Architecture is a storage and processing pattern that uses cheap object storage as the foundation, holding raw and processed data in open formats (like Parquet or ORC), then layers compute engines, catalogs, and governance on top to make that data queryable and manageable at scale.
The architecture has three layers. First, the storage layer holds all data as files in cheap object storage (Amazon S3, Azure Data Lake Storage, or Google Cloud Storage), organized into partitioned paths like /raw/clickstream/2024-01-15/.
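As a rough sketch of what landing data in the storage layer looks like, the snippet below writes a tiny batch of events as date-partitioned Parquet with pyarrow. The column names and the dt partition key are illustrative assumptions, not anything prescribed above.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# A tiny batch of raw clickstream events (hypothetical schema).
events = pa.table({
    "user_id": ["u1", "u2", "u1"],
    "event":   ["page_view", "click", "page_view"],
    "ts":      ["2024-01-15T14:02:11Z", "2024-01-15T14:02:15Z", "2024-01-15T14:03:40Z"],
    "dt":      ["2024-01-15", "2024-01-15", "2024-01-15"],
})

# Write open-format Parquet files into date-partitioned directories,
# e.g. raw/clickstream/dt=2024-01-15/<part>.parquet. With an s3:// URI
# (and s3fs installed), the same call targets object storage directly.
pq.write_to_dataset(events, root_path="raw/clickstream", partition_cols=["dt"])
```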
Second, the compute layer runs processing jobs. These are temporary clusters (Spark, Presto, or similar) that spin up, read files from the lake, process them, write results back, then shut down. You pay only for compute time.
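A minimal PySpark sketch of such a job is below; the s3://example-lake bucket and the raw/curated paths are hypothetical placeholders, and the cluster itself would be provisioned and torn down around this script.

```python
from pyspark.sql import SparkSession, functions as F

# Ephemeral batch job: the cluster exists only for the lifetime of this script.
spark = SparkSession.builder.appName("daily-clickstream-rollup").getOrCreate()

# Read one day of raw Parquet from the lake (path and columns are illustrative).
raw = spark.read.parquet("s3://example-lake/raw/clickstream/dt=2024-01-15/")

# Aggregate, then write the result back to the lake as a curated dataset.
daily_counts = raw.groupBy("user_id").agg(F.count("*").alias("events"))
daily_counts.write.mode("overwrite").parquet(
    "s3://example-lake/curated/daily_counts/dt=2024-01-15/"
)

# Shut the session down; with ephemeral clusters, compute billing stops here.
spark.stop()
```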
Third, the catalog layer tracks metadata. Services like AWS Glue, Databricks Unity Catalog, or Google Data Catalog maintain a registry of what datasets exist, their schemas, locations, and owners. This makes the lake discoverable instead of a chaotic dump.
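For a feel of what discovery against a catalog looks like, here is a sketch that lists datasets registered in AWS Glue via boto3; the clickstream database name and the region are made-up placeholders.

```python
import boto3

# Ask the catalog what datasets exist, where they live, and what their schemas are.
glue = boto3.client("glue", region_name="us-east-1")

for table in glue.get_tables(DatabaseName="clickstream")["TableList"]:
    location = table["StorageDescriptor"]["Location"]
    columns = [(c["Name"], c["Type"]) for c in table["StorageDescriptor"]["Columns"]]
    print(table["Name"], location, columns)
```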
✓ In Practice: Netflix stores raw viewing events, transformed session summaries, and curated recommendation training datasets all in the same lake. Different teams query at different layers using different engines, all reading from the same underlying storage.
The Key Innovation:
The pattern decouples who produces data from who consumes it. A mobile team can drop raw logs without knowing what analytics queries will run. An ML team can train models on years of historical data without copying it to a specialized store. All because storage is cheap and shared, while compute scales independently.
💡 Key Takeaways
✓ Storage and compute separation means you can store petabytes cheaply (roughly 20 to 30 dollars per TB per month in standard object storage tiers, less in colder tiers) while spinning compute up and down as needed, paying only for actual processing time
✓ Open file formats like Parquet or ORC enable any compute engine to read the data without vendor lock-in, so you can switch from Spark to Presto without migrating storage (see the interoperability sketch after this list)
✓ The catalog layer provides governance and discoverability, turning what could be a data swamp into a managed platform where teams can find and trust datasets
✓ Raw ingestion happens independently from consumption, allowing producers to dump data without blocking on consumer requirements or schema design decisions
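To illustrate the open-format takeaway above, the sketch below reads the same Parquet files with two different engines, pyarrow and DuckDB, with no copy or export step in between. The path and the event column are carried over from the earlier illustrative snippet, not from the article itself.

```python
import duckdb
import pyarrow.parquet as pq

path = "raw/clickstream/dt=2024-01-15/*.parquet"  # illustrative path

# Engine 1: pyarrow reads the files directly into an in-memory Arrow table.
table = pq.ParquetDataset("raw/clickstream/dt=2024-01-15/").read()
print(table.num_rows)

# Engine 2: DuckDB queries the very same files with SQL.
print(
    duckdb.sql(
        f"SELECT event, count(*) FROM read_parquet('{path}') GROUP BY event"
    ).fetchall()
)
```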
📌 Examples
1. A streaming pipeline writes 10 million clickstream events per hour as compressed Parquet files partitioned by hour into /raw/events/dt=2024-01-15/hour=14/, costing roughly 150 dollars per TB per year in S3 Standard-Infrequent Access
2. An analytics team runs a daily Spark job that reads a week of raw events (roughly 1.7 billion events, 400 GB compressed), aggregates user sessions, and writes curated tables to /curated/sessions/, taking 20 minutes on a 50-node cluster (sketched after these examples)
3. A data scientist queries three years of historical session data (5 PB uncompressed, 800 TB compressed) for ML model training using a temporary 200-node cluster that runs for 4 hours and then shuts down
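A sketch of what the session job in example 2 might look like in PySpark, assuming hypothetical paths and a precomputed session_id column; the filter on the dt partition column is what lets Spark prune the scan down to the seven matching directories instead of reading the whole dataset.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("weekly-sessions").getOrCreate()

# Read the raw dataset root, but filter on the partition column so Spark
# only touches the seven dt= directories for the target week.
events = (
    spark.read.parquet("s3://example-lake/raw/events/")
         .where((F.col("dt") >= "2024-01-09") & (F.col("dt") <= "2024-01-15"))
)

# Roll events up into per-user sessions (session_id assumed to exist upstream).
sessions = (
    events.groupBy("user_id", "session_id")
          .agg(F.min("ts").alias("started_at"),
               F.max("ts").alias("ended_at"),
               F.count("*").alias("event_count"))
)

# Write the curated table back to the lake for downstream consumers.
sessions.write.mode("overwrite").parquet("s3://example-lake/curated/sessions/")
spark.stop()
```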