
What is Delta Lake? The Transaction Problem in Data Lakes

Definition
Delta Lake is a transaction layer that sits on top of data lake storage (like Amazon S3 or Azure Data Lake Storage), providing ACID (Atomicity, Consistency, Isolation, Durability) guarantees for big data workloads.
The Core Problem
Traditional data lakes store files (typically in Parquet format) in folders on object storage. You write files, register them in a metastore, and hope readers never see incomplete data. This breaks down at scale. Imagine an e-commerce company with 100 concurrent jobs writing clickstream data to the same table. Job A starts writing 50 files at 10:00am. Job B starts reading at 10:00:05am, while Job A is still uploading files 30 through 50. Job B might see only the first 29 files, producing incorrect analytics. Worse, if Job A crashes after writing 40 files, those orphaned files corrupt the table, and the corruption spreads to downstream reports.

How Delta Lake Solves This
Delta Lake introduces a transaction log: a special folder of JSON files that records every change to the table. When Job A commits its 50 files, it writes a single JSON entry to the log listing all 50 files as one atomic unit. Job B reconstructs the table state by reading the log, not by listing files in the folder. If Job A crashes before committing, those 40 files simply never appear in the log, so they are invisible to readers. No corruption. The transaction log becomes the source of truth: readers and writers coordinate through it, which lets many jobs safely read and write the same table simultaneously.
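To make the log-replay idea concrete, here is a minimal Python sketch of how a reader could rebuild the list of live data files from the _delta_log folder. It assumes the open Delta Lake log layout (newline-delimited JSON commit files containing "add" and "remove" actions); the table path is hypothetical, and real readers such as Spark or delta-rs also handle checkpoints, schema, and protocol versions.

```python
import json
from pathlib import Path

def replay_delta_log(table_path: str) -> set[str]:
    """Reconstruct the set of live data files by replaying _delta_log commits.

    Simplified sketch: real implementations also read Parquet checkpoints,
    check the protocol version, and apply metadata/schema actions.
    """
    log_dir = Path(table_path) / "_delta_log"
    live_files: set[str] = set()

    # Commit files are named 00000000000000000000.json, ...01.json, and so on.
    # Replaying them in order yields the table state after the last commit.
    for commit in sorted(log_dir.glob("*.json")):
        for line in commit.read_text().splitlines():
            action = json.loads(line)
            if "add" in action:        # a data file became part of the table
                live_files.add(action["add"]["path"])
            elif "remove" in action:   # a data file was logically deleted
                live_files.discard(action["remove"]["path"])

    return live_files

# Hypothetical local table path.
print(replay_delta_log("/tmp/events_delta"))
```

Because a crashed job never writes its commit file, its orphaned data files never enter the returned set, which is exactly why readers cannot see them.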
✓ In Practice: Delta Lake is the default table format on Databricks and is widely used in production to manage petabyte-scale data lakes with hundreds of concurrent jobs.
💡 Key Takeaways
Traditional data lakes lack transactions, so concurrent writers race with each other and readers can observe partial writes
Delta Lake adds a transaction log (sequence of JSON files) that records every change atomically, becoming the source of truth for table state
ACID stands for Atomicity (all-or-nothing commits), Consistency (only valid states), Isolation (readers see consistent snapshots), and Durability (commits survive failures)
Readers reconstruct table state by replaying the log instead of listing files, guaranteeing they never see incomplete or inconsistent data (see the snapshot-read sketch after this list)
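The isolation and log-replay points above can be observed directly through Spark's Delta APIs. A minimal sketch, assuming a SparkSession configured with the standard Delta Lake extensions and a hypothetical local table path:

```python
from pyspark.sql import SparkSession

# Assumes the Delta Lake package is on the classpath; these two configs are
# the standard way to enable Delta's SQL extensions and catalog.
spark = (
    SparkSession.builder.appName("delta-snapshot-demo")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

table_path = "/tmp/events_delta"  # hypothetical path

# Each read resolves the latest committed snapshot from the transaction log;
# a writer committing concurrently cannot make this result partially stale.
latest = spark.read.format("delta").load(table_path)

# Time travel: read the table as of an earlier committed version.
v0 = spark.read.format("delta").option("versionAsOf", 0).load(table_path)

print(latest.count(), v0.count())
```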
📌 Examples
1. Without Delta Lake: 100 concurrent writers to a Parquet table create race conditions. A reader listing files during writes might see 40 of 50 files from one job, causing aggregate queries to return incorrect totals.
2. With Delta Lake: The same 100 writers each commit atomically via the log. A reader starting at 10:00:05am replays log entries up to that point, seeing only complete commits. Incomplete writes are invisible (sketched in the snippet below).
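A small sketch of that contrast, reusing the Delta-enabled SparkSession from the earlier snippet (paths and column names are hypothetical):

```python
events = spark.createDataFrame(
    [("home", "2024-01-01T10:00:00"), ("cart", "2024-01-01T10:00:03")],
    ["page", "ts"],
)

# Plain Parquet append: new files appear in the directory one by one, so a
# reader listing the folder mid-write can see a partial batch.
events.write.mode("append").parquet("/tmp/events_parquet")

# Delta append: the files become visible only once the commit JSON lands in
# _delta_log/, so concurrent readers see the whole batch or none of it.
events.write.format("delta").mode("append").save("/tmp/events_delta")

# A query started mid-write runs against the last committed snapshot.
spark.read.format("delta").load("/tmp/events_delta") \
    .groupBy("page").count().show()
```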