
How Table Formats Provide ACID Guarantees

The Challenge: Object storage like S3 has no concept of a transaction across objects. If you write 10 files for a new partition and your job crashes after writing 5, readers may see partial data. If two writers update the same table simultaneously, they can produce conflicting file sets with no way to reconcile the changes. Traditional databases solve this with transaction logs and locks, but object storage has no such mechanism built in.

The Transaction Log Approach (Delta Lake): Delta Lake maintains an append only transaction log stored as JSON files alongside your data in S3. Each commit writes a new log entry (numbered sequentially: 00000.json, 00001.json, and so on) that describes which data files were added or removed, along with schema changes and metadata updates. Readers replay the log entries to compute the current table state.
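The log replay step can be sketched in a few lines. This is a simplified illustration, not the real Delta Lake commit schema: each hypothetical commit here just lists paths under "add" and "remove" keys, and replaying commits in order yields the set of live data files.

```python
import json

def replay_log(commits):
    """Replay commit entries in order to compute the current file set."""
    live_files = set()
    for commit in commits:
        entry = json.loads(commit)
        # A removed file drops out of the table; an added file joins it.
        for path in entry.get("remove", []):
            live_files.discard(path)
        for path in entry.get("add", []):
            live_files.add(path)
    return live_files

# Two commits: the first adds three files, the second removes one
# and adds two more (invented file names for illustration).
log = [
    json.dumps({"add": ["part-1.parquet", "part-2.parquet", "part-3.parquet"]}),
    json.dumps({"remove": ["part-1.parquet"],
                "add": ["part-4.parquet", "part-5.parquet"]}),
]
print(sorted(replay_log(log)))
# ['part-2.parquet', 'part-3.parquet', 'part-4.parquet', 'part-5.parquet']
```

Because each commit is a single new object in the log, a reader always sees either all of a commit's changes or none of them, which is what makes the replayed state atomic.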
Optimistic Concurrency Example: Writer A reads version 42. Writer B commits version 43 first. Writer A detects the conflict and retries.
When Writer A tries to commit version 43, it sees that the version already exists. It retries by reading the new state (version 43) and attempting version 44. This optimistic concurrency control provides atomicity without distributed locks. Under high write contention (50+ concurrent writers on the same partition), 10 to 20% of commits may retry, adding 200 to 500 milliseconds of latency.

The Metadata Tree Approach (Iceberg): Iceberg uses a layered metadata structure. The top level table metadata file (a JSON file) points to the current snapshot. Each snapshot references manifest list files, which reference manifest files, which record details about data files (path, partition values, row counts, column statistics). This tree structure enables efficient metadata caching and parallel planning. At Netflix scale (10+ petabytes, millions of files), query planning reads only the metadata tree (a few hundred KB to a few MB) instead of listing all files, keeping planning time under 1 second.

The Timeline Approach (Hudi): Hudi maintains a timeline of commits as a sequence of files in the .hoodie directory. Each commit file records metadata about the operation (insert, update, delete) and the affected files. Hudi supports two table types: Copy On Write (COW), where updates rewrite full data files, and Merge On Read (MOR), where updates go to delta log files that are later compacted. MOR enables high frequency upserts (100k+ events per second from Change Data Capture streams) with minimal write amplification.
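The commit/retry cycle described above can be modeled with an in-memory sketch. Here a dict of version number to commit payload stands in for the log directory, and a put-if-absent check stands in for the atomic object write that real implementations rely on; the class and method names are invented for illustration.

```python
class Table:
    """Toy transaction log: version number -> commit payload."""
    def __init__(self):
        self.log = {0: "initial"}

    def latest_version(self):
        return max(self.log)

    def try_commit(self, version, payload):
        # Atomic put-if-absent: fails if another writer already
        # claimed this version number.
        if version in self.log:
            return False
        self.log[version] = payload
        return True

def commit_with_retry(table, payload, max_retries=5):
    """Optimistic concurrency: read latest state, attempt next version,
    retry from the new state on conflict."""
    for _ in range(max_retries):
        target = table.latest_version() + 1
        # ... a real writer would re-validate its changes against the
        # state it just read before retrying ...
        if table.try_commit(target, payload):
            return target
    raise RuntimeError("too much write contention; giving up")

# Replay of the example above: both writers start from version 42.
t = Table()
t.log = {42: "snapshot"}
target = t.latest_version() + 1            # Writer A plans version 43
t.try_commit(43, "writer B commit")        # Writer B commits 43 first
assert not t.try_commit(target, "writer A commit")   # A detects conflict
print(commit_with_retry(t, "writer A commit"))       # A retries -> 44
```

No locks are held at any point; correctness comes entirely from the fact that claiming a version number either succeeds exactly once or fails visibly.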
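Iceberg's planning-from-metadata idea can also be sketched. The structures below are invented stand-ins (the real format stores table metadata, manifest lists, and manifests as separate files with richer statistics); the point is that planning walks only the metadata tree and prunes whole manifests by their partition-value ranges, never listing data files in object storage.

```python
# Hypothetical metadata tree: table metadata -> snapshot -> manifests,
# where each manifest records a partition-value range for its files.
manifests = [
    {"partition_range": ("2024-01-01", "2024-01-31"),
     "files": [{"path": "s3://tbl/jan-1.parquet", "rows": 1_000_000}]},
    {"partition_range": ("2024-02-01", "2024-02-29"),
     "files": [{"path": "s3://tbl/feb-1.parquet", "rows": 900_000}]},
]
table_metadata = {"current_snapshot": {"manifest_list": manifests}}

def plan_scan(table_metadata, day):
    """Return paths of data files that may contain rows for `day`."""
    matched = []
    for manifest in table_metadata["current_snapshot"]["manifest_list"]:
        lo, hi = manifest["partition_range"]
        if lo <= day <= hi:   # prune entire manifests by their stats
            matched.extend(f["path"] for f in manifest["files"])
    return matched

print(plan_scan(table_metadata, "2024-02-14"))
# ['s3://tbl/feb-1.parquet']
```

The January manifest is pruned without ever touching its data files, which is why planning cost scales with metadata size rather than file count.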
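Merge On Read semantics can be illustrated the same way. The structures here are invented for the sketch (Hudi's actual log blocks and record keys are more involved): upserts and deletes land in a delta log keyed by record key, and a reader merges base-file rows with the latest delta entry per key, which is the work that periodic compaction folds into new base files.

```python
# Base file contents (key -> row) plus a delta log of pending changes.
base = {"user-1": {"name": "Ada", "score": 10},
        "user-2": {"name": "Grace", "score": 20}}

delta_log = [
    ("upsert", "user-2", {"name": "Grace", "score": 25}),
    ("delete", "user-1", None),
    ("upsert", "user-3", {"name": "Edsger", "score": 30}),
]

def read_merged(base, delta_log):
    """Merge-on-read: apply delta entries over base rows in commit order."""
    merged = dict(base)
    for op, key, row in delta_log:
        if op == "delete":
            merged.pop(key, None)
        else:
            merged[key] = row
    return merged

print(sorted(read_merged(base, delta_log)))
# ['user-2', 'user-3']
```

Writes stay cheap because an upsert only appends one small delta entry; the cost of rewriting base files is deferred to compaction, which is the trade-off that makes MOR suit high frequency CDC ingestion.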
💡 Key Takeaways
Delta Lake uses append only transaction logs with optimistic concurrency: writers read the current version, make changes, then atomically commit the next version, retrying if a conflict is detected
Iceberg uses a metadata tree (table metadata → manifest lists → manifests → data files) that enables sub second query planning even with millions of files by caching and pruning
Hudi maintains a timeline of commits with Copy On Write (rewrite full files) or Merge On Read (delta logs) table types, optimizing for different write patterns
At high write concurrency (50+ writers), Delta may see 10 to 20% commit retries adding 200 to 500ms latency, requiring partition design that minimizes contention
📌 Examples
1. Delta Lake transaction log: commit 00042.json adds 3 files, commit 00043.json removes 1 file and adds 2 more. A reader replays both to compute the current file set atomically
2. Iceberg at Netflix: with 10 million data files, the metadata tree is only 50 MB, enabling query planning in under 1 second vs scanning file listings, which could take minutes
3. Hudi MOR table ingests 200k CDC events per second into delta logs, runs compaction every 30 minutes to merge logs into base files, keeping read query latency under 3 seconds