How Iceberg Commits Work: Snapshot Isolation Mechanics
The Commit Challenge:
When multiple writers try to modify a table on object storage, you face a fundamental problem: object stores like S3 provide no built-in locking or cross-file transactions, and no way to atomically replace one set of files with another. Two processes can both read the current state, make changes, and overwrite each other's work without ever knowing. Iceberg solves this through optimistic concurrency control combined with an atomic catalog update.
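To make the race concrete, here is a minimal sketch in Python of why a plain read-then-overwrite pointer update loses work while a compare-and-swap update detects the collision. The `catalog` dictionary and both commit functions are purely illustrative; they are not Iceberg or catalog APIs.

```python
# Hypothetical in-memory "catalog" holding one metadata pointer per table.
# This illustrates the race only; it is not Iceberg's real catalog API.
catalog = {"orders": "metadata-v42.json"}

def commit_blindly(table, new_metadata):
    # Read-then-overwrite: the last writer silently wins,
    # erasing whatever another writer committed in between.
    catalog[table] = new_metadata

def commit_cas(table, expected_metadata, new_metadata):
    # Compare-and-swap: succeed only if the pointer still matches
    # what this writer originally read; otherwise signal a conflict.
    if catalog[table] != expected_metadata:
        return False  # conflict detected, caller must retry
    catalog[table] = new_metadata
    return True

# Two writers both read "metadata-v42.json" as the current state.
base = catalog["orders"]
commit_blindly("orders", "metadata-v43-writer1.json")  # writer 1 lands...
commit_blindly("orders", "metadata-v43-writer2.json")  # ...writer 2 silently erases it

catalog["orders"] = base  # reset and replay with compare-and-swap
print(commit_cas("orders", base, "metadata-v43-writer1.json"))  # True: pointer advanced
print(commit_cas("orders", base, "metadata-v43-writer2.json"))  # False: conflict, must retry
```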
The Five-Step Commit Process:
1. Read current metadata: Writer loads the table metadata file to get the current snapshot ID, schema version, and partition spec. This takes 10 to 50 milliseconds to fetch one small JSON file from object storage.
2. Write data files: Writer outputs new Parquet files to object storage, typically 512 MB to 1 GB each to balance parallelism and file count. At 100 MB/sec write speed per worker, a 1 GB file takes about 10 seconds.
3. Build manifest entries: For each data file, the writer computes statistics (row count, column min/max values, null counts) and creates a manifest entry. These entries are written to new manifest files.
4. Write new table metadata: Writer creates a new snapshot pointing to the manifest list, generates a new table metadata file with an incremented version, and writes it to object storage. Critical: the old metadata file remains unchanged.
5. Atomic catalog update: Writer calls the catalog API (Hive Metastore, AWS Glue, REST catalog) to update the table pointer from the old metadata file to the new one, but ONLY if the current pointer still matches what the writer originally read. This compare-and-swap operation is the critical atomic step.
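The flow above can be sketched in a few lines of Python. Everything here is a simplified illustration under assumed names: `fs` stands for any object-store client, `catalog.swap_pointer` for a catalog's compare-and-swap, and the metadata layout is heavily reduced (no manifest-list indirection, no schema or partition handling).

```python
import json
import time
import uuid

def commit_append(fs, catalog, table, new_parquet_paths):
    """Sketch of the five commit steps for an append.

    `fs` and `catalog` are hypothetical stand-ins: an object-store client
    with read/write by path, and a catalog exposing compare-and-swap on
    the table's metadata pointer.
    """
    # Step 1: read the current table metadata (one small JSON file).
    base_pointer = catalog.current_metadata_path(table)
    base_meta = json.loads(fs.read(base_pointer))

    # Step 2 happened before this call: new_parquet_paths were already
    # written to object storage, in parallel, with no coordination.

    # Step 3: build manifest entries with per-file statistics.
    entries = [{"path": p, "stats": fs.column_stats(p)} for p in new_parquet_paths]
    manifest_path = f"metadata/manifest-{uuid.uuid4()}.json"
    fs.write(manifest_path, json.dumps(entries))

    # Step 4: write a brand-new metadata file; the old one is never touched.
    new_meta = dict(base_meta)
    new_meta["snapshot_id"] = base_meta["snapshot_id"] + 1
    new_meta["manifest_list"] = [manifest_path] + base_meta["manifest_list"]
    new_meta["timestamp_ms"] = int(time.time() * 1000)
    new_pointer = f"metadata/v{new_meta['snapshot_id']}.metadata.json"
    fs.write(new_pointer, json.dumps(new_meta))

    # Step 5: atomically swap the catalog pointer, only if nobody else
    # committed since step 1. Returns False on conflict.
    return catalog.swap_pointer(table, expected=base_pointer, new=new_pointer)
```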
When Conflicts Happen:
If two writers both start from snapshot 42, both create new files and metadata, and both try to update the catalog, the first succeeds and moves the pointer to snapshot 43. The second writer's compare-and-swap fails because the current snapshot is now 43, not 42.
The losing writer must retry. It reads the new current snapshot (43), validates or merges its changes against what just got committed, creates a new snapshot (44) on top of 43, and attempts the commit again. Properly implemented writers handle this automatically; if retries are not implemented, the job fails and its output never becomes part of the table.
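A retry loop around that commit might look like the following sketch, reusing the hypothetical `commit_append` from the earlier block. The comment marks where a real writer would run its conflict validation; those rules are omitted here.

```python
def commit_with_retries(fs, catalog, table, new_parquet_paths, max_retries=5):
    """Optimistic commit: rebuild metadata and retry until the swap wins."""
    for attempt in range(max_retries):
        # Each attempt re-reads whatever snapshot is now current (42 on the
        # first try, 43 after losing a race) and rebuilds only the small
        # metadata files; the data files already written are reused as-is.
        if commit_append(fs, catalog, table, new_parquet_paths):
            return True
        # Lost the race: another writer advanced the pointer first.
        # A real writer would validate here that its changes do not
        # conflict with what was just committed before trying again.
    raise RuntimeError(f"gave up after {max_retries} conflicting commits")
```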
The Performance Story:
The beauty is that the expensive part (writing gigabytes of data files) happens in parallel without coordination. Only the final metadata commit needs coordination, and that operates on kilobyte-sized files. Catalog updates typically complete in 50 to 300 milliseconds, so commit latency remains low even at high concurrency. Netflix reports handling thousands of commits per day across thousands of tables with this pattern.
Commit Timeline: START (Snap 42) → WRITER 1 commits (Snap 43) → WRITER 2 retries and commits (Snap 44)
💡 Key Takeaways
✓ Optimistic concurrency allows writers to work in parallel writing data files without coordination. Only the final catalog update requires an atomic compare-and-swap operation.
✓ The catalog stores just a pointer to the current metadata file. All actual data and metadata live in object storage. This keeps the catalog lightweight and highly available.
✓ Commits fail fast when conflicts occur. A writer detects the conflict in milliseconds during the catalog update, and on retry it only rebuilds the small metadata files against the new snapshot; the gigabytes of data files it already wrote are reused rather than rewritten.
✓ Snapshot isolation means readers always see a consistent view. A query that starts on snapshot 42 continues reading snapshot 42 even if writers commit snapshots 43, 44, and 45 during the query execution (see the sketch after this list).
✓ Typical commit latency is 50 to 300 milliseconds for the metadata operations, but the overall write job takes seconds to minutes depending on data volume, since writing data files dominates the time.
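As a reader-side sketch of the snapshot-isolation point above, again using the same hypothetical `fs` and `catalog` helpers and reduced metadata layout as the earlier blocks:

```python
import json

def scan_table(fs, catalog, table):
    """A reader pins the snapshot it started from and never sees later commits."""
    # Resolve the pointer once, at query start. Writers may commit snapshots
    # 43, 44, 45 while this scan runs; this reader keeps using the metadata
    # file it pinned, which is immutable and never rewritten.
    pinned_pointer = catalog.current_metadata_path(table)
    meta = json.loads(fs.read(pinned_pointer))
    for manifest_path in meta["manifest_list"]:
        for entry in json.loads(fs.read(manifest_path)):
            yield entry["path"]  # data files visible as of the pinned snapshot
```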
📌 Examples
1. A streaming job writes 5 GB of new data every minute. It buffers events, writes 10 files of 512 MB each to S3 (about 50 seconds at 100 MB/sec of aggregate write throughput), then commits by updating one metadata pointer. The commit itself takes 150 milliseconds, but the end-to-end cycle is about 60 seconds (the timing is worked through in the sketch after these examples).
2. Two batch jobs both read snapshot 100 at midnight, spend 30 minutes processing, and try to commit at 12:30 AM. The first commits snapshot 101 successfully. The second gets a conflict, re-reads snapshot 101, verifies its new data does not overlap, creates snapshot 102, and commits successfully 200 milliseconds later.
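A back-of-the-envelope check of the first example's timing, assuming the 100 MB/sec figure is the aggregate write throughput across all ten file writers:

```python
# Rough timing for example 1: 10 files x 512 MB, committed once per cycle.
files = 10
file_size_mb = 512
aggregate_write_mb_per_s = 100   # assumed total throughput across all writers
commit_s = 0.150                 # metadata pointer swap

write_s = files * file_size_mb / aggregate_write_mb_per_s  # ~51 seconds
cycle_s = write_s + commit_s                               # commit time is noise
print(f"data write ~{write_s:.0f} s, commit {commit_s * 1000:.0f} ms, "
      f"cycle ~{cycle_s:.0f} s of a 60 s budget")
```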