Apache Iceberg Table Format

What is Apache Iceberg?

Definition
Apache Iceberg is a table format that brings database-level semantics (transactions, schema evolution, time travel) to data lakes sitting on object storage like Amazon S3 or Google Cloud Storage.
The Core Problem: Traditional data lakes act like giant file systems. You store terabytes or petabytes of Parquet or ORC files in S3, organized into directories like year=2024/month=01/day=15/. This is cheap and scales well, but it behaves nothing like a database.

What breaks? First, concurrent writes corrupt your data: if two Spark jobs both try to add files to the same table at the same time, one might overwrite the other's changes. Second, schema evolution is a nightmare: adding a column means either rewriting all files (expensive) or having inconsistent schemas across files (confusing). Third, no time travel: once you delete or modify data, it's gone forever. Fourth, slow queries: engines must list directories and scan file metadata for every query.

How Iceberg Fixes This: Iceberg introduces a metadata layer that tracks exactly which files belong to your table at any point in time. Instead of relying on directory structure, Iceberg maintains three levels of metadata. First, data files contain your actual rows in formats like Parquet. Second, manifest files list batches of data files along with statistics like row counts and per-column minimum and maximum values. Third, a table metadata file defines your schema, partitioning strategy, and which manifests make up the current snapshot.
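Here is a minimal sketch of how those layers surface to users, assuming a local Spark session with the Iceberg runtime JAR on the classpath; the catalog name demo, the table demo.db.events, its warehouse path, and its columns are placeholders invented for this example, not anything specified above:

```python
from pyspark.sql import SparkSession

# Assumes the Iceberg Spark runtime is available; catalog name and warehouse
# path are placeholders.
spark = (
    SparkSession.builder
    .appName("iceberg-sketch")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.demo.type", "hadoop")
    .config("spark.sql.catalog.demo.warehouse", "/tmp/iceberg-warehouse")
    .getOrCreate()
)

# Creating the table writes a table metadata file; correctness does not
# depend on any directory-naming convention.
spark.sql("""
    CREATE TABLE demo.db.events (
        event_id BIGINT,
        user_id  BIGINT,
        ts       TIMESTAMP
    ) USING iceberg
    PARTITIONED BY (days(ts))
""")

# Each write produces new data files, new manifests, and a new table
# metadata file that becomes the current snapshot.
spark.sql("INSERT INTO demo.db.events VALUES (1, 42, TIMESTAMP '2024-01-15 10:00:00')")

# Iceberg exposes its metadata layers as queryable metadata tables.
spark.sql("SELECT snapshot_id, operation FROM demo.db.events.snapshots").show()
spark.sql("SELECT path, added_data_files_count FROM demo.db.events.manifests").show()
spark.sql("SELECT file_path, record_count FROM demo.db.events.files").show()
```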
✓ In Practice: When you query an Iceberg table, your engine reads one small metadata file (a few kilobytes) to discover the schema and current snapshot. It then reads the manifest files (megabytes in total) to find the relevant data files, using their column statistics to skip files that can't match your query filters, often the vast majority. Only then does it read the actual data files from object storage.
This design enables ACID transactions through atomic metadata updates, snapshot isolation for consistent reads, and time travel by keeping old metadata around. All of this works on the same cheap object storage you already use, accessible from Spark, Flink, Trino, Presto, and other engines.
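Continuing the hypothetical demo.db.events table from the sketch above, snapshot-isolated reads and time travel look roughly like this (the exact time-travel SQL syntax depends on your Spark and Iceberg versions):

```python
# A normal read always sees one consistent snapshot, even while writers commit.
spark.sql(
    "SELECT count(*) FROM demo.db.events WHERE ts >= TIMESTAMP '2024-01-15 00:00:00'"
).show()

# Pick an existing snapshot ID from the snapshots metadata table...
first_snapshot = spark.sql(
    "SELECT snapshot_id FROM demo.db.events.snapshots ORDER BY committed_at LIMIT 1"
).first()["snapshot_id"]

# ...and read the table exactly as it was at that snapshot.
spark.sql(f"SELECT * FROM demo.db.events VERSION AS OF {first_snapshot}").show()

# Recent Spark/Iceberg versions also accept a wall-clock form, e.g.
#   SELECT * FROM demo.db.events TIMESTAMP AS OF '2024-01-15 12:00:00'
```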
💡 Key Takeaways
Iceberg is a table format, not a storage system. It works on top of your existing object storage (S3, GCS, Azure Blob) and makes it behave like a transactional database.
Three metadata layers create the magic: table metadata defines schema and snapshots, manifest files track collections of data files with statistics, and data files contain the actual rows in Parquet or ORC.
Atomic commits happen by updating a single pointer in a catalog (like Hive Metastore) from the old metadata file to the new one, providing snapshot isolation and preventing concurrent write conflicts (see the sketch after this list).
All engines (Spark, Flink, Trino, Presto) can read and write the same Iceberg table correctly because the format is standardized and engine agnostic.
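To make the pointer-swap idea concrete, here is a toy in-memory model of the optimistic commit protocol. None of these classes or functions are the real Iceberg or catalog APIs; every name below is invented purely for illustration:

```python
# Toy model of Iceberg-style optimistic commits: the catalog holds one pointer
# per table (the location of its current metadata file), and a commit is an
# atomic swap of that pointer.

class InMemoryCatalog:
    def __init__(self):
        self._pointers = {}

    def current_metadata(self, table):
        return self._pointers.get(table)

    def compare_and_swap(self, table, expected, new):
        # Stand-in for the catalog's atomic swap (Hive Metastore, REST catalog, ...).
        if self._pointers.get(table) == expected:
            self._pointers[table] = new
            return True
        return False


def commit(catalog, table, base, new, max_retries=3):
    """Try to move the table pointer from `base` to `new`, retrying on conflict."""
    for _ in range(max_retries):
        current = catalog.current_metadata(table)
        if current != base:
            # Another writer committed first. Real Iceberg rebases the pending
            # changes on the latest snapshot before retrying; here we simply
            # retarget the expected base.
            base = current
        if catalog.compare_and_swap(table, expected=base, new=new):
            return new  # exactly one writer wins each swap
    raise RuntimeError(f"could not commit to {table} after {max_retries} attempts")


catalog = InMemoryCatalog()
catalog.compare_and_swap("db.events", expected=None, new="metadata/v1.json")
commit(catalog, "db.events", base="metadata/v1.json", new="metadata/v2.json")
```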
📌 Examples
1. Your data lake has 50,000 Parquet files totaling 10 TB. Without Iceberg, adding a column means either rewriting all files or documenting that files created after a certain date have the new column. With Iceberg, you update the schema in one metadata file commit, and readers automatically handle the difference using stable column IDs.
2. Two Spark jobs simultaneously try to append data. Without Iceberg, both might succeed but one overwrites files from the other, causing data loss. With Iceberg, each creates a new snapshot from the current state. The first commit succeeds, the second detects the conflict, rebases on the new snapshot, and retries, ensuring both datasets are preserved.
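Both examples can be sketched against the hypothetical demo.db.events table used above (names remain placeholders, and this assumes the same Iceberg-enabled Spark session):

```python
from datetime import datetime

# Example 1: schema evolution is a metadata-only commit; no data files are
# rewritten, and readers resolve old and new files via stable column IDs.
spark.sql("ALTER TABLE demo.db.events ADD COLUMN country STRING")

# Example 2: appends commit new snapshots. If another writer commits first,
# Iceberg retries the commit against the fresh snapshot, so neither append
# is lost.
new_rows = spark.createDataFrame(
    [(2, 7, datetime(2024, 1, 15, 11, 0, 0), "US")],
    ["event_id", "user_id", "ts", "country"],
)
new_rows.writeTo("demo.db.events").append()
```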