Apache Iceberg, explained interactively

What Iceberg actually is

Apache Iceberg is an open table format. It does not store your data - your data still lives as ordinary Parquet (or ORC/Avro) files in a data lake such as S3, GCS, or ADLS. What Iceberg adds is a layer of metadata that turns those files into a table a query engine can trust: one with a schema, transactions, and a full history.

For years, teams used the older "Hive" approach, where a table was simply a directory and its partitions were sub-folders. That seems tidy until you try to rename a column, change how data is partitioned, delete a few rows, or let two jobs write at once. Directory-as-a-table has no atomic commit and no real schema, so those operations are slow, brittle, or unsafe.

Iceberg fixes this by tracking the exact set of files that make up the table in versioned metadata. Every change writes new metadata and new data files, then flips a single pointer to make the change visible all at once. That one idea - immutable data plus versioned metadata - is what gives you ACID commits, time-travel, safe schema evolution, and partition changes without rewriting history.

The mental model: data files never change in place. To change a table you write new files and atomically swap which files "count". Old files stick around until you deliberately clean them up - which is exactly why you can travel back in time.

The metadata tree

Everything Iceberg does comes from a small tree of files. At the top is the catalog, which holds one pointer: the location of the current metadata file. Below it sits the metadata, the snapshots, the manifests, and finally the data files themselves.

Click through the tree below. Notice how a single snapshot points to exactly one manifest list, which points to manifests, which list the actual data files along with their statistics. Toggle the format version to see how row-level deletes appear in v2 and become deletion vectors in v3.

Iceberg metadata tree

Loading interactive tree…

Catalog → metadata.json → snapshot → manifest list → manifests → data & delete files. The two older snapshots are collapsed - expand them to see how snapshots share files.

What each layer holds

The metadata file (metadata.json) is rewritten on every commit and carries the table UUID, every schema (each with stable field-ids), every partition spec, sort orders, properties, the snapshot log, and the current-snapshot-id. The manifest list - one per snapshot - records each manifest with partition-range summaries so a planner can skip whole manifests. Each manifest lists data (and delete) files with per-file, per-column stats: record counts, null counts, and value bounds. Those stats are what make queries fast: the engine prunes files it can prove are irrelevant before reading a single byte of data.

Snapshots, writes & time-travel

A snapshot is an immutable view of the whole table at one moment. Every write - append, update, delete - produces a brand-new snapshot and leaves the previous ones intact. Because old snapshots still reference their files, you can query the table "as of" any of them, or roll back.

Run some operations below. Watch how an append only adds files, how an update or delete writes a small delete file rather than rewriting data (merge-on-read), and how rollback is just another commit. Then click any point on the timeline to time-travel and see the rows that were visible then.

Highlighted rows changed in the snapshot you are viewing. Time-travel by clicking a snapshot; the latest commit stays put.

How commits stay safe (ACID)

When a writer commits, it prepares new files and then asks the catalog to swap the metadata pointer with a compare-and-swap: "move the pointer from the snapshot I started from to my new one". If another writer committed in the meantime, the swap fails and the writer retries against the fresh state. That optimistic concurrency is how Iceberg gives you serialisable commits without locking the data, and how multiple engines can safely write the same table.

Old snapshots are not kept forever. expire_snapshots drops history beyond a retention window and deletes files no longer referenced by any live snapshot - a safe, deliberate garbage-collection step rather than something that happens on every write.

Schema & partition evolution

This is where Iceberg feels like magic compared with directory-based tables. You can add, drop, rename, reorder, or widen columns, and change how the table is partitioned - all without rewriting a single existing data file.

The trick for schema changes is field-ids. Every column has an immutable numeric id; names and positions are just labels on top. Rename status and its id stays the same, so older files are still read correctly through an id-to-column mapping. Try it:

Schema - field-ids stay stable

Partition spec - old and new coexist

Hidden partitioning & pruning

WHERE order_date =

Partition evolution keeps old files on their original spec; queries read across both. With hidden partitioning you filter on the real column and Iceberg prunes for you.

Hidden partitioning

In Hive tables you had to create a separate partition column (say order_day) and remember to filter on it, or your query would scan everything. Iceberg removes that footgun. You declare a transform - day(order_date), month(order_date), bucket(16, customer_id), truncate(10, region) - and Iceberg derives the partition value itself. You write a normal WHERE order_date = ... and the planner prunes partitions using manifest bounds. And because the partition spec is versioned, you can switch from monthly to daily partitioning later without touching the data already written.

Storage internals & the ecosystem

How does Iceberg apply an update or delete under the hood? Two strategies:

Copy-on-write (COW): rewrite the whole data files that contain affected rows. Reads stay fast (no merging), but writes are heavier. Good for tables read far more than they are changed.
Merge-on-read (MOR): leave data files in place and record the change in a small delete file, merged at read time. Writes are cheap and quick - ideal for streaming and frequent updates - at the cost of a little read-time work, which periodic compaction cleans up.

The format spec has evolved alongside this. v1 covered analytic tables with append and full rewrites. v2 added row-level deletes via delete files (the basis of merge-on-read) and sequence numbers. v3 - ratified in mid-2025 - replaces positional delete files with compact binary deletion vectors stored in Puffin files (at most one per data file per snapshot), and adds new types (including variant and geospatial), default column values, and row-lineage tracking. A v4 is in early development. You opt into a version with 'format-version' = '3'. v3 is ratified, but engine support is still catching up - Spark and Flink lead, others are partial - so confirm your engines before defaulting new tables to v3.

-- Create an Iceberg table (v3 = deletion vectors for row-level deletes)
CREATE TABLE prod.analytics.orders (
    order_id     BIGINT,
    customer_id  BIGINT,
    order_date   DATE,
    amount       DECIMAL(10,2),
    status       STRING,
    region       STRING
) USING iceberg
PARTITIONED BY (days(order_date))
TBLPROPERTIES ('format-version' = '3');

-- Evolve the schema: no data rewrite, field-ids stay stable
ALTER TABLE prod.analytics.orders ADD COLUMN channel STRING;
ALTER TABLE prod.analytics.orders RENAME COLUMN status TO order_status;

-- Time-travel by snapshot id or by timestamp
SELECT * FROM prod.analytics.orders VERSION AS OF 3055491638000123;
SELECT * FROM prod.analytics.orders TIMESTAMP AS OF '2026-03-15 00:00:00';

-- Roll back, then reclaim storage and compact small files
CALL prod.system.rollback_to_snapshot('analytics.orders', 9182000000000);
CALL prod.system.expire_snapshots('analytics.orders', TIMESTAMP '2026-03-01 00:00:00');
CALL prod.system.rewrite_data_files('analytics.orders');

from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .config("spark.sql.catalog.prod", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.prod.type", "rest")
    .config("spark.sql.catalog.prod.uri", "https://polaris.example.com/api/catalog")
    .getOrCreate())

# Append rows: a new snapshot + data file; old files are untouched
df.writeTo("prod.analytics.orders").append()

# Read a historic snapshot (as-of timestamp in epoch millis)
(spark.read
    .option("as-of-timestamp", "1773532800000")
    .table("prod.analytics.orders")
    .show())

# Inspect the metadata tree through Iceberg's metadata tables
spark.table("prod.analytics.orders.snapshots").show()
spark.table("prod.analytics.orders.files").show()

-- Register a REST catalog (e.g. Apache Polaris)
CREATE CATALOG prod WITH (
    'type' = 'iceberg',
    'catalog-type' = 'rest',
    'uri' = 'https://polaris.example.com/api/catalog'
);

-- Streaming upsert into an Iceberg table (merge-on-read)
SET 'execution.checkpointing.interval' = '30s';

INSERT INTO prod.analytics.orders
SELECT order_id, customer_id, order_date, amount, status, region
FROM kafka_orders;

-- Batch read of the current snapshot
SELECT region, COUNT(*) AS orders
FROM prod.analytics.orders
GROUP BY region;

The same table is readable and writable from many engines - Spark, Flink, Trino, Dremio, DuckDB, Snowflake, BigQuery - because they all agree on the open spec.

Catalogs & who reads Iceberg

The catalog is what makes a table shareable. The REST catalog spec standardised how engines talk to a catalog over HTTP, and open implementations such as Apache Polaris (a top-level Apache project since early 2026), Lakekeeper, and Nessie now let many engines share one governed set of tables - alongside managed catalogs like AWS Glue, Snowflake, and BigQuery. On the read/write side, Spark, Flink, Trino/Starburst, Dremio, DuckDB, Snowflake, BigQuery, and the Python pyiceberg library all speak Iceberg. The practical upshot: one copy of the data, no lock-in, and engines chosen per workload.

Keep tables healthy with routine maintenance: rewrite_data_files compacts small files, rewrite_manifests tidies metadata, expire_snapshots trims history, and remove_orphan_files deletes files no manifest references.

Key takeaways

Iceberg is a table format, not a storage engine: ordinary files in your lake, plus versioned metadata that makes them a trustworthy table.
The metadata tree - catalog, metadata.json, snapshots, manifest lists, manifests, data/delete files - is the whole game; file and partition statistics drive fast pruning.
Every write is a new immutable snapshot, so time-travel and rollback come for free; commits are ACID via compare-and-swap on the catalog pointer.
Stable field-ids make schema changes safe with no rewrite, and versioned partition specs let old and new layouts coexist.
Copy-on-write favours readers; merge-on-read favours writers; v3 deletion vectors make row-level deletes cheap. Many engines share the one open table.

Check your understanding

Five quick questions. Pick an answer for each and check your score.