Lakehouse table formats compared

Why table formats exist

A data lake is just files in object storage. The old "Hive" approach treated a folder of those files as a table and its sub-folders as partitions. That falls apart the moment you need atomic commits, safe schema changes, row-level updates, or concurrent writers. An open table format fixes this by adding a metadata layer that tracks exactly which files make up the table, with transactions and history on top.
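
To make the metadata-layer idea concrete, here is a minimal in-memory sketch in Python. It is not any format's actual implementation: the names `Snapshot`, `TableMetadata`, `commit`, and `files_as_of` are hypothetical, and real formats persist this state in metadata files or a catalog. It only illustrates the three properties named above: an exact file list, atomic commits, and history.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Snapshot:
    """One committed version of the table: an immutable set of data files."""
    snapshot_id: int
    files: frozenset


@dataclass
class TableMetadata:
    """Hypothetical metadata layer: a history of snapshots, newest last."""
    snapshots: list = field(default_factory=lambda: [Snapshot(0, frozenset())])

    @property
    def current(self) -> Snapshot:
        return self.snapshots[-1]

    def commit(self, base_id: int, added=(), removed=()) -> Snapshot:
        # Optimistic concurrency: a writer that planned against a stale
        # snapshot must retry instead of silently overwriting newer work.
        if base_id != self.current.snapshot_id:
            raise RuntimeError("conflict: table changed since snapshot %d" % base_id)
        files = (self.current.files - frozenset(removed)) | frozenset(added)
        snap = Snapshot(base_id + 1, files)
        self.snapshots.append(snap)  # the single step that publishes the new version
        return snap

    def files_as_of(self, snapshot_id: int) -> frozenset:
        """Time travel: the file list of every earlier version is still known."""
        return self.snapshots[snapshot_id].files
```

A reader never lists a folder in this model. It asks the metadata for the current snapshot's files, so a half-finished write is invisible until `commit` succeeds.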

Four established formats dominate, and an emerging fifth rethinks where the metadata should live. None of them started from the same place:

Apache Iceberg - born at Netflix, vendor-neutral, analytics-first, broadest engine support.
Delta Lake - born at Databricks (now Linux Foundation), tightest Spark integration.
Apache Hudi - born at Uber for upserts, CDC, and incremental processing.
Apache Paimon - born from Flink (formerly Flink Table Store), streaming-first.
DuckLake - from the DuckDB team (v1.0, April 2026), keeps your data in Parquet but stores all table metadata in a SQL database instead of metadata files.

How they really differ

Strip away the marketing and the differences come down to a few axes.

Write strategy: copy-on-write vs merge-on-read

Copy-on-write rewrites whole data files when rows change: fast reads, heavier writes. Merge-on-read records changes in small delete or log files and merges them at read time, which makes writes cheap and quick at some read cost until compaction folds the changes back into the data files. Hudi and Paimon were designed around merge-on-read. Iceberg and Delta began as copy-on-write and have now converged at the data layer: with Iceberg v3 GA on Databricks, Snowflake, and AWS, both share Parquet plus deletion vectors, so data files written through one can be read through the other without being rewritten (the table metadata still has to be generated or translated). DuckLake writes plain Parquet with delete files too, but routes every change through its SQL catalog, inlining small ones in the database itself, so a commit is a database transaction rather than a round of metadata-file writes.
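
The trade-off between the two write strategies can be sketched with a toy model in Python. Everything here is an assumption for illustration: data files are plain lists of rows, delete markers are sets of row positions (a simplified stand-in for a deletion vector), and the function names are hypothetical rather than any format's API.

```python
def cow_delete(files: dict, file_name: str, row_id: int) -> dict:
    """Copy-on-write: rewrite the whole file without the deleted row."""
    rewritten = [row for row in files[file_name] if row["id"] != row_id]
    new_files = dict(files)
    del new_files[file_name]
    new_files[file_name + ".v2"] = rewritten  # a complete new data file
    return new_files


def mor_delete(deletes: dict, file_name: str, position: int) -> dict:
    """Merge-on-read: record the deleted position; the data file is untouched."""
    new_deletes = {name: set(pos) for name, pos in deletes.items()}
    new_deletes.setdefault(file_name, set()).add(position)
    return new_deletes


def mor_read(files: dict, deletes: dict) -> list:
    """Readers pay the merge cost: skip positions marked as deleted."""
    rows = []
    for name in sorted(files):
        dead = deletes.get(name, set())
        rows.extend(r for i, r in enumerate(files[name]) if i not in dead)
    return rows


def compact(files: dict, deletes: dict) -> tuple:
    """Compaction folds the delete markers back into rewritten files."""
    compacted = {
        name: [r for i, r in enumerate(rows) if i not in deletes.get(name, set())]
        for name, rows in files.items()
    }
    return compacted, {}
```

In this model a copy-on-write delete costs a full file rewrite per change, while a merge-on-read delete costs one small marker per change plus a filter on every read until `compact` runs.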

Streaming and CDC

This is the sharpest divide. Paimon is streaming-first and Flink-native, with first-class changelog producers. Hudi has deep CDC and incremental-query support and upsert DNA from its Uber origins. Iceberg is analytics-first (it added changelog reads in v3). Delta now ships Auto CDF and row tracking in 4.1, which makes change feeds easier, though Delta is still not changelog-native the way Paimon and Hudi are. DuckLake is batch-first, exposing changes between snapshots rather than a streaming changelog.
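
The batch-style approach, deriving changes by comparing two snapshots, can be sketched as a simple diff in Python. This is an illustrative assumption, not how any of the formats computes changes internally: a snapshot is modeled as a dict from primary key to row, the function name `diff_snapshots` is hypothetical, and the row-kind labels (`+I`, `-U`, `+U`, `-D`) are borrowed from Flink's changelog notation.

```python
def diff_snapshots(before: dict, after: dict) -> list:
    """Derive a changelog from two keyed snapshots of the same table."""
    changes = []
    for key in sorted(before.keys() | after.keys()):
        if key not in before:
            changes.append(("+I", key, after[key]))   # insert
        elif key not in after:
            changes.append(("-D", key, before[key]))  # delete
        elif before[key] != after[key]:
            changes.append(("-U", key, before[key]))  # update: old image
            changes.append(("+U", key, after[key]))   # update: new image
    return changes
```

A diff like this only sees the net effect between the two snapshots. If a row changed twice in between, the intermediate state is lost, which is the gap a changelog written at commit time is meant to close.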

Catalogs, governance, and partitioning

The catalog is now where the real contest sits. Both Iceberg and Delta 4.1 support catalog-managed tables, and the live rivalry is Unity Catalog versus Apache Polaris (an Apache top-level project since February 2026). Iceberg still has the richest catalog ecosystem - the REST catalog spec plus open implementations (Polaris, Nessie, Lakekeeper) and managed options (Glue, Unity, Snowflake) - and the most flexible partitioning, with true hidden partitioning and partition-spec evolution; the other file-based formats largely expose partition columns you manage yourself. DuckLake is catalog-native by design: the SQL database is the catalog, which is what makes its commits and query planning fast.
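
Hidden partitioning means the partition value is derived from a source column by a transform, so queries filter on the column itself and the planner maps that filter onto partitions. The Python sketch below illustrates the idea under stated assumptions: the function names are hypothetical, and `zlib.crc32` is a stand-in hash chosen only to keep the example dependency-free (the Iceberg spec defines its bucket transform with a Murmur3 hash, so real bucket numbers will differ).

```python
import zlib
from datetime import date

EPOCH = date(1970, 1, 1)


def day_transform(d: date) -> int:
    """day(order_date): days since 1970-01-01, derived rather than user-supplied."""
    return (d - EPOCH).days


def bucket_transform(n: int, value: int) -> int:
    """bucket(n, col) with a stand-in hash (CRC32 here, not Iceberg's Murmur3)."""
    return zlib.crc32(str(value).encode("utf-8")) % n


def partition_of(row: dict) -> tuple:
    """Partition tuple for day(order_date), bucket(16, customer_id)."""
    return (day_transform(row["order_date"]), bucket_transform(16, row["customer_id"]))


def prune(partitions: set, start: date, end: date) -> set:
    """Translate a filter on order_date into a filter on partition values."""
    lo, hi = day_transform(start), day_transform(end)
    return {p for p in partitions if lo <= p[0] <= hi}
```

Because the transform lives in table metadata rather than in an extra column, changing it later (say, from daily to monthly) is a metadata change that applies to newly written data, which is what partition-spec evolution refers to.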

The format war is cooling

The most important development is not a winner - it is convergence. At the data layer the fight is effectively over: with Iceberg v3 now GA on Databricks, Snowflake, and AWS, Iceberg and Delta share Parquet, deletion vectors, variant, and row tracking, so you can write in one and read in many without copies:

Delta UniForm lets a Delta table emit Iceberg (and Hudi) metadata over the same Parquet files, so other engines can read it without copies.
Apache XTable (formerly OneTable, still Apache Incubating) translates metadata between Iceberg, Delta, and Hudi - vendor-neutral, no data rewrite.
The Iceberg REST catalog is becoming the de-facto standard for governed, multi-engine access, with Apache Polaris (a top-level project since February 2026) as an open implementation.
The next step is shared metadata. Iceberg v4 and Delta 5.0 are being designed around a common adaptive metadata tree (Delta RFC #6640), Hudi 1.1 added a pluggable table-format framework, and DuckLake takes a different route entirely - putting the whole catalog in a SQL database.

The honest caveat: convergence is real at the data-and-read layer, but on write you still commit to one format and, increasingly, one catalog. So choose for your write path and your catalog - then let UniForm, XTable, and the REST catalog handle the reads.

Which should you choose?

There is no single best format - only a best fit for your situation. Work through the questions below for a starting point, then read the guide.

Find your starting point

What is your primary workload? Batch analytics / BI, streaming / low-latency CDC, or frequent upserts on existing data.
Which is your primary engine or platform? Databricks / Spark-first, Flink-first, Trino / Dremio / many engines, or Snowflake / BigQuery.
What matters most? Vendor neutrality and avoiding lock-in, simplicity with tight Spark integration, lowest-latency freshness, or best upsert performance.
Do many engines need to read the same table? Several engines, or mostly one.

A pragmatic guide

Vendor-neutral analytics, many engines: Iceberg. Broadest support and an open REST catalog.
Databricks-centric shop: Delta as primary, with UniForm for Iceberg reads elsewhere.
Streaming-first on Flink: Paimon; on Spark with strong CDC, Hudi.
Heavy upserts / incremental on Spark: Hudi.
DuckDB-centric, simple lakehouse: DuckLake - metadata in a SQL database, data in Parquet, very few moving parts.
The catalog now matters more than the format: at the data layer Iceberg and Delta have converged, so invest in your catalog (Unity Catalog or Apache Polaris) and let interop handle the reads.
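
The guide above can be written down as a small rule table in Python. This is only a sketch of the starting points listed here: the function name `recommend_format`, the string labels, and especially the order in which the rules are checked are assumptions, and a real decision would weigh catalog choice and multi-engine reads as well.

```python
def recommend_format(workload: str, engine: str) -> str:
    """Map a (workload, engine) pair to the starting point from the guide.

    workload: "batch", "streaming", or "upserts"
    engine:   "databricks", "spark", "flink", "duckdb", or anything else
              (treated as a multi-engine setup)
    """
    if engine == "duckdb":
        return "DuckLake"
    if engine == "flink" and workload == "streaming":
        return "Paimon"
    if engine == "databricks":
        return "Delta"  # with UniForm for Iceberg reads elsewhere
    if engine == "spark" and workload in ("streaming", "upserts"):
        return "Hudi"
    return "Iceberg"  # vendor-neutral default for many engines
```

For example, `recommend_format("upserts", "spark")` returns `"Hudi"`, while an unlisted engine such as `"trino"` falls through to `"Iceberg"`.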

Create the same table in each

The same orders table, five formats. Notice how each one's defaults reveal its priorities - Iceberg's hidden partitioning, Hudi's primary key and merge-on-read, Paimon's changelog producer, DuckLake's SQL catalog.

-- Apache Iceberg: hidden partitioning, v3 deletion vectors for deletes
CREATE TABLE ecom.orders (
    order_id     BIGINT,
    customer_id  BIGINT,
    order_date   DATE,
    amount       DECIMAL(10,2),
    status       STRING
) USING iceberg
PARTITIONED BY (day(order_date), bucket(16, customer_id))
TBLPROPERTIES ('format-version' = '3');

-- Delta Lake 4.1: liquid clustering, deletion vectors, catalog-managed via Unity
CREATE TABLE ecom.orders (
    order_id     BIGINT,
    customer_id  BIGINT,
    order_date   DATE,
    amount       DECIMAL(10,2),
    status       STRING
) USING delta
CLUSTER BY (order_date, customer_id)
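TBLPROPERTIES (
    'delta.enableDeletionVectors' = 'true',
    -- UniForm also needs a matching 'delta.enableIcebergCompatV*' property;
    -- IcebergCompatV2 does not allow deletion vectors, so check your version.
    'delta.universalFormat.enabledFormats' = 'iceberg'   -- expose Iceberg reads
);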

-- Apache Hudi: merge-on-read, primary key, ordering field for upserts
CREATE TABLE ecom.orders (
    order_id     BIGINT,
    customer_id  BIGINT,
    order_date   DATE,
    amount       DECIMAL(10,2),
    status       STRING,
    updated_at   TIMESTAMP
) USING hudi
PARTITIONED BY (order_date)
TBLPROPERTIES (
    'type' = 'mor',
    'primaryKey' = 'order_id',
    'preCombineField' = 'updated_at',
    'hoodie.table.cdc.enabled' = 'true'
);

-- Apache Paimon (Flink): streaming primary-key table with a changelog
CREATE TABLE ecom.orders (
    order_id     BIGINT,
    customer_id  BIGINT,
    order_date   STRING,
    amount       DECIMAL(10,2),
    status       STRING,
    -- with fixed buckets, the primary key must include the partition column
    PRIMARY KEY (order_date, order_id) NOT ENFORCED
)
PARTITIONED BY (order_date)
WITH (
    'changelog-producer' = 'lookup',
    'bucket' = '16',
    'file.format' = 'parquet'
);

-- DuckLake: metadata in a SQL catalog (Postgres here), data in Parquet
INSTALL ducklake;
INSTALL postgres;   -- needed to reach the Postgres catalog database
ATTACH 'ducklake:postgres:dbname=catalog host=localhost' AS lake
    (DATA_PATH 's3://warehouse/ecom/');

CREATE TABLE lake.orders (
    order_id     BIGINT,
    customer_id  BIGINT,
    order_date   DATE,
    amount       DECIMAL(10,2),
    status       VARCHAR
);
-- Iceberg-style partition transforms, tracked in the catalog
ALTER TABLE lake.orders SET PARTITIONED BY (day(order_date), bucket(16, customer_id));

All five keep data in open columnar files (usually Parquet). What changes is where the metadata lives - files for most, a SQL database for DuckLake - and the defaults each format encourages.
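
What "metadata in a SQL database" buys you can be sketched with Python's built-in `sqlite3`. The schema below is a deliberately tiny, hypothetical one (two tables, made-up column names), not DuckLake's real catalog schema, and SQLite stands in for whatever database hosts the catalog. The point it illustrates is that a commit becomes one database transaction and time travel becomes an ordinary query.

```python
import sqlite3


def open_catalog() -> sqlite3.Connection:
    """Create an in-memory catalog with a hypothetical two-table schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE snapshot (snapshot_id INTEGER PRIMARY KEY);
        CREATE TABLE data_file (
            path           TEXT PRIMARY KEY,
            begin_snapshot INTEGER NOT NULL,
            end_snapshot   INTEGER          -- NULL while the file is still live
        );
        """
    )
    return conn


def commit(conn: sqlite3.Connection, added=(), removed=()) -> int:
    """Publish a new snapshot; all rows land together or not at all."""
    with conn:  # one transaction: committed on success, rolled back on error
        cur = conn.execute("SELECT COALESCE(MAX(snapshot_id), 0) + 1 FROM snapshot")
        snap = cur.fetchone()[0]
        conn.execute("INSERT INTO snapshot VALUES (?)", (snap,))
        conn.executemany(
            "INSERT INTO data_file (path, begin_snapshot) VALUES (?, ?)",
            [(p, snap) for p in added],
        )
        conn.executemany(
            "UPDATE data_file SET end_snapshot = ? "
            "WHERE path = ? AND end_snapshot IS NULL",
            [(snap, p) for p in removed],
        )
    return snap


def files_at(conn: sqlite3.Connection, snap: int) -> list:
    """Time travel: the files that were live as of a given snapshot."""
    rows = conn.execute(
        "SELECT path FROM data_file WHERE begin_snapshot <= ? "
        "AND (end_snapshot IS NULL OR end_snapshot > ?) ORDER BY path",
        (snap, snap),
    )
    return [r[0] for r in rows]
```

In a file-based format the same commit means writing new metadata files and then atomically swapping a pointer to them. Here the database's own transaction machinery does that work.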
