Why table formats exist
A data lake is just files in object storage. The old "Hive" approach treated a folder of those files as a table and its sub-folders as partitions. That falls apart the moment you need atomic commits, safe schema changes, row-level updates, or concurrent writers. An open table format fixes this by adding a metadata layer that tracks exactly which files make up the table, with transactions and history on top.
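To make the idea of a metadata layer concrete, here is a minimal sketch in plain Python. It is a toy model, not how any of the formats below is implemented: `ToyTable`, `Snapshot`, and the file names are hypothetical, and real formats persist this state in object storage or a catalog rather than in memory. It shows the three things the paragraph above describes: an exact file list per version, an atomic commit that rejects a stale writer, and history you can read back.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Snapshot:
    """One committed version of the table: an id plus the exact files in it."""
    snapshot_id: int
    files: frozenset


@dataclass
class ToyTable:
    """Hypothetical in-memory metadata layer (an assumption for illustration)."""
    snapshots: list = field(default_factory=lambda: [Snapshot(0, frozenset())])

    def current(self) -> Snapshot:
        return self.snapshots[-1]

    def commit(self, base_id: int, add=(), remove=()) -> Snapshot:
        # Optimistic concurrency: reject the commit if another writer got in first.
        head = self.current()
        if base_id != head.snapshot_id:
            raise RuntimeError(f"conflict: based on {base_id}, head is {head.snapshot_id}")
        new_files = (head.files - frozenset(remove)) | frozenset(add)
        snap = Snapshot(head.snapshot_id + 1, new_files)
        self.snapshots.append(snap)  # the single atomic step in this toy
        return snap

    def files_as_of(self, snapshot_id: int) -> frozenset:
        # Time travel: history is just the older snapshots.
        return self.snapshots[snapshot_id].files


table = ToyTable()
s1 = table.commit(base_id=0, add={"part-0001.parquet", "part-0002.parquet"})
s2 = table.commit(base_id=s1.snapshot_id,
                  add={"part-0003.parquet"}, remove={"part-0001.parquet"})

try:
    # A second writer that still thinks s1 is the head must not win silently.
    table.commit(base_id=s1.snapshot_id, add={"part-0004.parquet"})
    stale_writer_rejected = False
except RuntimeError:
    stale_writer_rejected = True
```

A folder listing cannot give you any of this: a reader that lists the directory mid-write sees a half-finished table, whereas a reader of the snapshot list only ever sees committed versions.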
Four established formats dominate, and an emerging fifth rethinks where the metadata should live. None started from the same place:
- Apache Iceberg - born at Netflix, vendor-neutral, analytics-first, broadest engine support.
- Delta Lake - born at Databricks (now Linux Foundation), tightest Spark integration.
- Apache Hudi - born at Uber for upserts, CDC, and incremental processing.
- Apache Paimon - born from Flink (formerly Flink Table Store), streaming-first.
- DuckLake - from the DuckDB team (v1.0, April 2026), keeps your data in Parquet but stores all table metadata in a SQL database instead of metadata files.
How they really differ
Strip away the marketing and the differences come down to a few axes.
Write strategy: copy-on-write vs merge-on-read
Copy-on-write rewrites whole data files when rows change: reads stay fast, writes get heavier. Merge-on-read records changes in small delete or log files and merges them at read time: writes are cheap and quick, and reads pay a modest cost until compaction folds the changes back into the base files. Hudi and Paimon were designed around merge-on-read. Iceberg and Delta began as copy-on-write and have since converged at the data layer - with Iceberg v3 GA on Databricks, Snowflake, and AWS, both use Parquet data files plus deletion vectors, so the same files can be read under either format without rewriting them.
DuckLake writes plain Parquet with delete files too, but routes every change - even small inlined ones - through its SQL catalog, so a commit is a single database transaction and stays fast.
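The trade-off is easier to see with numbers. The sketch below is a toy simulation, assuming a "file" is just a Python dict of key to value and that one delta record stands in for a delete file or deletion vector; all function names are hypothetical. It updates one row both ways and counts how many rows each strategy has to write.

```python
def cow_update(files, key, new_value):
    """Copy-on-write: rewrite every data file that contains the key."""
    rows_written = 0
    out = []
    for rows in files:
        if key in rows:
            rows = {**rows, key: new_value}  # a whole new file replaces the old one
            rows_written += len(rows)
        out.append(rows)
    return out, rows_written


def mor_update(files, deltas, key, new_value):
    """Merge-on-read: leave data files alone and append one small delta record."""
    return files, deltas + [(key, new_value)], 1


def mor_read(files, deltas):
    """Readers pay the cost: base rows are merged with deltas at query time."""
    merged = {}
    for rows in files:
        merged.update(rows)
    for key, value in deltas:  # later deltas win
        merged[key] = value
    return merged


def compact(files, deltas):
    """Compaction folds deltas back into a base file so reads are cheap again."""
    return [mor_read(files, deltas)], []


base = [{"o1": 10, "o2": 20, "o3": 30}, {"o4": 40, "o5": 50}]

cow_files, cow_rows_written = cow_update(base, "o2", 99)
mor_files, mor_deltas, mor_rows_written = mor_update(base, [], "o2", 99)

cow_view = {k: v for rows in cow_files for k, v in rows.items()}
mor_view = mor_read(mor_files, mor_deltas)
compacted_files, compacted_deltas = compact(mor_files, mor_deltas)
```

Both strategies end with the same logical table. Copy-on-write rewrote the three rows of the affected file to change one value; merge-on-read wrote a single record and deferred the work to readers and to compaction.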
Streaming and CDC
This is the sharpest divide. Paimon is streaming-first and Flink-native, with first-class changelog producers. Hudi has deep CDC and incremental-query support and upsert DNA from its Uber origins. Iceberg is analytics-first, though it added changelog reads in v3. Delta now ships Auto CDF and row tracking in 4.1, which makes change feeds easier, but it is still not changelog-native the way Paimon and Hudi are. DuckLake is batch-first, exposing changes between snapshots rather than a streaming changelog.
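The difference between "changes between snapshots" and a changelog can be shown in a few lines. This is a toy sketch under simple assumptions: a snapshot is a dict keyed by primary key, a changelog is a list of `(op, key, value)` events, and the function names are hypothetical rather than any format's API.

```python
def diff_snapshots(before, after):
    """Batch-style change feed: compare two snapshots keyed by primary key."""
    changes = []
    for key in sorted(before.keys() | after.keys()):
        if key not in before:
            changes.append(("insert", key, after[key]))
        elif key not in after:
            changes.append(("delete", key, before[key]))
        elif before[key] != after[key]:
            changes.append(("update", key, after[key]))
    return changes


def apply_changelog(state, changelog):
    """Streaming-style change feed: every event is kept and replayed in order."""
    state = dict(state)
    for op, key, value in changelog:
        if op == "delete":
            state.pop(key, None)
        else:
            state[key] = value
    return state


snapshot_1 = {"o1": "placed", "o2": "placed"}
changelog = [
    ("update", "o1", "paid"),      # intermediate state a snapshot diff cannot see
    ("update", "o1", "shipped"),
    ("delete", "o2", "placed"),
    ("insert", "o3", "placed"),
]
snapshot_2 = apply_changelog(snapshot_1, changelog)
net_changes = diff_snapshots(snapshot_1, snapshot_2)
```

The changelog carries four events; the snapshot diff reports three net changes and never shows that order `o1` was `paid` before it was `shipped`. If downstream consumers need every transition, that is the case for a changelog-native format.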
Catalogs, governance, and partitioning
The catalog is now where the real contest sits. Both Iceberg and Delta 4.1 support catalog-managed tables, and the live rivalry is Unity Catalog versus Apache Polaris (an Apache top-level project since February 2026). Iceberg still has the richest catalog ecosystem - the REST catalog spec plus open implementations (Polaris, Nessie, Lakekeeper) and managed options (Glue, Unity, Snowflake). It also has the most flexible partitioning, with true hidden partitioning and partition-spec evolution; the other formats largely expose partition columns you manage yourself. DuckLake is catalog-native by design: the SQL database is the catalog, which is what makes its commits and query planning fast.
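Hidden partitioning and partition-spec evolution are easier to grasp with a small model. The sketch below is a toy, not Iceberg's implementation: `HiddenPartitionTable`, its methods, and the two transforms are assumptions made for illustration. The writer supplies only the timestamp column, the table derives the partition value, and a later spec change affects new writes only.

```python
from datetime import datetime

# Partition transforms: derive a partition value from a timestamp column.
TRANSFORMS = {
    "day": lambda ts: ts.strftime("%Y-%m-%d"),
    "month": lambda ts: ts.strftime("%Y-%m"),
}


class HiddenPartitionTable:
    """Toy table: the partition value is derived, never supplied by the writer."""

    def __init__(self, source_column, transform):
        self.source_column = source_column
        self.transform = transform
        self.partitions = {}  # (transform name, partition value) -> list of rows

    def evolve_spec(self, transform):
        # Spec evolution: only new writes use the new transform; old data stays put.
        self.transform = transform

    def write(self, row):
        value = TRANSFORMS[self.transform](row[self.source_column])
        self.partitions.setdefault((self.transform, value), []).append(row)

    def scan_day(self, day):
        """Filter on the source column; map the filter onto each spec's partitions."""
        scanned, hits = 0, []
        for (transform, value), rows in self.partitions.items():
            if TRANSFORMS[transform](day) != value:
                continue  # pruned without opening the partition
            scanned += 1
            hits += [r for r in rows if r[self.source_column].date() == day.date()]
        return hits, scanned


orders = HiddenPartitionTable("order_ts", "day")
orders.write({"order_id": 1, "order_ts": datetime(2026, 1, 5, 10, 0)})
orders.write({"order_id": 2, "order_ts": datetime(2026, 1, 6, 9, 0)})
orders.evolve_spec("month")  # switch from daily to monthly partitions
orders.write({"order_id": 3, "order_ts": datetime(2026, 2, 5, 12, 0)})
orders.write({"order_id": 4, "order_ts": datetime(2026, 2, 6, 8, 0)})

hits, partitions_scanned = orders.scan_day(datetime(2026, 2, 5))
```

The query filters on `order_ts` and never mentions a partition column, yet only one of the three partitions is opened. With manually managed partition columns, both the writer and the query author would have to know about a separate `order_date` column and keep it consistent.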
The format war is cooling
The most important development is not a winner - it is convergence. At the data layer the fight is effectively over: with Iceberg v3 now GA on Databricks, Snowflake, and AWS, Iceberg and Delta share Parquet, deletion vectors, variant, and row tracking. Several interop layers build on that, so you can write in one format and read in many without copying data:
- Delta UniForm lets a Delta table emit Iceberg (and Hudi) metadata over the same Parquet files, so other engines can read it without copies.
- Apache XTable (formerly OneTable, still Apache Incubating) translates metadata between Iceberg, Delta, and Hudi - vendor-neutral, no data rewrite.
- The Iceberg REST catalog is becoming the de-facto standard for governed, multi-engine access, with Apache Polaris (a top-level project since February 2026) as an open implementation.
- The next step is shared metadata. Iceberg v4 and Delta 5.0 are being designed around a common adaptive metadata tree (Delta RFC #6640), Hudi 1.1 added a pluggable table-format framework, and DuckLake takes a different route entirely - putting the whole catalog in a SQL database.
The honest caveat: convergence is real at the data-and-read layer, but on write you still commit to one format and, increasingly, one catalog. So choose for your write path and your catalog - then let UniForm, XTable, and the REST catalog handle the reads.
Which should you choose?
There is no single best format - only a best fit for your situation. Use the guide below as a starting point.
A pragmatic guide
- Vendor-neutral analytics, many engines: Iceberg. Broadest support and an open REST catalog.
- Databricks-centric shop: Delta as primary, with UniForm for Iceberg reads elsewhere.
- Streaming-first on Flink: Paimon; on Spark with strong CDC, Hudi.
- Heavy upserts / incremental on Spark: Hudi.
- DuckDB-centric, simple lakehouse: DuckLake - metadata in a SQL database, data in Parquet, very few moving parts.
- The catalog now matters more than the format: at the data layer Iceberg and Delta have converged, so invest in your catalog (Unity Catalog or Apache Polaris) and let interop handle the reads.
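The guide above can be read as a small decision function. The sketch below is only a restatement of those bullets in Python: `suggest_format` and its parameters are hypothetical, the rule order is a judgment call, and a real decision would weigh far more than three inputs.

```python
def suggest_format(primary_engine, workload, vendor_neutral=False):
    """Hypothetical helper mirroring the guide's bullets; a starting point only."""
    engine = primary_engine.lower()
    if engine == "duckdb":
        return "DuckLake"
    if engine == "flink" and workload == "streaming":
        return "Paimon"
    if engine == "spark" and workload in {"cdc", "upserts", "incremental"}:
        return "Hudi"
    if engine == "databricks" and not vendor_neutral:
        return "Delta Lake (with UniForm for Iceberg reads)"
    # Default: vendor-neutral analytics across many engines.
    return "Iceberg"


suggestions = {
    "flink streaming": suggest_format("Flink", "streaming"),
    "spark upserts": suggest_format("Spark", "upserts"),
    "databricks analytics": suggest_format("Databricks", "analytics"),
    "trino analytics": suggest_format("Trino", "analytics", vendor_neutral=True),
    "duckdb analytics": suggest_format("DuckDB", "analytics"),
}
```

Whatever the function returns, the last bullet still applies: the catalog choice deserves at least as much thought as the format.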
Create the same table in each
The same orders table, five formats. Notice how each one's defaults reveal its priorities - Iceberg's hidden partitioning, Hudi's primary key and merge-on-read, Paimon's changelog producer, DuckLake's SQL catalog.
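As a rough sketch of what those five definitions can look like, the DDL below is held as Python strings so it can be handed to whichever engine you use. Treat every detail as an assumption to check against the engine's own documentation: the column names, the metadata and data paths, the choice of Spark SQL for Iceberg, Delta, and Hudi, Flink SQL for Paimon, and DuckDB for DuckLake, and the specific table properties. Catalog and session setup is omitted.

```python
COLUMNS = """
  order_id    BIGINT,
  customer_id BIGINT,
  amount      DECIMAL(10, 2),
  order_ts    TIMESTAMP"""

DDL = {
    # Spark SQL. Hidden partitioning: the partition is a transform of order_ts.
    "Iceberg": f"""CREATE TABLE orders ({COLUMNS}
) USING iceberg
PARTITIONED BY (days(order_ts))""",

    # Spark SQL. The partition column is an explicit column the writer fills in.
    "Delta Lake": f"""CREATE TABLE orders ({COLUMNS},
  order_date  DATE
) USING delta
PARTITIONED BY (order_date)""",

    # Spark SQL. Primary key plus merge-on-read table type.
    "Hudi": f"""CREATE TABLE orders ({COLUMNS}
) USING hudi
TBLPROPERTIES (
  type = 'mor',
  primaryKey = 'order_id',
  preCombineField = 'order_ts'
)""",

    # Flink SQL. Primary-key table with a changelog producer.
    "Paimon": """CREATE TABLE orders (
  order_id    BIGINT,
  customer_id BIGINT,
  amount      DECIMAL(10, 2),
  order_ts    TIMESTAMP(3),
  PRIMARY KEY (order_id) NOT ENFORCED
) WITH (
  'changelog-producer' = 'lookup'
)""",

    # DuckDB. The attached SQL database is the catalog; data lands as Parquet.
    "DuckLake": f"""ATTACH 'ducklake:metadata.ducklake' AS lake (DATA_PATH 'data/');
CREATE TABLE lake.orders ({COLUMNS}
)""",
}

for name, statement in DDL.items():
    print(f"-- {name}\n{statement}\n")
```

The distinguishing clause in each case is short: `days(order_ts)` for Iceberg, the explicit `order_date` column for Delta, `primaryKey` and `type = 'mor'` for Hudi, `'changelog-producer'` for Paimon, and the `ATTACH` of a metadata database for DuckLake.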