ClickHouse, explained interactively

What ClickHouse is (and is not)

ClickHouse is a columnar OLAP database built for real-time analytics: counting, summing, and grouping over huge event and time-series datasets at interactive speed. The single most important idea is how it stores data. A row store keeps each row's fields together - perfect for "fetch everything about order 123". A column store keeps each column together, so a query that aggregates one column reads only that column.

Toggle the layout below and run different queries. Watch how much data each one has to read.

Row store vs column store

Query

Columnar layout also compresses far better - a column holds one data type with lots of repetition, often 20-50x smaller than a row store. Less data on disk plus vectorised (SIMD) execution is the whole speed story.

The flip side: ClickHouse is not an OLTP database. It is built for batch inserts and append-only analytics, not high-rate single-row updates, deletes, or multi-row transactions. Reach for Postgres or similar when you need those.

MergeTree: how data is stored

The workhorse engine is MergeTree. Data lives in immutable, sorted parts. You choose an ORDER BY (the sorting key) and optionally a PARTITION BY. Crucially, the primary index is sparse: ClickHouse stores one index mark per granule of rows (8,192 by default), recording the granule's boundary values - not one entry per row. It is nothing like an OLTP B-tree, and it is not unique.

That sparse index is how a query reads so little. First partition pruning drops whole partitions; then a binary search on the index narrows to a few granules; only those are read. Try it:

Sparse primary index & pruning

Month user_id =

Partitioned by month; within each part, rows are sorted by user_id (index_granularity 8,192). Each granule's mark shows its minimum user_id. Beyond this, data-skipping indexes (minmax, set, bloom_filter) and PREWHERE can prune even further.

The write path: parts & merges

Every insert writes a new immutable part - so you batch inserts (or rely on async inserts, which are on by default as of the 26.3 LTS) rather than writing rows one at a time. In the background, ClickHouse continuously merges small sorted parts into larger ones, much like an LSM-tree. Fewer parts means a query has less to read and combine.

Parts & merges

Inserts create parts; merges combine them into fewer, larger sorted parts. Deletes and updates never edit a part in place: a heavy ALTER TABLE ... UPDATE/DELETE mutation rewrites whole parts, while lightweight UPDATE/DELETE write small patch parts (or flip a row mask) that are applied at read time and folded in by later merges.

Lightweight updates: patch parts

ClickHouse is still built for analytics, not OLTP - but it now handles corrections gracefully. A heavy ALTER TABLE ... UPDATE (or DELETE) mutation rewrites whole columns in every affected part, so it is best kept for bulk backfills. For small, frequent fixes there are lightweight operations. A lightweight DELETE FROM ... WHERE (production-ready and on by default since 23.3) just flips a hidden _row_exists mask, and the standard SQL UPDATE ... SET ... WHERE (introduced in 25.7, beta and on by default since 25.8) writes a small patch part holding only the changed columns and rows. Both are applied at read time and materialised by a later merge - cheap to write, with a small read-time cost until that merge runs. Try it below.

Lightweight update vs heavy mutation

Stored on disk

SELECT sees (read time)

Illustrative orders part of six rows. A lightweight UPDATE adds a small patch part (one changed row) that the read overlays immediately; a heavy mutation rewrites all six rows; a merge folds the patches back into the part. Watch the rows-written tally.

Table engines: the MergeTree family

Merges do more than tidy up - specialised engines resolve rows by key as they merge. Pick an engine, insert some rows with duplicate keys, then merge and watch what each one does.

Table engines: what merge resolves

Inserted rows (parts)

After merge

There are more engines beyond these: CollapsingMergeTree / VersionedCollapsing for sign-based cancellation, plus Replicated versions of each for high availability, and integration engines (Distributed, Kafka, S3, Dictionary) that connect ClickHouse to the outside world. The recurring gotcha: because resolution happens at merge time, a plain SELECT can still see un-merged duplicates - use FINAL or aggregate in the query when you need an exact answer immediately. And for the occasional ad-hoc correction, a lightweight UPDATE/DELETE (above) is often simpler than modelling it as a ReplacingMergeTree upsert.

Advanced features worth knowing

Once the basics click, these are the features that make ClickHouse powerful in production:

Materialized views - insert-triggered and incremental. Paired with AggregatingMergeTree and the -State/-Merge combinators, they maintain roll-ups (daily revenue, unique users, p95) as data lands. Refreshable materialized views add scheduled rebuilds.
Projections - alternate sort orders or pre-aggregations stored inside a table, which the optimiser can pick automatically.
Data-skipping indexes - minmax, set, and bloom_filter indexes prune granules for filters that are not on the sorting key.
Dictionaries - in-memory key-value lookups for fast joins/enrichment.
Compression codecs - LZ4 (default) or ZSTD, stacked on column codecs like Delta, DoubleDelta, and Gorilla for time-series, plus the ALP adaptive lossless codec (26.3) for floating-point columns and BFloat16 storage for ML vectors.
Native vector search - an HNSW vector_similarity index (native since 25.8) over Array(Float32)/Array(BFloat16) columns, with quantization (bf16 by default, down to binary) and the QBit type for choosing precision at query time. The legacy annoy/usearch index types were removed in its favour. It powers approximate-nearest-neighbour search in plain SQL.
Data-lake integration - ClickHouse reads (and increasingly writes) Iceberg, Delta Lake, and Hudi tables and remote Parquet directly, with Iceberg metadata prefetching and large speed-ups on big remote scans - a bridge to the lakehouse.

Scaling and the 2026 picture

For high availability, ReplicatedMergeTree coordinates replicas through ClickHouse Keeper (the built-in replacement for ZooKeeper), and the Distributed engine fans queries across shards. The bigger shift is SharedMergeTree in ClickHouse Cloud: data lives on object storage and compute is separated from storage, so you scale readers independently and cheaply, with parallel replicas speeding up single queries. The 26.x line (26.3 is the current LTS) turned async inserts on by default and keeps widening the engine: the JSON, Dynamic, and Variant types are production-ready (since 25.3), native vector search is built in (since 25.8), and lightweight updates (beta) make small corrections cheap - so it is no longer accurate to treat ClickHouse as strictly append-only.

Writing efficient queries

Most ClickHouse performance problems are query or schema problems. The high-leverage habits:

Match the sorting key to your filters. The columns you filter and group by most should lead the ORDER BY, so the sparse index can prune.
Read fewer columns. Avoid SELECT * - name the columns you need; that is the whole point of columnar.
Use PREWHERE (or let ClickHouse move filters there) so cheap predicates run before reading wide columns.
Push work into materialized views for repeated aggregations instead of recomputing them on every dashboard load.
Be careful with FINAL and high-cardinality GROUP BY - both can be expensive; prefer aggregate-on-read patterns and low-cardinality keys.
Prefer lightweight UPDATE/DELETE for small corrections. Use UPDATE ... SET ... WHERE or DELETE FROM ... WHERE (patch parts or a row mask, applied at read time) for ad-hoc fixes, and reserve heavy ALTER TABLE ... UPDATE/DELETE mutations for big backfills.
Batch inserts. Many tiny inserts create many parts and heavy merge pressure; insert in larger blocks or use async inserts.

-- A MergeTree table for event analytics
CREATE TABLE events (
    ts       DateTime              CODEC(DoubleDelta, LZ4),
    user_id  UInt32               CODEC(Delta, LZ4),
    country  LowCardinality(String),
    event    LowCardinality(String),
    amount   Decimal(10, 2)
)
ENGINE = MergeTree
ORDER BY (toDate(ts), user_id)     -- sorting key = the sparse primary index
PARTITION BY toYYYYMM(ts)          -- prune whole months
TTL ts + INTERVAL 90 DAY           -- drop old partitions automatically
SETTINGS index_granularity = 8192;

-- Read only what you need; filter on the sorting key
SELECT
    toStartOfDay(ts)        AS day,
    country,
    count()                 AS events,
    sum(amount)             AS revenue,
    quantile(0.95)(amount)  AS p95
FROM events
PREWHERE event = 'purchase'        -- cheap filter, evaluated first
WHERE ts >= '2026-03-01' AND ts < '2026-04-01'
GROUP BY day, country
ORDER BY day, revenue DESC;

-- Incremental roll-up: AggregatingMergeTree + state combinators
CREATE TABLE events_daily (
    day        Date,
    country    LowCardinality(String),
    revenue    AggregateFunction(sum, Decimal(10, 2)),
    p95        AggregateFunction(quantile(0.95), Decimal(10, 2)),
    uniq_users AggregateFunction(uniq, UInt32)
) ENGINE = AggregatingMergeTree
ORDER BY (day, country);

CREATE MATERIALIZED VIEW events_daily_mv TO events_daily AS
SELECT toDate(ts) AS day, country,
       sumState(amount)            AS revenue,
       quantileState(0.95)(amount) AS p95,
       uniqState(user_id)          AS uniq_users
FROM events
GROUP BY day, country;

-- Finalise the states at read time with -Merge
SELECT day, country,
       sumMerge(revenue)        AS revenue,
       quantileMerge(0.95)(p95) AS p95,
       uniqMerge(uniq_users)    AS users
FROM events_daily
GROUP BY day, country;

-- Data-skipping indexes prune granules for non-sort-key filters
ALTER TABLE events
    ADD INDEX idx_country country TYPE set(100) GRANULARITY 4,
    ADD INDEX idx_amount  amount  TYPE minmax  GRANULARITY 1;

-- A filter on country can now skip granules whose value-set excludes it
SELECT sum(amount) FROM events
WHERE country = 'DE' AND ts >= '2026-03-01';

-- Native vector search: an HNSW index over an embedding column
CREATE TABLE docs (
    id        UInt64,
    title     String,
    embedding Array(Float32),
    INDEX idx_vec embedding
        TYPE vector_similarity('hnsw', 'cosineDistance', 768, 'bf16')  -- type, distance, dims, quantization
)
ENGINE = MergeTree
ORDER BY id;

-- Approximate nearest-neighbour search in plain SQL
WITH {q: Array(Float32)} AS query_vec
SELECT id, title, cosineDistance(embedding, query_vec) AS dist
FROM docs
ORDER BY dist          -- the HNSW index serves this ORDER BY ... LIMIT
LIMIT 10;

Key takeaways

ClickHouse is a columnar OLAP database: queries read only the columns they need, which (with great compression and vectorised execution) is the source of its speed. It is not for OLTP.
MergeTree stores immutable sorted parts; the primary index is sparse (one mark per ~8,192-row granule), so partition pruning + granule skipping read a tiny slice.
Inserts create parts; background merges combine them - batch your inserts, and remember fewer parts means faster reads.
Engines resolve rows on merge: ReplacingMergeTree (latest), SummingMergeTree (sum), AggregatingMergeTree (states). Use FINAL or aggregate when you need exact results before a merge.
Write efficient queries: match the ORDER BY to your filters, select only needed columns, use PREWHERE, and push repeated aggregations into materialized views.

Check your understanding

Five quick questions.