---
date: 2021-10-26
title: How to speed up ClickHouse queries using materialized columns
rootPage: /blog
sidebar: Blog
showTitle: true
hideAnchor: true
author:
  - karl-aksel-puulmann
featuredImage: https://res.cloudinary.com/dmukukwp6/image/upload/posthog.com/contents/images/blog/posthog-engineering-blog.png
featuredImageType: full
category: Engineering
tags:
  - Guides
  - ClickHouse
---

ClickHouse can speed up queries with materialized columns, which create new columns on the fly from existing data. In this post, I'll walk through a query optimization example that's well-suited to this rarely-used feature.

Consider the following schema:

CREATE TABLE events (
    uuid UUID,
    event VARCHAR,
    timestamp DateTime64(6, 'UTC'),
    properties_json VARCHAR
)
ENGINE = MergeTree()
ORDER BY (toDate(timestamp), event, uuid)
PARTITION BY toYYYYMM(timestamp)

Each event has an ID, event type, timestamp, and a JSON representation of event properties. The properties can include the current URL and any other user-defined properties that describe the event (e.g. NPS survey results, person properties, timing data, etc.).

This table can be used to store a lot of analytics data and is similar to what we use at PostHog.

If we wanted to query login page pageviews in August, the query would look like this:

SELECT count(*)
FROM events
WHERE event = '$pageview'
  AND JSONExtractString(properties_json, '$current_url') = 'https://app.posthog.com/login'
  AND timestamp >= '2021-08-01'
  AND timestamp < '2021-09-01'

This query takes a while to complete on a large test dataset, but without the URL filter the query is almost instant. Adding even more filters slows the query down further. Let's dig in to understand why.

Looking at flamegraphs

ClickHouse has great tools for introspecting queries. Looking at system.query_log we can see that the query:

  • Took 3,433 ms
  • Read 79.17 GiB from disk
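
Numbers like these can be pulled out of system.query_log with a query along the following lines. This is a minimal sketch, not necessarily the exact query used for this post: the LIKE pattern is an assumption for picking out the slow query by its text, and it assumes query logging is enabled on the server. Note that read_bytes reports the amount of data the query read, so treat it as an approximation of disk traffic.

```sql
-- Sketch: show the most recent finished runs of the slow query.
SELECT
    event_time,
    query_duration_ms,                              -- wall-clock duration in milliseconds
    read_rows,                                      -- rows read from tables
    formatReadableSize(read_bytes) AS read_size     -- bytes read, human-readable
FROM system.query_log
WHERE type = 'QueryFinish'                          -- only successfully completed queries
  AND query LIKE '%JSONExtractString(properties_json%'  -- assumed pattern matching our query
  AND query NOT LIKE '%system.query_log%'           -- skip this introspection query itself
ORDER BY event_time DESC
LIMIT 5
```

Running the original query with and without the URL filter and comparing the two rows here makes the cost of the JSON filter easy to see.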

To dig even deeper, we can use clickhouse-flamegraph to peek into what the CPU did during query execution.