| date | title | rootPage | sidebar | showTitle | hideAnchor | author | featuredImage | featuredImageType | category | tags |
|---|---|---|---|---|---|---|---|---|---|---|
| 2021-10-26 | How to speed up ClickHouse queries using materialized columns | /blog | Blog | true | true | | https://res.cloudinary.com/dmukukwp6/image/upload/posthog.com/contents/images/blog/posthog-engineering-blog.png | full | Engineering | |
ClickHouse supports speeding up queries using materialized columns to create new columns on the fly from existing data. In this post, I’ll walk through a query optimization example that's well-suited to this rarely-used feature.
Consider the following schema:
CREATE TABLE events (
uuid UUID,
event VARCHAR,
timestamp DateTime64(6, 'UTC'),
properties_json VARCHAR
)
ENGINE = MergeTree()
ORDER BY (toDate(timestamp), event, uuid)
PARTITION BY toYYYYMM(timestamp)
Each event has an ID, event type, timestamp, and a JSON representation of event properties. The properties can include the current URL and any other user-defined properties that describe the event (e.g. NPS survey results, person properties, timing data, etc.).
This table can be used to store a lot of analytics data and is similar to what we use at PostHog.
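To make the shape of a row concrete, here is a minimal sketch of inserting a single pageview event into the table above. The property values are hypothetical and chosen only for illustration; `$browser` is an assumed extra property, not one taken from the schema.

```sql
-- Hypothetical example row: the JSON payload and timestamp are made up for illustration.
INSERT INTO events (uuid, event, timestamp, properties_json) VALUES
(
    generateUUIDv4(),                -- random event ID
    '$pageview',                     -- event type
    '2021-08-15 12:00:00.000000',    -- stored as DateTime64(6, 'UTC')
    '{"$current_url": "https://app.posthog.com/login", "$browser": "Chrome"}'
)
```

Note that everything except the ID, event type, and timestamp lives inside the single `properties_json` string.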
If we wanted to count pageviews of the login page in August 2021, the query would look like this:
SELECT count(*)
FROM events
WHERE event = '$pageview'
AND JSONExtractString(properties_json, '$current_url') = 'https://app.posthog.com/login'
AND timestamp >= '2021-08-01'
AND timestamp < '2021-09-01'
This query takes a while to complete on a large test dataset, but without the URL filter it is almost instant. Adding more filters on properties slows the query down even further. Let's dig in to understand why.
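For comparison, the fast variant mentioned above is simply the same query with the `JSONExtractString` condition removed. A sketch:

```sql
-- Same count as before, but without the URL filter on properties_json.
-- Only the event and timestamp columns are needed to evaluate the WHERE clause.
SELECT count(*)
FROM events
WHERE event = '$pageview'
  AND timestamp >= '2021-08-01'
  AND timestamp < '2021-09-01'
```

The only difference between the two queries is whether `properties_json` has to be read and parsed for each candidate row.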
Looking at flamegraphs
ClickHouse has great tools for introspecting queries. Looking at system.query_log we can see that the query:
- Took 3,433 ms
- Read 79.17 GiB from disk
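Figures like the ones above can be looked up with a query along these lines. This is a sketch, not necessarily the exact query used for this post: the `LIKE` pattern is an assumption for locating the query text and may need adjusting.

```sql
-- Find the most recent finished run of the slow query in system.query_log.
-- The log is flushed periodically; SYSTEM FLUSH LOGS forces a flush if the row is missing.
SELECT
    query_duration_ms,                               -- wall-clock duration in milliseconds
    formatReadableSize(read_bytes) AS read_size,     -- bytes read, human-readable
    read_rows
FROM system.query_log
WHERE type = 'QueryFinish'
  AND query LIKE '%JSONExtractString(properties_json%'
  AND query NOT LIKE '%system.query_log%'            -- exclude this lookup itself
ORDER BY event_time DESC
LIMIT 1
```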
To dig even deeper, we can use clickhouse-flamegraph to peek into what the CPU did during query execution.