13 comments

  • nrjames 4 hours ago

    I went down this road with ClickHouse and spent 18 months setting it up for realtime analytics. I'm not sure it's going to stick, primarily because our Data Transformation & Aggregation steps require some gnarly joins across large tables, and ClickHouse does not handle large joins well. The consequence is that the aggregation has to happen in a separate system (currently Snowflake), and when what we were processing changed, it sometimes required gymnastics both in the aggregation layer and inside ClickHouse to accommodate the change. Denormalizing was rife with tradeoffs, mostly just to make ClickHouse happy. On top of that, we leaned heavily on projections for performance, which is wonderfully automated but also means waiting on unpredictable background processing during backloads, etc.
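
    Roughly, the projection pattern I mean looks like this (sketched with the clickhouse-connect Python client; the events table, columns, and connection details here are made up for illustration, not our actual schema):

        import clickhouse_connect

        # Hypothetical connection and table; this only illustrates the projection workflow.
        client = clickhouse_connect.get_client(host='localhost', username='default', password='')

        # Add an aggregating projection to a wide events table so common
        # GROUP BY queries can read pre-aggregated data instead of raw rows.
        client.command("""
            ALTER TABLE events ADD PROJECTION daily_revenue
            (
                SELECT country, toDate(ts) AS day, count() AS events, sum(revenue) AS revenue
                GROUP BY country, day
            )
        """)

        # Backfill the projection for existing parts. This runs as a background
        # mutation, which is the unpredictable processing you end up waiting on
        # during backloads.
        client.command("ALTER TABLE events MATERIALIZE PROJECTION daily_revenue")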

    We might stick with ClickHouse, but after working with it for a year and a half, I'm curious to see whether a system that handles joins more effectively would be a better fit. To that end, my next R&D project is to set up a vertical slice of our analytics on Apache Doris to see how well it handles a similar workload.

    • Merick 3 hours ago

      Curious, since you mention Doris: have you looked into StarRocks?

      • nrjames 3 hours ago

        Not yet, but it's on the list! This is R&D work that I'm doing on the side, when I have time. Do you prefer StarRocks to Doris?

      • DataWizard1986 3 hours ago

        +1 to StarRocks. You can cut out a lot of the denormalization weight that ClickHouse almost forces on you, and it scales to very large cluster sizes as well.

    • apwell23 4 hours ago

      Low latency and high concurrency on large data is a hard problem.

      • nrjames 4 hours ago

        We also have a budget issue with Snowflake, so it would help to aggregate elsewhere.

        • miked98 3 hours ago

          In our experience, the primary driver of Snowflake costs is not the compute for aggregation, but the compute required for lots of reads/scans.

          We recently built a Snowflake-to-ClickHouse pipeline for a customer where aggregates are built hourly in Snowflake, then pushed into a ClickHouse table to power their user-facing dashboards.

          By offloading dashboard queries to ClickHouse, they slashed their Snowflake bill by ~25%, which was worth millions to them.

          (Admittedly, running aggregations elsewhere—for example, in Spark—could further reduce costs, but you would then need Iceberg to make the tables queryable in Snowflake.)
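
          For concreteness, the hourly push looks roughly like this (a sketch using the Snowflake and ClickHouse Python clients; the credentials, tables, and query are illustrative, not the customer's actual pipeline):

              import os

              import clickhouse_connect
              import snowflake.connector

              # Hypothetical connections; the real pipeline's details differ.
              sf = snowflake.connector.connect(
                  account='my_account',
                  user='etl_user',
                  password=os.environ['SNOWFLAKE_PASSWORD'],
                  warehouse='ETL_WH',
              )
              ch = clickhouse_connect.get_client(host='clickhouse.internal', username='default', password='')

              # 1. Build the hourly aggregate in Snowflake: one scheduled scan per hour,
              #    instead of every dashboard query scanning the raw table.
              cur = sf.cursor()
              cur.execute("""
                  SELECT customer_id,
                         DATE_TRUNC('hour', event_ts) AS hour,
                         COUNT(*) AS events,
                         SUM(revenue) AS revenue
                  FROM raw.events
                  WHERE event_ts >= DATEADD(hour, -1, CURRENT_TIMESTAMP())
                  GROUP BY 1, 2
              """)
              rows = cur.fetchall()

              # 2. Push the pre-aggregated rows into ClickHouse; the user-facing
              #    dashboards query this table instead of Snowflake.
              ch.insert('dashboard_hourly', rows,
                        column_names=['customer_id', 'hour', 'events', 'revenue'])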

          • nrjames 2 hours ago

            I'm in an enterprise environment where a central IT platform team controls what size warehouses we can have in Snowflake, and they are not receptive to arguments for larger warehouses, unfortunately. Our issue is long-running queries, because Snowflake spills data to disk during the joins. TBH, I could join the data more quickly on my laptop than in the warehouse I'm allowed to use. Anyhow, I do have an old build server that is beefy and has 512 GB of RAM, so I can set up my aggregation and/or OLAP services there, since it's an unencumbered playground.

  • sails 4 hours ago

    Simon is one of the few people exploring the contemporary challenges of data modelling in a meaningful way.

    • apwell23 3 hours ago

      Is there anyone I can learn from about the latest challenges in data modeling?

  • unixhero an hour ago

    What even is ClickHouse?