> The SQL above results in a plan similar to the DuckDB optimized plan, but it is wordier and more error-prone to write, which can potentially lead to bugs.
FWIW, aside from manual filter pushdown, I consider the JOIN variant the canonical / "default" way to merge multiple tables; it keeps all the join-related logic in one place, while mixing both joining conditions and filtering conditions in WHERE always felt more error-prone to me.
It is also the only way to represent join conditions for outer joins.
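To make the outer-join point concrete, here is a minimal sketch. It uses Python's built-in `sqlite3` rather than DuckDB so it runs without extra installs, and the `customers` / `orders` tables and their rows are made up for illustration. The SQL itself is plain enough that I'd expect the same behavior in DuckDB, but I haven't run it there.

```python
import sqlite3

# Hypothetical toy schema, not from the article.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, status TEXT);
    INSERT INTO customers VALUES (1, 'ada'), (2, 'bob');
    INSERT INTO orders VALUES (10, 1, 'shipped'), (11, 2, 'cancelled');
""")

# Condition in ON: it is part of the join, so customers without a
# matching shipped order are kept and padded with NULL.
on_rows = conn.execute("""
    SELECT c.name, o.id
    FROM customers AS c
    LEFT JOIN orders AS o
      ON o.customer_id = c.id AND o.status = 'shipped'
    ORDER BY c.name
""").fetchall()

# Same condition in WHERE: it runs after the join and drops every row
# where o.status is not 'shipped', including the NULL-padded ones.
where_rows = conn.execute("""
    SELECT c.name, o.id
    FROM customers AS c
    LEFT JOIN orders AS o
      ON o.customer_id = c.id
    WHERE o.status = 'shipped'
    ORDER BY c.name
""").fetchall()

print(on_rows)     # [('ada', 10), ('bob', None)]
print(where_rows)  # [('ada', 10)]
```

With an inner join the two placements give the same rows, which is why the WHERE style works there at all. With a LEFT JOIN, moving the condition into WHERE quietly turns it back into an inner join.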
> This means your optimizations need to be applied by hand, which is sustainable if your data starts changing.
Seems like a missing "un" here
Compelling article! I've already found DuckDB to be the most ergonomic tool for quick and dirty wrangling; it's good to know it can handle massive jobs too.
I regularly use DuckDB on datasets of 1B+ rows, with nasty string columns that may be over 10MB per value in the outliers. Mostly it just works, and fast too! When it doesn't, I'll usually just dump to parquet and hit it with sparksql, but that is the exception rather than the rule.