What I love about duckdb:
-- Support for .parquet, .json, .csv (note: Spotify listening history comes in multiple .json files, something fun to play with).
-- Support for glob reading, like: select * from 'tsa20*.csv' - so you can read hundreds of files (any type of file!) as if they were one file.
-- If the files don't have the same schema, union_by_name is amazing (see the sketch after this list).
-- The .csv parser is amazing. It auto-assigns types well.
-- It's small! The WebAssembly version is 2 MB! The CLI is 16 MB.
-- Because it is small you can add duckdb directly to your product, like Malloy has done: https://www.malloydata.dev/ - I think of Malloy as a technical person's alternative to PowerBI and Tableau, but it uses a semantic model that helps AI write amazing queries on your data.
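For anyone who hasn't tried it, a minimal sketch of the glob + union_by_name pattern (the file names here are made up):

    -- read hundreds of CSVs as one table, even if their columns differ
    SELECT * FROM read_csv('tsa20*.csv', union_by_name = true);

    -- a folder of Spotify listening-history exports works the same way
    SELECT * FROM read_json_auto('StreamingHistory*.json');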
I'd say the author's thoughts are valid for basic data processing. Outside of that, most of the claims in this article, such as:
"We're moving towards a simpler world where most tabular data can be processed on a single large machine and the era of clusters is coming to an end for all but the largest datasets."
become very debatable. Depending on how you want to pivot/scale/augment your data, even datasets that seemingly "fit" on large boxes will quickly OOM you.
The author also has another article where they claim that:
"SQL should be the first option considered for new data engineering work. It’s robust, fast, future-proof and testable. With a bit of care, it’s clear and readable." (over polars/pandas etc)
This does not map to my experience at all, outside of the realm of nicely parsed datasets that don't require too much complicated analysis or augmentation.
SQL is popular because everyone can learn and start using it after a while. I agree that Python sometimes is a better tool, but I don't see SQL going away anytime soon.
From my experience, the data modelling side is still overwhelmingly in SQL. The ingestion side is definitely mostly Python/Scala though.
I’ve not used duckdb before nor do I do much data analysis so I am curious about this one aspect of processing medium sized json/csv with it: the data are not indexed, so any non-trivial query would require a full scan. Is duckdb so fast that this is never really a problem for most folks?
Zonemaps are created for columns automatically. I process somewhat large tables w/ duckdb regularly (100M rows) and never have any problems.
that's true for duckdb native tables, but the question was about json.
But once your json or csv is imported into a columnar layout, if you have say 10 columns, each column is stored separately on disk instead of all together. So a scan for one column only needs to read a tenth of the disk space used for the data. Obviously this depends on the columns' content.
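A minimal sketch of that in practice (invented file/column names); importing once into a DuckDB table or Parquet means later queries only read the columns they touch:

    -- one-time import; DuckDB builds zonemaps on the table automatically
    CREATE TABLE events AS SELECT * FROM read_csv('events.csv');

    -- or keep it as a columnar file on disk
    COPY events TO 'events.parquet' (FORMAT parquet);

    -- only has to read the user_id column, not the whole table
    SELECT count(*) FROM events WHERE user_id = 42;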
Not a duckdb user, but I use polars a lot (mentioned in the article).
Depends on your definition of medium sized, but for tables of hundreds of thousands of rows and ~30 columns, these tools are fast enough to run queries instantly or near instantly even on laptop CPUs.
I guess the question is: how much is medium? DuckDB can handle quite a lot of data without breaking a sweat. Certainly if you prefer writing SQL for certain things, it's a no-brainer.
"making DuckDB potentially a suitable replacement for lakehouse formats such as Iceberg or Delta lake for medium scale data" > I'm a Data Engineering noob, but DuckDB alone doesn't do metadata & catalog management, which is why they've also introduce DuckLake.
Related question, curious as to your experience with DuckLake if you've used it. I'm currently setting up S3 + Iceberg + DuckDB for my company (startup) and was wondering what to pick between Iceberg and DuckLake.
DuckLake is pretty new, so I guess it would depend on if you need a more mature, fully-featured app.
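If it helps the comparison: DuckDB can already read Iceberg tables through its iceberg extension, roughly like this (path invented, credentials and catalog setup not shown):

    INSTALL iceberg;
    LOAD iceberg;
    SELECT count(*) FROM iceberg_scan('s3://warehouse/db/events');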
Being able to use SQL on CSV and json/jsonl files is pretty sweet. Of course it does much more than that, but that's what I do most often with it. Love duckdb.
Indeed! I generally like awk a lot for simpler CSV/TSV processing, but when it comes to cases where you need things like combining/joining multiple CSV files or aggregating for certain columns, SQL really shines IME.
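A tiny illustration of that (file and column names invented), querying two CSVs directly as tables:

    -- total spend per customer across two raw CSV files
    SELECT c.name, sum(o.amount) AS total
    FROM 'orders.csv' o
    JOIN 'customers.csv' c ON c.id = o.customer_id
    GROUP BY c.name
    ORDER BY total DESC;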
I think my favorite part of duckdb is its flexibility. It's such a handy little Swiss Army knife for doing analytical processing in scientific environments (messy data w/ many formats).
Agree with the author, will add: duckdb is an extremely compelling choice if you’re a developer and want to embed analytics in your app (which can also run in a web browser with wasm!)
Think this opens up a lot of interesting possibilities like more powerful analytics notebooks like marimo (https://marimo.io/) … and that’s just one example of many.
The wasm build is pretty heavy data-wise, though; I'm hoping it'll eventually get lighter for easier loading on less capable devices.
That's really interesting; I love the idea of being able to use columnar support directly within PostgreSQL.
I was thinking of using Citus for this, but duckdb may be a better way to do it. Citus comes with a lot more out of the box, but duckdb could be a good stepping stone.
It's a really handy tool. I've queried basically everything you can w/ duckdb - csv, json, s3 buckets, MS SQL servers, excel sheets, pandas dataframes, etc - and have had very few issues.
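For the s3 case, a rough sketch (bucket and path invented; credentials still need to be configured separately):

    INSTALL httpfs;
    LOAD httpfs;
    SELECT * FROM read_parquet('s3://my-bucket/events/*.parquet') LIMIT 10;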
DuckDB has experimental builds for Android ... I'm wondering how much work it would take to implement a Java API for it similar to sqlite (Cursor, etc).