I think the competition for the future is between DuckDB and Polars. Will we stick with the DataFrame model, made feasible by Polars's lazy execution, or will we go with in-process SQL a la DuckDB? Personally I've been using DuckDB because I already know SQL (and DuckDB provides persistence if I need it) and don't want to learn a new DataFrame DSL, but I'd love to hear about other people's experiences.
I’d recommend using the Polars SQL context manager if you want to defer learning how to do everything through their API. The API is a big enough shift from pandas that it took me a minute to figure out, but I really enjoy having the choice to stay in dataframe methods or switch to SQL-only transformations. It has global state too if that’s needed. I like that it isn’t an RDBMS but provides all of the SQL I use.
https://docs.pola.rs/api/python/stable/reference/sql/python_...
https://docs.pola.rs/api/python/stable/reference/sql/python_...
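A minimal sketch of how that looks (table name and columns are made up; the dataframe-API equivalent is shown for comparison):

    import polars as pl

    df = pl.DataFrame({"city": ["NYC", "SF", "NYC"], "amount": [10, 20, 30]})

    # Register the frame under a table name and query it with SQL.
    ctx = pl.SQLContext(sales=df)
    by_city = ctx.execute(
        "SELECT city, SUM(amount) AS total FROM sales GROUP BY city"
    ).collect()

    # The same transformation through the dataframe API.
    by_city_df = df.group_by("city").agg(pl.col("amount").sum().alias("total"))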
I've written a fair bit of PySpark code and Polars's syntax feels fairly similar, but it also offers a limited SQL dialect.
I'm very split. There's a lot of interactive exploration and data transformation that SQL lends itself to poorly (try transposing in SQL - not fun!), but I really like the idea of a data system that is language-agnostic like DuckDB.
I really like the dataframe approach. I think it’s because I like REPL-driven development, where I can drop into the REPL and work through how to transform the data interactively.
To be fair, it can nearly always be done in SQL also (unless it’s ML or some Python-specific thing like that), but the SQL with nested queries and numerous CTEs is harder for me to wrap my brain around.
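e.g. (made-up columns), the same two-step transformation reads as a method chain on one side and as a CTE on the other; DuckDB can query the Polars frame in place here:

    import duckdb
    import polars as pl

    df = pl.DataFrame({"customer": ["a", "a", "b"], "spend": [5.0, 7.0, 3.0]})

    # Dataframe style: each step is one more link in the chain.
    top = (
        df.group_by("customer")
          .agg(pl.col("spend").sum().alias("total"))
          .filter(pl.col("total") > 5)
          .sort("total", descending=True)
    )

    # Equivalent SQL: the intermediate step becomes a CTE.
    top_sql = duckdb.sql("""
        WITH totals AS (
            SELECT customer, SUM(spend) AS total FROM df GROUP BY customer
        )
        SELECT * FROM totals WHERE total > 5 ORDER BY total DESC
    """).pl()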
If I were betting, I’d pick DuckDB, because DuckDB seems more able to implement something Polars-like than Polars is to implement something DuckDB-like.
I'm with you. I also like the IDE niceties like autocomplete and docs-on-hover that don't really work with SQL.
I'm hoping someone writes a Python LSP that understands DuckDB SQL.
I use DuckDB and I typically write correct SQL, but having LSP assistance would greatly enhance my quality of life.
I am just using DuckDB on a 3TB dataset on a beefy EC2 instance, and am pleasantly surprised at its performance on such a large table. I had to do some sharding, to be sure, but am able to match the performance of Snowflake and other cluster-based systems on this single machine.
To clarify, ClickHouse will likely match this performance as well, but doing things on a single machine looks sexier to me than it has in decades.
Where does your data reside? Is it on an attached EBS volume, in S3, or somewhere else?
I had some spare time and tinkered with DuckDB on a 70GB dataset, but just getting the 70GB onto the EC2 instance took hours. It would be pretty rocking if the DuckDB team could somehow set up a ~1TB demo that anyone can set up and try for themselves in, say, under an hour.
Local drives. DON'T USE EBS! You’ll incur a huge IO charge. You have to choose instances with attached NVMe storage, which means one of the storage-optimized instances.
Reading the data off S3 will mean you’ll be slower than offerings like Snowflake. Snowflake has optimized the crap out of doing analytics on S3, so you can’t beat it with something as simple as DuckDB.
Importantly, you need the data in some distributed format like Parquet or split CSVs; otherwise DuckDB can’t read it in parallel.
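For example (made-up paths), a glob over many Parquet files gives DuckDB independent units of work to scan in parallel:

    import duckdb

    con = duckdb.connect()
    # Each file (and each row group within it) can be scanned by a separate thread.
    daily = con.sql("""
        SELECT event_date, COUNT(*) AS n
        FROM read_parquet('data/events/*.parquet')
        GROUP BY event_date
    """)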
Hi – DuckDB Labs devrel here. It's great that you find DuckDB useful!
On the setup side, I agree that local (instance-attached) disks should be preferred, but does EBS incur an IO fee? It adds significant latency for sure, but it doesn't have per-operation pricing:
> I/O is included in the price of the volumes, so you pay only for each GB of storage you provision.
(https://aws.amazon.com/ebs/pricing/)
Also, I learned that Hive-partitioned Parquet on S3 is much slower than on disk.
S3 is high-latency unless you use S3 Express One Zone (the low-latency version).
We used EFS (not EBS) and it was much faster.
Test out the NVMe drives though. It’s blazing.
I tried spreading a large dataset into thousands of files on S3 and using Step Functions Distributed Map to launch thousands of Lambda instances to process those files in parallel, using DuckDB (or other libs) in Lambda. The parallel loading and processing is way faster than doing this on a single big EC2 instance.
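A rough sketch of what one of those per-file Lambda handlers might look like (bucket/key/columns are placeholders, and the partial results would still need a reduce step somewhere):

    import duckdb

    def handler(event, context):
        # Each invocation gets one S3 object from the Step Functions map.
        bucket = event["bucket"]   # hypothetical field names
        key = event["key"]

        con = duckdb.connect()
        # httpfs is needed for s3:// reads; in practice you'd pre-bundle the
        # extension and configure credentials (e.g. via DuckDB secrets).
        con.install_extension("httpfs")
        con.load_extension("httpfs")

        rows = con.sql(f"""
            SELECT customer_id, SUM(amount) AS total
            FROM read_parquet('s3://{bucket}/{key}')
            GROUP BY customer_id
        """).fetchall()
        return {"rows": len(rows)}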
Lambda isn’t infinitely parallel. I thought it doesn’t do more than 100 parallel runners? I4i.metal has 96 cores and can be faster than that.
As AWS says in https://aws.amazon.com/cn/blogs/aws/aws-lambda-functions-now...
> Each synchronously invoked Lambda function now scales by 1,000 concurrent executions every 10 seconds.
I’ve tried reading streamed parquet via PyArrow with Duck, and it’s been pretty promising. Depending on the query, you won’t need to download everything over HTTP.
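Something along these lines (hypothetical S3 path); because DuckDB pushes projections and filters down into the Arrow dataset, only the referenced columns and matching row groups get fetched:

    import duckdb
    import pyarrow.dataset as ds

    dataset = ds.dataset("s3://my-bucket/events/", format="parquet")

    con = duckdb.connect()
    con.register("events", dataset)

    recent = con.sql("""
        SELECT user_id, COUNT(*) AS n
        FROM events
        WHERE event_date >= DATE '2024-01-01'
        GROUP BY user_id
    """).fetchdf()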
We use partitioned Parquet files in S3. We use a CSV in the bucket root to track the files. I’m sure there’s a better way, but for now the 2TB of data are stored cheaply and we get fast reads by only reading the partitions we need to read.
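For Hive-style layouts (key=value directories), something like this only touches the matching partitions; the bucket, layout, and dt column are illustrative, and httpfs/S3 credentials are assumed to be set up already:

    import duckdb

    con = duckdb.connect()
    # hive_partitioning exposes the directory keys (here 'dt') as columns, and the
    # WHERE clause prunes to those partitions before any Parquet data is read.
    march = con.sql("""
        SELECT *
        FROM read_parquet('s3://my-bucket/events/*/*.parquet', hive_partitioning = true)
        WHERE dt BETWEEN '2024-03-01' AND '2024-03-31'
    """)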
I'm curious how it would compare, in terms of simplicity to build, manage, and run versus cost, to simply run a database on a large Vultr/DO instance and pay for 2TB of storage.
I feel like you'd get away with the whole thing for around $500/mo depending on how much compute was needed?
Well, that's not the infrastructure we have. We are primarily an AWS shop, so we use the resources available to us in the context of our infrastructure decisions. It would be a hard sell to buy something outside of that ecosystem.
You just need to try it once to see the issue. Merely loading this amount of data into a Postgres DB will be hell.
Huge fan of Clickhouse, but the minute you have to deal with somebody else's CSV is when Duck wins over Clickhouse.
The test case of a simple aggregation is a good example of an important data science skill: knowing when and where to use a given tool, and that there is no one right answer for all cases. Although it's worth noting that DuckDB and Polars are comparable performance-wise for aggregation (DuckDB slightly faster: https://duckdblabs.github.io/db-benchmark/ ).
For my cases with Polars and function piping, certain aspects of that workflow are hard to represent in SQL. Additionally, it's easier for iteration/testing on a given aggregation to add/remove a given function pipe, and to relate to existing tables (e.g. filter a table to only IDs present in a different table, which is more algorithmically efficient than a join-then-filter). Doing the ETL I tend to do for my data science work in pandas/Polars in SQL/DuckDB would require chains of CTEs or other shenanigans, which eliminates simplicity and efficiency.
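e.g., the "only IDs present in another table" case in Polars, with made-up frames; the is_in filter (or a semi join) avoids materializing a full join just to filter it afterwards:

    import polars as pl

    orders = pl.DataFrame({"id": [1, 2, 3, 4], "amount": [10, 20, 30, 40]})
    active = pl.DataFrame({"id": [2, 4]})

    # Keep only rows whose id appears in the other table.
    kept = orders.filter(pl.col("id").is_in(active["id"]))

    # Equivalent semi-join form.
    kept_semi = orders.join(active, on="id", how="semi")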
The real winner is going to be a framework that, during dev, transparently materializes CTEs to temporary tables so you can iterate on them like you’re saying, while continuing to harness SQL for the end product.
Do dbt or SQLMesh do this, or if not can you say more about what you’re envisioning?
> Note that DuckDB automatically figured out how to parse the date column.
It kinda did and it kinda didn't. The author got lucky that Transaction.csv contained a date where the day was after the 12th in a given month. Had there not been such a date, DuckDB would have gotten the dates wrong and read them as dd/mm/yyyy.
I think a warning from DuckDB would have been in order.
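One way to take the sniffer out of the equation is to pin the format explicitly (assuming the file really is US-style mm/dd/yyyy, as the article's data appears to be):

    import duckdb

    # Spell out the date format so mm/dd vs dd/mm never depends on which
    # sample rows the sniffer happened to look at.
    tx = duckdb.sql("""
        SELECT *
        FROM read_csv('Transaction.csv', dateformat = '%m/%d/%Y')
    """)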
Why not both? https://ibis-project.org/
My biggest issue with DuckDB is that it's not willing to implement edits to blob storages that allow edits (Azure). Having common object/blob storage that can be interacted with and operated on by multiple processes would make it much more amenable to many data-science-driven workflows.
I don't understand the purpose of this post. "I write a lot of X so I prefer using X over Y." Great.
It's an expression of personal experience, preferences, and thoughts on a personal blog, put out there for others who might care about DuckDB and pandas/Polars (and many did, as it got onto HN's front page).
They didn't write it to be some novel research, some canonical tutorial about the tech, or to teach/amuse each and every random reader.
At what database size does it make sense to move from SQLite to DuckDB? My use case is offline data analysis, not a query/response web app.
It's not so much about size but about usage pattern.
If your workloads require fast writes and reads, SQLite will probably work fine.
If you're looking to run analytic, columnar queries (which tend to involve a lot of aggregation and joins on a few columns (say less than 50) at a time), then DuckDB is way more optimized.
Oversimplifying, SQLite is more OLTP and DuckDB is more OLAP.
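Concretely, the kind of query DuckDB is built for is a scan-heavy aggregation that touches a couple of columns out of a potentially wide table (toy data below):

    import duckdb

    con = duckdb.connect()  # in-memory; pass a file path for persistence
    con.sql("""
        CREATE TABLE orders AS
        SELECT i % 5 AS region, random() * 100 AS order_total, i AS order_id
        FROM range(1000000) AS t(i)
    """)
    # Columnar execution only scans the two referenced columns,
    # no matter how many others the table has.
    con.sql("""
        SELECT region, AVG(order_total) AS avg_order
        FROM orders
        GROUP BY region
        ORDER BY region
    """).show()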
Lack of UDFs is an issue.
They have UDFs: https://duckdb.org/docs/api/python/function.html
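For example, registering a plain Python function as a scalar UDF looks roughly like this (toy function):

    import duckdb
    from duckdb.typing import VARCHAR

    def redact(email: str) -> str:
        # Mask everything before the @.
        _, _, domain = email.partition("@")
        return "***@" + domain

    con = duckdb.connect()
    con.create_function("redact", redact, [VARCHAR], VARCHAR)
    con.sql("SELECT redact('alice@example.com') AS masked").show()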
And macros! Which you can overload too
https://duckdb.org/docs/sql/statements/create_macro#overload...
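Roughly like this, going by that doc page (syntax from memory, so double-check the docs): one macro name, multiple arities, and DuckDB picks the overload by argument count.

    import duckdb

    con = duckdb.connect()
    con.sql("""
        CREATE MACRO add_em
            (a, b) AS a + b,
            (a, b, c) AS a + b + c
    """)
    con.sql("SELECT add_em(1, 2) AS two, add_em(1, 2, 3) AS three").show()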