650GB of Data (Delta Lake on S3). Polars vs. DuckDB vs. Daft vs. Spark

(dataengineeringcentral.substack.com)

46 points | by tanelpoder 3 hours ago

8 comments

  • luizfelberti 25 minutes ago

    Honestly this benchmark feels completely dominated by the instance's NIC capacity.

    They used a c5.4xlarge, which has a 10Gbps NIC; at constant 100% saturation, that link would take in the ballpark of 9 minutes just to pull all of that data from S3. That's your best-case scenario for reading the data (without even considering writing it back!).
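
    Back-of-the-envelope (a rough sketch; it assumes a perfect, sustained line rate that real S3 traffic won't hit):

        # Best-case transfer time: 650 GB over a fully saturated 10 Gbps NIC.
        data_bytes = 650 * 10**9         # 650 GB (decimal)
        link_bps = 10 * 10**9            # 10 Gbps line rate, ignoring protocol overhead
        seconds = data_bytes * 8 / link_bps
        print(f"{seconds:.0f}s ~= {seconds / 60:.1f} min")  # ~520s, about 8.7 minutes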

    Minute differences in how these query engines schedule IO would have drastic effects on the benchmark outcomes, and I doubt the query engine itself was kept constantly fed during this workload, especially in the DuckDB and Polars runs.

    The irony of workloads like this is that it might be cheaper to pay for a gigantic instance that runs the query and finishes quickly than to pay for a cheaper instance that takes several times longer.
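
    To make that concrete, a toy cost model (the hourly rates below are made-up placeholders, not real AWS pricing):

        # Hypothetical: a 4x-priced instance that, no longer NIC-bound,
        # finishes 6x faster ends up cheaper overall.
        small_rate, small_hours = 0.68, 3.0   # placeholder rate/runtime for the small box
        big_rate, big_hours = 2.72, 0.5       # placeholder rate/runtime for the big box
        print(f"small: ${small_rate * small_hours:.2f}")  # small: $2.04
        print(f"big:   ${big_rate * big_hours:.2f}")      # big:   $1.36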

  • jdnier 18 minutes ago

    DuckDB has a new "DuckLake" catalog format that would be another candidate to test: https://ducklake.select/
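
    For anyone curious, the attach flow looks roughly like this from Python (a sketch based on the DuckLake docs; the catalog file name is a placeholder):

        import duckdb

        con = duckdb.connect()
        con.execute("INSTALL ducklake; LOAD ducklake;")
        # Attach a DuckLake catalog (file name is a placeholder)
        con.execute("ATTACH 'ducklake:metadata.ducklake' AS my_lake;")
        con.execute("USE my_lake;")
        print(con.execute("SHOW TABLES;").fetchall())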

  • esafak an hour ago

    If I understand correctly, Polars relies on delta-rs for Delta Lake support, and that is what does not support deletion vectors: https://github.com/delta-io/delta-rs/issues/1094
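
    i.e. a read like this goes through delta-rs under the hood, so a table using deletion vectors would fail to load (sketch; the S3 path is a placeholder):

        import polars as pl

        # scan_delta delegates to delta-rs, hence the limitation above.
        lf = pl.scan_delta("s3://my-bucket/my-delta-table")  # placeholder path
        print(lf.select(pl.len()).collect())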

    It seems like these single-node libraries can process a terabyte on a typical machine, and you'd have to have over 10TB before moving to Spark.

    • mynameisash 17 minutes ago

      > It seems like these single-node libraries can process a terabyte on a typical machine, and you'd have to have over 10TB before moving to Spark.

      I'm surprised by how often people jump to Spark because "it's (highly) parallelizable!" and "you can throw more nodes at it easy-peasy!" And yet, there are so many cases where you can just do things with better tools.

      Like the time a junior engineer asked for help processing hundreds of ~5GB JSON files with a script that turned out to be doing crazy amounts of string concatenation in Python (don't ask). It was taking something like 18 hours to run, IIRC; writing a simple console tool to do the heavy lifting and letting Python's multiprocessing tackle it dropped the time to something like 35 minutes.
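
      The winning pattern was roughly this shape (a sketch, not the actual tool; the directory and per-file work are stand-ins):

          import json
          import multiprocessing as mp
          from pathlib import Path

          def process_file(path: Path) -> int:
              # Stand-in for the real per-file work: stream line by line
              # instead of concatenating giant strings in memory.
              count = 0
              with open(path) as f:
                  for line in f:
                      json.loads(line)  # parse; the real transform went here
                      count += 1
              return count

          if __name__ == "__main__":
              files = sorted(Path("data/").glob("*.json"))  # placeholder directory
              with mp.Pool() as pool:  # defaults to one worker per core
                  counts = pool.map(process_file, files)
              print(f"{sum(counts)} records across {len(files)} files")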

      Right tool for the right job, people.

      • esafak a minute ago

        I used PySpark a while back, when it was first introduced at my company, and realized it was slow whenever you used Python libraries inside UDFs rather than PySpark's own functions.
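
        The classic illustration (the column and table here are toy data): a Python UDF forces every row across the JVM/Python serialization boundary, while the built-in function stays in the JVM:

            from pyspark.sql import SparkSession
            from pyspark.sql import functions as F
            from pyspark.sql.functions import udf
            from pyspark.sql.types import StringType

            spark = SparkSession.builder.getOrCreate()
            df = spark.createDataFrame([("alice",), ("bob",)], ["name"])  # toy data

            # Slow path: rows get pickled out to Python workers and back.
            upper_udf = udf(lambda s: s.upper(), StringType())
            df.select(upper_udf("name")).show()

            # Fast path: F.upper runs entirely inside the JVM.
            df.select(F.upper("name")).show()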

  • co0lster 27 minutes ago

    650GB refers to the size of the Parquet files, which are compressed; in reality the data is much larger.

    32GB of Parquet cannot fit in 32GB of RAM.
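
    Easy to check for any given file (a sketch; the path is a placeholder, and the ratio varies wildly with the data and encodings):

        import os
        import pyarrow.parquet as pq

        path = "data.parquet"  # placeholder
        on_disk = os.path.getsize(path)
        in_memory = pq.read_table(path).nbytes  # size of the decoded Arrow buffers
        print(f"{on_disk / 2**30:.2f} GiB on disk -> {in_memory / 2**30:.2f} GiB in memory")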

  • blmarket 41 minutes ago

    Presto (a.k.a. AWS Athena) might be a faster/better alternative? Would also like to see the numbers if the 650GB of data were available locally.
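
    Pointing Athena at the same S3-resident files is just a couple of API calls (a sketch; the table, database, and bucket names are placeholders):

        import boto3

        athena = boto3.client("athena")
        # Athena scans the files in S3 directly and bills per TB scanned.
        resp = athena.start_query_execution(
            QueryString="SELECT count(*) FROM my_table",      # placeholder table
            QueryExecutionContext={"Database": "my_db"},      # placeholder database
            ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
        )
        print(resp["QueryExecutionId"])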