Why DuckDB is my first choice for data processing

(robinlinacre.com)

59 points | by tosh 6 hours ago

19 comments

  • mrtimo 14 minutes ago

    What I love about duckdb:

    -- Support for .parquet, .json, .csv (note: Spotify listening history comes as multiple .json files, something fun to play with).

    -- Support for glob reading, like: select * from 'tsa20*.csv' - so you can read hundreds of files (any type of file!) as if they were one file.

    -- If the files don't have the same schema, union_by_name is amazing (see the sketch at the end of this list).

    -- The .csv parser is amazing. It auto-assigns types well.

    -- It's small! The WebAssembly version is 2 MB! The CLI is 16 MB.

    -- Because it is small you can add duckdb directly to your product, like Malloy has done: https://www.malloydata.dev/ - I think of Malloy as a technical person's alternative to PowerBI and Tableau, but it uses a semantic model that helps AI write amazing queries on your data.
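
    A rough sketch of glob reading + union_by_name together (file names hypothetical):

      -- read every matching csv as one table; union_by_name lines up
      -- files whose columns are ordered differently or partially missing
      select *
      from read_csv('tsa20*.csv', union_by_name = true);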

  • noo_u 30 minutes ago

    I'd say the author's thoughts are valid for basic data processing. Outside of that, most of the claims in this article, such as:

    "We're moving towards a simpler world where most tabular data can be processed on a single large machine1 and the era of clusters is coming to an end for all but the largest datasets."

    become very debatable. Depending on how you want to pivot/scale/augment your data, even datasets that seemingly "fit" on large boxes will quickly OOM you.

    The author also has another article where they claim that:

    "SQL should be the first option considered for new data engineering work. It’s robust, fast, future-proof and testable. With a bit of care, it’s clear and readable." (over polars/pandas etc)

    This does not map to my experience at all, outside of the realm of nicely parsed datasets that don't require too much complicated analysis or augmentation.

    • hnthrowaway0315 16 minutes ago

      SQL is popular because everyone can learn it and start using it after a while. I agree that Python is sometimes a better tool, but I don't see SQL going away anytime soon.

      From my experience, the data modelling side is still overwhelmingly in SQL. The ingestion side is definitely mostly Python/Scala though.

  • tjchear an hour ago

    I’ve not used duckdb before, nor do I do much data analysis, so I am curious about this one aspect of processing medium-sized json/csv with it: the data are not indexed, so any non-trivial query would require a full scan. Is duckdb so fast that this is never really a problem for most folks?

    • biophysboy an hour ago

      Zonemaps are created for columns automatically. I process somewhat large tables w/ duckdb regularly (100M rows) and never have any problems.

      • riku_iki 7 minutes ago

        that's true for duckdb native tables, but the question was about json.
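
        One workaround, as a sketch (file and column names made up): materialize the json into a native table first, and zonemaps come along for free.

          -- load the json once into duckdb's native columnar format
          create table events as
          select * from read_json_auto('events.json');

          -- repeated queries on the native table benefit from zonemaps
          select count(*) from events where user_id = 42;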

    • simlevesque 28 minutes ago

      But even without an index, once your json or csv is stored columnarly: if you have say 10 columns, each column is stored separately on disk instead of all together, so a scan for one column only needs to read a tenth of the disk space used for the data. Obviously this depends on the columns' content.

    • akhundelar an hour ago

      Not a duckdb user, but I use polars a lot (mentioned in the article).

      Depends on your definition of medium sized, but for tables of hundreds of thousands of rows and ~30 columns, these tools are fast enough to run queries instantly or near instantly even on laptop CPUs.

    • mpalmer an hour ago

      I guess the question is: how much is medium? DuckDB can handle quite a lot of data without breaking a sweat. Certainly if you prefer writing SQL for certain things, it's a no-brainer.

  • s-a-p an hour ago

    "making DuckDB potentially a suitable replacement for lakehouse formats such as Iceberg or Delta lake for medium scale data" > I'm a Data Engineering noob, but DuckDB alone doesn't do metadata & catalog management, which is why they've also introduce DuckLake.

    Related question: curious as to your experience with DuckLake if you've used it. I'm currently setting up S3 + Iceberg + DuckDB for my company (a startup) and was wondering what to pick between Iceberg and DuckLake.

    • biophysboy 41 minutes ago

      DuckLake is pretty new, so I guess it would depend on whether you need a more mature, fully-featured option.
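
      If it helps, attaching a DuckLake catalog looks roughly like this (paths are placeholders, and the option names are from my reading of the ducklake extension, so treat it as a sketch, not gospel):

        install ducklake;
        -- table metadata lives in the catalog file, data files go under data_path
        attach 'ducklake:metadata.ducklake' as my_lake (data_path 's3://my-bucket/lake/');
        use my_lake;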

  • DangitBobby 3 hours ago

    Being able to use SQL on CSV and json/jsonl files is pretty sweet. Of course it does much more than that, but that's what I do most often with it. Love duckdb.

    • samuell an hour ago

      Indeed! I generally like awk a lot for simpler CSV/TSV processing, but when it comes to cases where you need things like combining/joining multiple CSV files or aggregating over certain columns, SQL really shines IME.
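
      E.g. a join plus aggregation straight over two csv files (file and column names made up):

        -- hypothetical files: orders.csv (customer_id, amount), customers.csv (id, name)
        select c.name, sum(o.amount) as total
        from 'orders.csv' o
        join 'customers.csv' c on o.customer_id = c.id
        group by c.name;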

  • biophysboy an hour ago

    I think my favorite part of duckdb is its flexibility. It's such a handy little Swiss Army knife for doing analytical processing in scientific environments (messy data w/ many formats).

  • smithclay an hour ago

    Agree with the author, will add: duckdb is an extremely compelling choice if you’re a developer and want to embed analytics in your app (which can also run in a web browser with wasm!)

    Think this opens up a lot of interesting possibilities, like more powerful analytics notebooks such as marimo (https://marimo.io/) … and that’s just one example of many.

    • canadiantim 23 minutes ago

      The wasm is pretty heavy data-wise tho, I’m hoping eventually it’ll be lighter for easier loading on not so good devices.

  • oulu2006 3 hours ago

    That's really interesting, I love the idea of being able to use columnar support directly within postgresql.

    I was thinking of using Citus for this, but possibly using duckdb is a better way to do it. Citus comes with a lot more out of the box, but duckdb could be a good stepping stone.

    • biophysboy 43 minutes ago

      It's a really handy tool. I've queried basically everything you can w/ duckdb - csv, json, s3 buckets, MS SQL servers, excel sheets, pandas dataframes, etc - and have had very few issues.
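
      The attach pattern is similar for most of these, e.g. Postgres via the postgres extension (connection details and table name are placeholders):

        install postgres;
        load postgres;
        -- attach a live postgres database and query it like a local catalog
        attach 'dbname=mydb host=127.0.0.1 user=me' as pg (type postgres);
        select * from pg.public.some_table limit 10;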

  • clumsysmurf an hour ago

    DuckDB has experimental builds for Android ... I'm wondering how much work it would take to implement a Java API for it similar to sqlite (Cursor, etc).