As much as DuckDB is cute, I've mostly come to believe that ClickHouse is the perfect thing to pair Postgres with. This is especially true now that they've acquired PeerDB and are integrating it into the ClickPipes cloud product.
DuckDB is neat, and I understand why a company like BemiDB would build their product on top of it, but as a prospective customer I find embedded databases a weird choice for serious workloads when there are other good open-source solutions like ClickHouse available.
ClickHouse is definitely a popular choice nowadays. I'm curious whether you self-host ClickHouse or use their Cloud? We wanted to make BemiDB as simple to run as possible with a single binary and object storage (vs large machines, big disks, clustering, running Temporal for CDC, etc.)
ClickHouse has an embedded version (https://github.com/chdb-io/chdb). The issue with DuckDB is that it's too buggy for production loads. You can see a nice list of the issues here:
https://news.ycombinator.com/item?id=41490707
Cool. Every database or data source (e.g. CRM) should produce Iceberg format for you.
Though I'm a little sceptical of embedding DuckDB. It is easy and better to isolate read/write paths, and doing so has a lot of other benefits.
Iceberg for the win!
We actually separate read/write paths: BemiDB reads by leveraging DuckDB as a query engine, and it writes to Iceberg completely separately from DuckDB. I'm curious if that's what you imagined.
So it looks like you don't use Postgres extensions, so you can run this on an EC2 instance against an Aurora Postgres instance and dump files to S3 as Iceberg, right?
And can you then have Glue Catalog auto-crawl them and expose them in Athena? Or are they DuckDB-managed Iceberg tables essentially?
Exactly! You can run it on any server connecting to any Postgres, without installing custom extensions (AWS Aurora supports only a limited number of extensions https://docs.aws.amazon.com/AmazonRDS/latest/AuroraPostgreSQ...).
The Iceberg tables are created separately from the DuckDB query engine. So you should be able to read these Iceberg tables by using any other Iceberg-compatible tools and services like AWS Athena.
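For example, here's a rough sketch of reading such a table with DuckDB's iceberg extension (the bucket and path are made up); Athena can query the same table once it's registered in the Glue Catalog:

    INSTALL iceberg;
    LOAD iceberg;
    INSTALL httpfs;
    LOAD httpfs;

    -- S3 credentials still need to be configured on the DuckDB side;
    -- the path below is a made-up example of a table written by the sync.
    SELECT count(*)
    FROM iceberg_scan('s3://my-bucket/iceberg/public/events');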
Really cool! I have an IoT use case where I ingest data, want to keep the last 3 months or so in PostgreSQL, and then store the older data as Parquet files on S3.
I initially planned to store chunks on S3 and run the analytical queries using DuckDB; I'm wondering if your tool would be a good replacement?
For now I don't have that many analytical queries; I'm mostly doing visualization of the data points by querying a range (e.g., the last 2 weeks of data for a device).
Does it then make sense to use columnar storage or am I better off with "regular Postgres"?
Or in my case does your approach provide "best of both worlds" in the sense that I could do some occasional analytical queries on past data stored on S3, and regularly access "last 3 months" data for visualization using the data stored in the regular Postgres?
Thank you!
Yes, absolutely!
1) You could use BemiDB to sync your Postgres data (e.g., partition time-series tables) to S3 in Iceberg format. Iceberg is essentially a "table" abstraction on top of columnar Parquet data files with a schema, history, etc.
2) If you don't need strong consistency and are fine with delayed data (the main trade-off), you can use just BemiDB to query and visualize all data directly from S3. From a query perspective, it's like DuckDB that talks the Postgres wire protocol.
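As a sketch of what point 2 looks like in practice: you connect any Postgres client to the BemiDB endpoint and run regular SQL, which DuckDB executes against the Iceberg/Parquet data on S3 (the table and column names below are hypothetical):

    SELECT device_id,
           date_trunc('day', recorded_at) AS day,
           avg(temperature)               AS avg_temp
    FROM iot_events
    WHERE recorded_at >= now() - INTERVAL '2 weeks'
    GROUP BY 1, 2
    ORDER BY 1, 2;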
Feel free to give it a try! And although it's a new project, we plan to keep building and improving it based on user feedback.
Thanks!
- Can you give me more info about the strong consistency and delayed data, so I can better picture it with a few examples?
- Also, is it possible to do the sync with the columnar data in "more-or-less real-time" (e.g., do a NOTIFY on a new write in my IoT events table and push it to the storage)?
- Would your system also be suited for a kind of "audit-log" data? E.g., if I want to have some kind of audit table of all the changes in my database, but only want to keep a few weeks' worth at hand and push the rest to S3, or does it not make much sense with that kind of data?
For now, BemiDB supports only full Postgres table data re-sync. We plan to enable real-time data syncing from Postgres into S3 by using logical replication (CDC), which is much more reliable than PG NOTIFY.
We use logical replication and this exact approach with our other project related to auditing and storing Postgres data changes: https://github.com/BemiHQ/bemi. We're thinking about combining these approaches to leverage a scalable and affordable separate storage layer on S3.
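For reference, the Postgres-side setup for logical replication is fairly small; a minimal sketch (publication, slot, and table names are hypothetical):

    -- Requires wal_level = logical on the Postgres server.
    CREATE PUBLICATION bemidb_pub FOR TABLE public.iot_events;

    -- The CDC consumer streams row changes from this slot using the
    -- built-in pgoutput logical decoding plugin.
    SELECT pg_create_logical_replication_slot('bemidb_slot', 'pgoutput');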
Lmk if that makes sense or if you had any more questions!
Really interesting, thanks! I guess my use case would rather require incremental updates.
Ideally it would just sync in real time and buffer new data in the Bemi binary (with some WAL-like storage to make sure data is preserved on binary crash/reload), and when it has enough, push it to S3, etc.
Is this the kind of approach you're going to take?
Yes, we want to use the approach you described! We'll likely wait until enough changes have accumulated by using two configurable thresholds: time (like 30s) and size (like 100MB).
Use a ClickHouse FDW or something similar; ClickHouse has excellent integration with Postgres. They also have a great embedded Python version. Their marketing isn't as good as DuckDB's, but in terms of stability and performance they are so much better. DuckDB is very, very buggy and full of sharp edges, but because of their VC-funded developer marketing, you don't really hear people talking about it.
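For context, ClickHouse's built-in Postgres integration looks roughly like this (hosts, databases, and credentials below are placeholders):

    -- Ad-hoc query against a live Postgres table from ClickHouse
    SELECT count(*)
    FROM postgresql('pg-host:5432', 'mydb', 'events', 'reader', 'secret');

    -- Or expose a whole Postgres database inside ClickHouse
    -- (queries are forwarded to Postgres on the fly)
    CREATE DATABASE pg_mirror
    ENGINE = PostgreSQL('pg-host:5432', 'mydb', 'reader', 'secret');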
I agree that DuckDB may sometimes be buggy because it's being actively developed with a large surface area (a universal embeddable query engine that works with different storage layers if I were to simplify).
However, DuckDB (non-profit foundation) != MotherDuck (VC funded). These are two separate organizations with different goals. I see DuckDB as a tool, not as a SaaS or a VC-funded company. My hope is that it'll be adopted by other projects and not associated with just a single for-profit company.
The AGPL license is a no-go for me.
While it’s technically true that it’s an OSI-approved license, it’s mostly used to scare away competing cloud vendors from hosting the software, which isn’t in the spirit of OSS.
Have you looked into the more modern choices?
Like the Business Source License that MariaDB created and uses or the Functional Source License that Sentry created as an improvement over the Business Source License? https://fsl.software/
Both those licenses have a fair source phase that automatically resolves into an open source phase over time.
Thus one gets the best of both worlds: an honest, descriptive license for protecting one’s business model plus a normal permissive OSS license that ensures longevity and prevents lock-in.
Because you want to take their hard work, modify it and not share it back to the community?
I'm not crying that "it's not for you".
Our philosophy in general is to go to a more open license over time (vs the other direction). So we might consider other more permissive OSI-approved licenses.
Would you be able to share why the AGPL license is a no-go for you? I'm genuinely curious about your use case. In simple words, it'd require a company to open source their BemiDB code only if they made modifications and were distributing it to other users (it allows modifications and internal use without any restrictions).
Please, don’t. AGPL is great and you’re fine using it.
You’re seriously calling out a perfectly valid OSS license for not being “in the spirit of OSS”, and pitching licenses that are explicitly NOT OSS?!
AGPL couldn’t be more in the spirit of OSS. The entire free software movement started to defend the _users_ freedom, not individual companies’.
Definitely checking this out today! I use postgres for ~30 GB of machine learning data (object detection) and have a couple workflows which go through the Postgres->Parquet->DuckDB processing route.
A couple questions, if you have time:
1. How do you guys handle multi-dimensional arrays? I've had issues with a few postgres-facing interfaces (libraries or middleware) where they believe everything is a 1D array!
2. I saw you are using pg_duckdb/duckdb under the hood. I've had issues calling plain-SQL functions defined on the postgres server, when duckdb is involved. Does BemiDB support them?
Thanks for sharing, and good luck with it!
Thank you, please give it a try!
Great questions:
1. We currently don't support multi-dimensional arrays, but we plan to add support for such complex data structures (there's a small illustration of the Postgres behavior below).
2. Would you be able to share what type of user-defined functions these are? Do they modify the data or only read it?
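For context on point 1, a tiny illustration of why Postgres-facing tools often assume 1-D arrays (the table and values are made up): Postgres doesn't enforce the declared dimensionality, so float8[][] and float8[] are the same underlying type and clients have to inspect the values themselves.

    CREATE TABLE detections (
        id    bigint PRIMARY KEY,
        boxes float8[][]   -- declared as 2-D, but not enforced by Postgres
    );

    INSERT INTO detections
    VALUES (1, '{{0.1, 0.2, 0.3, 0.4}, {0.5, 0.6, 0.7, 0.8}}');

    SELECT array_ndims(boxes) FROM detections;  -- returns 2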
This is probably the most streamlined/all-inclusive solution out of all that I've seen, but this has definitely been an extremely saturated space in 2024
mostly everyone riding on DuckDB's coattails
Moving data between systems is problematic. Where this product is actually needed (multi-TB databases under load) is where logical replication won't be able to sync your tables in time. Conversely, small databases where this will work don't really need columnar storage optimizations.
For my use case of something similar on ClickHouse:
We load data from Postgres tables that are used to build ClickHouse dictionaries (a hash table for JOIN-ish operations).
The big tables do not arrive via real-time-ish sync from Postgres but are bulk-appended using a separate infrastructure.
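For readers unfamiliar with that pattern, a rough sketch of a ClickHouse dictionary sourced from a Postgres table (all names, hosts, and credentials are placeholders):

    CREATE DICTIONARY device_dict
    (
        device_id UInt64,
        name      String
    )
    PRIMARY KEY device_id
    SOURCE(POSTGRESQL(
        host 'pg-host' port 5432 db 'mydb' table 'devices'
        user 'reader' password 'secret'
    ))
    LAYOUT(HASHED())
    LIFETIME(MIN 300 MAX 600);

    -- Used like a fast in-memory lookup instead of a JOIN:
    SELECT dictGet('device_dict', 'name', toUInt64(42));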
Fair point. We think BemiDB is currently most useful with small and medium Postgres databases. Running complex analytics queries on Postgres can work, but it usually requires tuning and adding indexes tailored to these queries, which may negatively impact write performance on the OLTP side or may not be possible if these are ad-hoc queries.
> (multi-TB databases under load) is where logical replication won't be able to sync your tables in time
I think the ceiling for logical replication (and optimization techniques around it) is quite high. But I wonder what people do when it doesn't work at that scale?
How does this replicate Postgres data? I glanced at the code and saw that it exports to a CSV file then writes out an Iceberg table for an initial snapshot--does it use Postgres logical replication?
Full table re-syncing is our initial solution. Using Postgres logical replication is next on our roadmap!
Very cool!! We have the same vision with pg_mooncake: https://github.com/Mooncake-Labs/pg_mooncake/tree/main
From what I understand, the BemiDB experience is akin to PeerDB + ClickHouse. It's not really a Postgres extension?
Glad open table formats are becoming mainstream, for everyone.
Thanks!
We love the pg_mooncake extension (and pg_duckdb used under the hood), although our approaches are slightly different. Long-term, we want to allow anyone to use BemiDB via native Postgres logical replication without installing any extensions (many Postgres hosting providers impose restrictions, upgrading versions might be challenging, OLAP queries may affect OLTP performance if run within the same database, etc.)
Congratulations! I was looking at pg_analytics from ParadeDB hoping this use case would be solved (the dump from Postgres to Parquet part), but it doesn't do it yet.
How does it handle updates?
Thank you!
The pg_analytics Postgres extension partially supports different file formats. We bet big on the Iceberg open table format, which uses Parquet data files under the hood.
Our initial approach is to do periodic full table resyncing. The next step is to support incremental Iceberg operations like updates. This will involve creating a new "diff" Parquet file and using the Iceberg metadata to point to the new file version that changes some rows. Later this will enable time travel queries, schema evolution, etc.
Fantastic!
Query Engine: embeds the DuckDB query engine to run analytical queries.
Storage Layer: uses the Iceberg table format to store data in columnar compressed Parquet files.
Smart. Imma test this out for sure.
Thanks! Give it a try and let us know any feedback :)
Can you give an example if I have 5 GB (2 million rows)?
How will it be stored differently for columnar access?
We ran some benchmarks (TPC-H, designed for OLAP) with ~10M records https://github.com/BemiHQ/BemiDB#benchmark
The BemiDB storage layer produced ~300 MB of columnar Parquet files (with ZSTD compression) vs 1.6 GB of data in Postgres.
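If you want a rough feel for the compression on your own data, a quick experiment with plain DuckDB (not BemiDB's exact pipeline; the table and file names are made up):

    -- Assuming the table is already queryable in DuckDB
    -- (e.g. imported or attached via the postgres extension):
    COPY (SELECT * FROM events)
    TO 'events.parquet' (FORMAT PARQUET, COMPRESSION ZSTD);

    -- Then compare the file size with the table size reported by Postgres:
    --   SELECT pg_size_pretty(pg_total_relation_size('events'));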
Does TPC-H SF1 really take _one and a half hours_ for you on regular Postgres? Last time I tried (in the form of DBT-3), it was 22 queries and most of them ran in a couple seconds.
Interesting. I haven't used the DBT-3 kit; does it add any indexes? I manually added these Postgres indexes https://github.com/BemiHQ/BemiDB/blob/main/benchmark/data/cr... to remove the main bottlenecks on SF0.1, reducing the total time from 1h23m13s to 1.5s. But SF1 still took more than 1h.
It adds a bunch of indexes, yes. I don't think anyone really runs TPC-H unindexed unless they are using a database that plain doesn't support it; it wouldn't really give much meaningful information.
Edit: I seemingly don't have these benchmarks anymore, and I'm not going to re-run them now, but I found a very (_very_) roughly similar SF10 run clocking in around seven minutes total. So that's the order of magnitude I would be expecting, given ten times as much data.
Got it, thanks for sharing it! We'll try to look into DBT-3 and the indexes it creates to test with SF10
How does this compare to ParadeDB? Seems to occupy the same space
We love ParadeDB and their team. Their primary focus is search (Elasticsearch on Postgres), but they also have the pg_analytics Postgres extension (foreign data wrappers and embedded DuckDB).
The biggest difference is in a Postgres extension vs a separate OLAP process. We want to allow anyone with just Postgres to be able to perform analytics queries without affecting resources in the transactional database, building and installing extensions (might not be possible with some hosting providers), dealing with dependencies and their versions when upgrading Postgres, manually syncing data from Postgres to S3, etc.
Difference to something like DuckDB?
> Alternatives
> DuckDB:
> - Designed for OLAP use cases. Easy to run with a single binary.
> - Limited support in the data ecosystem (notebooks, BI tools, etc.). Requires manual data syncing and schema mapping for best performance.
^ This!
Here is the link that briefly describes pros and cons of different alternatives for analytics https://github.com/BemiHQ/BemiDB#alternatives
How can you set up automatic replication from PostgreSQL to a DuckDB instance?
The most common approach is to read Postgres data in DuckDB: https://duckdb.org/docs/extensions/postgres.html
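For example, a minimal sketch with DuckDB's postgres extension (the connection string and table name are placeholders); note that this reads the live Postgres data at query time rather than replicating it:

    INSTALL postgres;
    LOAD postgres;

    -- Attach a live Postgres database read-only and query it from DuckDB
    ATTACH 'dbname=mydb host=localhost user=reader password=secret'
        AS pg (TYPE POSTGRES, READ_ONLY);

    SELECT count(*) FROM pg.public.events;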