54 comments

  • 2000swebgeek 4 days ago

    pg_flo looks like a very interesting project we might end up using. I like the archival to S3, and I like having a cheaper version of this kind of tool for lower-scale data (GBs rather than the TBs I think Debezium is built for) that I can easily test locally.

    The work/effort I'd need to put in with Kafka etc. for Debezium is a short-term effort, but I'm still weighing whether the hassle is worth it.

    Another solution being evaluated by my team is PeerDB - https://docs.peerdb.io/why-peerdb

    • mritchie712 3 days ago

      peerdb was acquired by clickhouse, so if you're syncing to a destination other than clickhouse, you may see support for it wane

    • bzzzt 3 days ago

      There's a standalone version of Debezium nowadays which can publish directly to a NATS queue. No Kafka required anymore.

    • shayonj 4 days ago

      Thanks for checking it out - curious, are you syncing/storing the data in some specific format in S3?

  • iyn 4 days ago

    Looks interesting, I’ve been investigating Postgres CDC solutions for a while, so I’m curious if this could help my use-case.

    Can you elaborate on failure modes? What happens if e.g. NATS server (or worker/replicator) node dies?

    In principle, how hard is it to move data from Postgres not to another PG but to e.g. ElasticSearch/ClickHouse?

    • shayonj 4 days ago

      re: worker/replicator dying, if the worker dies, basically nothing is reading from the subject/queue (NATS), so the changes aren't reaching the destination. The default config is to have one worker. If the replicator dies, the changes from the replication slot aren't getting delivered to NATS, so the WAL size will grow on the source DB.

      There are some plans to introduce metrics/instrumentation to stay on top of these things, to make it more production ready.
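
      In the meantime, a minimal sketch (nothing pg_flo-specific, just standard PostgreSQL catalog views and a made-up connection string) of how you could watch how much WAL a logical slot is holding back on the source:

        package main

        import (
            "database/sql"
            "fmt"
            "log"

            _ "github.com/lib/pq"
        )

        func main() {
            db, err := sql.Open("postgres", "postgres://localhost/mydb?sslmode=disable")
            if err != nil {
                log.Fatal(err)
            }
            // How far behind each logical slot's consumer is; if the replicator is
            // down, this keeps growing and so does pg_wal on the source's disk.
            rows, err := db.Query(`
                SELECT slot_name,
                       coalesce(pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)), 'n/a')
                FROM pg_replication_slots
                WHERE slot_type = 'logical'`)
            if err != nil {
                log.Fatal(err)
            }
            defer rows.Close()
            for rows.Next() {
                var slot, retained string
                if err := rows.Scan(&slot, &retained); err != nil {
                    log.Fatal(err)
                }
                fmt.Printf("%s is retaining %s of WAL\n", slot, retained)
            }
        }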

      The replicator and worker can pause/resume the stream when shutting down, say during a deploy.

      re: moving data, I haven't attempted it, but I have seen PeerDB mentioned for moving data to Clickhouse and it seems quite nice.

      I am starting w/ PostgreSQL <> PostgreSQL to mostly get the fundamentals right first. Would love to hear any use cases you have in mind.

      • iyn 4 days ago

        I have these use cases:

        1. Syncing data in Postgres to ElasticSearch/ClickHouse (which handle search/analytics on the data I store in PG)

        2. Invoking my own workflow engine — I have built a system that allows end-users to define triggers that start workflows when some data change/is created. To determine whether I need to start the workflow, I need to inspect every CRUD operation and check it against triggers defined by the users.

        I'm currently doing that in a duct-tape-like way by publishing to SNS from my controllers and having SQS subscribers (mapped to Lambdas) that are responsible for different parts of my "pipeline". I don't like this system as it's not fault-tolerant and I'd prefer to do this by processing WAL and explicitly acknowledging processed changes.

        • shayonj 3 days ago

          I think PeerDB is a smart choice if you are looking to sync the data to Clickhouse. They have nice integration support as well.

          re: 2. I have some plans around a control plane that allows users to define these transformation rules and routing config and then take further actions based on the outcomes. If you are interested in it, feel free to sign up on the homepage. Also, very happy for some quick chats too (shayonj at gmail). Thanks

          • iyn 3 days ago

            Yeah, I've looked into PeerDB but in terms of self-hosting it's not really lightweight, as they depend on Temporal [0]. I'm currently optimizing for less complexity/budget, as I have just a few customers.

            [0] https://docs.peerdb.io/architecture#dependencies

            • saisrirampur 3 days ago

              Sai from PeerDB here. Temporal has been very impactful for us and a major factor in our ability to build a production-grade product that supports large-scale workloads.

              At a high level, CDC is a complex state machine. Temporal helps build that state machine, taking care of auto-retries/idempotency at different failure points, and also aids in managing and observing it. This is very useful for identifying root causes when issues arise.
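
              To make that concrete, a rough illustration (not PeerDB's actual code; the activity names are made up) of how a Temporal workflow in Go attaches retry policies to individual steps:

                package cdc

                import (
                    "time"

                    "go.temporal.io/sdk/temporal"
                    "go.temporal.io/sdk/workflow"
                )

                // SyncTableWorkflow drives one table's sync; each step is an activity
                // that Temporal retries independently and resumes after a crash.
                func SyncTableWorkflow(ctx workflow.Context, table string) error {
                    ctx = workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
                        StartToCloseTimeout: 30 * time.Minute,
                        RetryPolicy: &temporal.RetryPolicy{
                            InitialInterval:    5 * time.Second,
                            BackoffCoefficient: 2.0,
                            MaximumAttempts:    10,
                        },
                    })
                    // Hypothetical activity names; a failure picks up from the failed
                    // step instead of redoing the whole workflow.
                    if err := workflow.ExecuteActivity(ctx, "SnapshotTable", table).Get(ctx, nil); err != nil {
                        return err
                    }
                    return workflow.ExecuteActivity(ctx, "StreamChanges", table).Get(ctx, nil)
                }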

              Managing Temporal shouldn't be complex. They offer a well-maintained, mature Docker container. From a user standpoint, the software is intuitive and easy to understand. We package the Temporal Docker container in our own Docker setup and have it integrated into our Helm charts. We have quite a few users smoothly using Enterprise (which we open sourced recently) and standard OSS!

              https://github.com/PeerDB-io/peerdb/blob/main/docker-compose...

              https://github.com/PeerDB-io/peerdb-enterprise

              Let me know if there are any questions!

              • iyn 2 days ago

                Thanks for reaching out! Just to be clear, from what I can tell, both PeerDB and Temporal are great (and I've been hoping to learn Temporal for a while). At some point I considered self-hosting PeerDB, but my impression was that it required multiple nodes to run properly and so wasn't budget-friendly - this is also based on your pricing plans, with $250 being the cheapest, which suggests that it's not cheap to host (I'm trying to minimize costs until I have more customers). Please correct me if I'm wrong! Can you give me an example of a budget-friendly deployment, e.g. how many EC2 instances for PeerDB would I need for one of the smaller RDS instances?

                Given the acquisition by ClickHouse (congrats!), what can we expect for the CDC for sinks other than CH? Do you plan to continue supporting different targets or should we expect only CH focus?

                Edit: also, any plans for supporting e.g. SNS/SQS/NATS or similar?

                • saisrirampur 2 days ago

                  Great question! PeerDB can just be run on a single EC2 instance (using either Docker or Helm charts). A typical production-grade setup could use 4 vCores with 16GB RAM. You can scale up or down based on resource usage for your workload.

                  To elaborate more on the architecture (https://docs.peerdb.io/architecture), the flow-worker does most of the heavy lifting (actual data movement during initial load and CDC), while the other components are fairly lightweight. Allocating around 70-80% of provisioned resources to the flow-worker is a good estimate. For Temporal, you could allocate 20-25% of resources and distribute the rest to other components.

                  Our open-source offering (https://github.com/PeerDB-io/peerdb) supports multiple connectors (CH and non-CH). Currently, there aren’t any plans to make changes on that front!

                  • iyn a day ago

                    Thanks for guidance on sizing the instance(s), I'll experiment with that in the coming weeks.

              • merb 3 days ago

                1. I am not sure if the helm chart can be used for the OSS version? 2. If a helm chart needs sh files, it's already an absolute no-go since it won't work with GitOps that well.

                • assaxor 3 days ago

                  Hi, the helm chart uses the OSS PeerDB images. The sh files were created to bootstrap the values files for easier (and faster) POCs.

                  You can append a `template` argument when running the script files, which will generate a set of values files that you can then modify accordingly. There is a production guide for this as well, since we have customers in production using GitOps (ArgoCD) with the provided charts (https://github.com/PeerDB-io/peerdb-enterprise/blob/main/PRO...)

            • shayonj 3 days ago

              Yeah, totally fair.

              Are you ok with a NATS dependency? Happy to work with you in supporting a new destination like ES.

              Also looking to make NATS optional for smaller/simpler setups (https://github.com/shayonj/pg_flo/issues/21)

              • iyn 3 days ago

                Yes, I think NATS is reasonable — I don't have operational experience with it but based on my earlier reading it seems that it can be run on a smaller budget. Is this "regular" NATS or the Jetstream variant?

                • shayonj 3 days ago

                  Perf! From testing on some of my staging workloads, the footprint isn't too high and I can get 5-6k messages/s, esp. since there is only one worker instance involved (for strict ordering).

                  Yes, it does use NATS JetStream.
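
                  For context, a minimal sketch of that worker-side setup with nats.go (the subject and durable names here are made up, not pg_flo's actual ones):

                    package main

                    import (
                        "log"

                        "github.com/nats-io/nats.go"
                    )

                    func main() {
                        nc, err := nats.Connect(nats.DefaultURL)
                        if err != nil {
                            log.Fatal(err)
                        }
                        defer nc.Drain()

                        js, err := nc.JetStream()
                        if err != nil {
                            log.Fatal(err)
                        }

                        // Durable consumer with explicit acks; MaxAckPending(1) keeps a
                        // single message in flight so changes apply strictly in order.
                        _, err = js.Subscribe("pgflo.changes", func(m *nats.Msg) {
                            // apply the change to the destination, then ack
                            m.Ack()
                        }, nats.Durable("worker"), nats.ManualAck(), nats.MaxAckPending(1))
                        if err != nil {
                            log.Fatal(err)
                        }
                        select {} // block and keep consuming
                    }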

    • Onavo 4 days ago

      You can always use Supabase's realtime CDC tool, it's open source.

      • iyn 4 days ago

        Thanks for the suggestion. I've seen it on GitHub but haven't really looked into operating it. I was under the impression it's more focused on the real-time aspect than CDC. Do you have any experience managing it? I'm curious how easy it is to self-host and integrate with the rest of my system.

  • scirob 4 days ago

    Cool, hope it can be an alternative to Debezium. I never liked how Debezium first must copy the whole CDC state to Kafka. And you must set the Kafka retention time to infinity, which many Kafka-as-a-service systems don't allow anyway.

    • FridgeSeal 4 days ago

      > how it first must copy the whole CDC state to kafka

      There’s a setting that controls whether it will do a snapshot first. Turn it off and it will just start sending through new cdc entries.

      > you must set the kafka retention time to infinity

      Is this a new requirement? I've never had to do this.

    • jitl 4 days ago

      I recently set up a Debezium / Kafka Connect cluster and didn’t need to do either of these things. I have some topics with 7 day retention, and some with 0 second retention. No full scan at all, not necessary for my use case.

  • scirob 4 days ago

    pglogical can live inside Postgres; it looks like pg_flo is an external service, not an extension.

    Maybe a benefit actually. Do you think we could use pg_flo with Postgres-as-a-service instances like Azure Postgres, Supabase, Neon etc.? Like, you just read the WAL without needing to install an extension that isn't approved by the vendor.

    • shayonj 4 days ago

      Yeah, absolutely! There are other benefits to being an extension, and yes, you can use pg_flo with any PostgreSQL database or service, since it uses logical replication to listen for changes and CTIDs for bulk copies.
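
      For anyone curious what that needs on the database side, the primitives are stock PostgreSQL, which is why it works on managed services that allow logical replication; the names and output plugin below are illustrative, not necessarily what pg_flo sets up for you:

        package main

        import (
            "database/sql"
            "log"

            _ "github.com/lib/pq"
        )

        func main() {
            db, err := sql.Open("postgres", "postgres://localhost/mydb?sslmode=disable")
            if err != nil {
                log.Fatal(err)
            }
            // Requires wal_level = logical on the server (a setting most managed
            // providers let you toggle), with no custom extension involved.
            stmts := []string{
                `CREATE PUBLICATION demo_pub FOR TABLE users, orders`,
                `SELECT pg_create_logical_replication_slot('demo_slot', 'pgoutput')`,
            }
            for _, s := range stmts {
                if _, err := db.Exec(s); err != nil {
                    log.Fatal(err)
                }
            }
        }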

  • oriavicii 4 days ago

    Great tool! I've recently started experimenting with PostgreSQL logical replication as part of a project where we’re pruning data from our main database and archiving it to S3 for future use. However, some tables need to be available in a secondary Postgres instance specifically for read operations, which is where replication comes in.

    I’m encountering a challenge due to two types of delete operations in the main database. The first delete operation is for data pruning, and I don’t want these deletions to be reflected in the replicated database. The second delete operation is for rollback purposes, which alters the data state, and these deletions should indeed be replicated to maintain consistency.

    Is there a way to distinguish between these two delete operations so that only the rollback deletions are replicated? And would your tool support this kind of use case?

    • shayonj 3 days ago

      Thanks for sharing. That is an interesting use case. I reckon you will be able to take advantage of the rules or routing feature in pg_flo. Is there some column that can help you distinguish the two types of deletion operations, or perhaps the time of day?

      If it's easier, happy to also chat more on an issue in the repo or email (shayonj at gmail)

  • Dowwie 3 days ago

    I'll evaluate this during my next CDC endeavor. Also on my list is Dozer: https://github.com/getdozer/dozer

    • tempest_ 3 days ago

      If you click on getdozer.io you get redirected to https://langdb.ai/

      Looks like they are pivoting to LLM integration.

    • shayonj 3 days ago

      Thank you! Would love to receive feedback, if you are up for it.

    • dboreham 3 days ago

      How does this relate to disease control?

      • lhoff 3 days ago

        Not OP, but I assume he is referring to change data capture.

  • distantsounds 3 days ago

    https://github.com/eulerto/wal2json has existed for this purpose for many years: just transform the incoming records into JSON and parse it with literally any standard tool that reads the format. There's a literal Postgres extension for this.
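
    For anyone who hasn't used it, a minimal sketch of polling a wal2json slot (the slot name and connection details are made up; pg_recvlogical is the streaming alternative):

      package main

      import (
          "database/sql"
          "fmt"
          "log"

          _ "github.com/lib/pq"
      )

      func main() {
          db, err := sql.Open("postgres", "postgres://localhost/mydb?sslmode=disable")
          if err != nil {
              log.Fatal(err)
          }
          // One-time setup, assuming the wal2json output plugin is installed on the server:
          // SELECT pg_create_logical_replication_slot('json_slot', 'wal2json')

          // Each returned row's "data" column is a JSON document describing the
          // inserts/updates/deletes since the last call.
          rows, err := db.Query(`SELECT data FROM pg_logical_slot_get_changes('json_slot', NULL, NULL)`)
          if err != nil {
              log.Fatal(err)
          }
          defer rows.Close()
          for rows.Next() {
              var change string
              if err := rows.Scan(&change); err != nil {
                  log.Fatal(err)
              }
              fmt.Println(change) // parse with encoding/json, jq, etc.
          }
      }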

  • gkhartman 4 days ago

    I'm really looking forward to taking this for a spin. Thanks for sharing.

    Question: Can this handle network disconnection/instability mid copy?

    • shayonj 4 days ago

      Thank you! Today, an active stream is resumable, say in case a shutdown/SIGTERM comes up. It also stores some state on every flush. However, copy-and-stream isn't quite resumable today, but that's coming soon.

      Some more guardrails around network disruption and retries are on the way too.

  • davidarenas 4 days ago

    How does it compare against something like Sequin? Looking into something that will allow a cdc event driven workflow but also arbitrarily replaying the event stream on demand.

    [1] https://github.com/sequinstream/sequin

    • shayonj 4 days ago

      Sequin sounds like a nice fit for that. I like it and yeah it does have replay.

      With pg_flo, I am looking for the tool to be fairly hands-off, while at the same time declarative for data filtering, transformation, and re-routing tasks, starting with PostgreSQL as the destination for now. The less time you can spend on cleaning, moving / syncing your data the better. Also quite curious to see what other use cases folks find it useful for.

      Replay/reusability has been on my mind, and I have a somewhat old PR open for it. Will give it a shot soon.

      Open to any suggestions and feedback.

  • rileymichael 4 days ago

    Perfect timing — I was just looking into similar tools.

    If I want to do a bulk copy (say, nightly) with various transformations but not continually stream after, is that supported / a use case that’d be a good fit for your tool?

    • zorgmonkey 4 days ago

      It looks like, with what is currently implemented, you'd have to drop the tables from the destination if you want repeated copies, so probably not quite what you want. It's close enough to your use case that it still might be worth testing out, though.

  • seedless-sensat 3 days ago

    Looks interesting. Just wondering, how did you decide to use NATS as the transport, instead of Kafka?

    • shayonj 3 days ago

      Hi! Thanks for checking it out. Mostly based on my own familiarity and interest at the time, especially with NATS JetStream being out now. That said, depending on how the feedback and roadmap evolve, I have thought about supporting different intermediaries where it makes sense. The API interface is more or less left open from that perspective. I am also planning on making NATS optional, for smaller and simpler setups.

  • Kerbonut 3 days ago

    Would this be useful to upgrade major versions of PostgreSQL servers?

  • oulipo 4 days ago

    Would it be better to use replication rather than a backup on S3?

    • shayonj a day ago

      It depends on how you are backing up to S3 and in what format. If you are using something like pg_dump, this tool may not make sense. But if you want to, say, sync to S3 in Parquet or something like that, then it might. Parquet support is TBD for now.

  • ibgeek 4 days ago

    This is very cool!

    • shayonj 4 days ago

      Thank you! Still very early days, would love to hear any feedback.