Ofc I wouldn't use it for extremely high-scale event processing, but it's a great default for a message/task queue for 90% of business apps. If you're processing under a few hundred million events/tasks per day with fewer than ~10k concurrent processes dequeuing from it, it's what I'd default to.
I work on apps that use such a PG-based queue system, and it gives us indispensable features we couldn't achieve easily or cleanly with a normal queue system, such as dynamically adjusting the priority/order of tasks being processed and easily querying/reporting on the content of the queue. We have many other interesting features built into it that are more specific to our needs, which I'm more hesitant to describe in detail here.
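For a flavor of what those two features can look like, here's a minimal sketch against a hypothetical `tasks` table (the schema, names, and use of psycopg2 are my assumptions, not the commenter's actual system):

```python
# Minimal sketch of the two features above, against a hypothetical
# "tasks" table -- all names and the schema are assumptions.
import psycopg2

conn = psycopg2.connect("dbname=app")

with conn, conn.cursor() as cur:
    # Dynamically bump the priority of one customer's pending tasks.
    cur.execute(
        """
        UPDATE tasks
           SET priority = priority - 10   -- lower number = runs sooner
         WHERE customer_id = %s
           AND status = 'queued'
        """,
        (42,),
    )

    # Report on the live content of the queue -- awkward with an
    # opaque broker, trivial with SQL.
    cur.execute(
        """
        SELECT status, count(*), min(created_at)
          FROM tasks
         GROUP BY status
        """
    )
    for status, n, oldest in cur.fetchall():
        print(status, n, oldest)
```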
The biggest thing to watch out for with this approach is that you will inevitably have some failure or bug that 10x's, 100x's, or 1000x's the rate of dead messages and overloads your DLQ database. You need a circuit breaker or rate limit on it.
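A crude illustration of such a circuit breaker (the window size, threshold, and cooldown are made-up numbers; this is a sketch, not a production implementation):

```python
# Illustrative only: stop consuming when too large a fraction of
# recent messages is being dead-lettered.
import collections
import time

class DlqBreaker:
    def __init__(self, window=1000, max_dead_fraction=0.2, cooldown_s=60):
        self.outcomes = collections.deque(maxlen=window)  # True = dead-lettered
        self.max_dead_fraction = max_dead_fraction
        self.cooldown_s = cooldown_s
        self.tripped_at = None

    def record(self, dead_lettered: bool):
        self.outcomes.append(dead_lettered)
        full = len(self.outcomes) == self.outcomes.maxlen
        if full and sum(self.outcomes) / len(self.outcomes) > self.max_dead_fraction:
            self.tripped_at = time.monotonic()

    def allow_processing(self) -> bool:
        if self.tripped_at is None:
            return True
        if time.monotonic() - self.tripped_at > self.cooldown_s:
            self.tripped_at = None        # half-open: try again
            self.outcomes.clear()
            return True
        return False                      # stop consuming; page a human
```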
I worked on an app that sent an internal email with a stack trace whenever an unhandled exception occurred. It worked great until the day an OOM in a tight loop on a box in Asia sent a few hundred emails per second and saturated the company WAN backbone and the mailboxes of the whole team. Good times.
This! The only thing worse than your main queue backing up is dropping items on their way into the DLQ because it can't stay up.
If you can’t deliver to the DLQ, then what? Then you’re missing messages either way. Who cares if it’s down this way or the other?
Not necessarily. If you can't deliver the message somewhere, you don't ACK it, and the sender can choose what to do (retry, back off, etc.).
Sure, it's unavailability of course, but it's not data loss.
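A rough sketch of that sender-side behavior (`deliver` here is a hypothetical stand-in for whatever transport is in use, not a real API):

```python
# Sketch of "don't ACK, let the sender back off": the message stays
# with the sender until the receiver ACKs, so it's unavailability,
# not data loss. deliver() is a made-up stand-in.
import random
import time

def send_with_backoff(message, deliver, max_attempts=8):
    """Retry an unACKed send with jittered exponential backoff."""
    for attempt in range(max_attempts):
        if deliver(message):              # True = receiver ACKed
            return True
        sleep_s = min(60, 2 ** attempt) * random.uniform(0.5, 1.5)
        time.sleep(sleep_s)
    return False   # surface to a human or spool to durable storage
```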
If you are reading from Kafka (for example), you can't do anything with a message (broken JSON, say), and you can't put it into a DLQ, you have no other option but to skip it or stop on it, no?
Your place of last resort with Kafka is simply to replay the message back to the same Kafka topic, since you know it's up. In a simple single-consumer setup, just put a retry count on the message and increment it to get monitoring/alerting/etc. Multi-consumer? Put an enqueue-source tag on it and only process the messages tagged for you. This won't scale to infinity, but it scales really, really far for really, really cheap.
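A sketch of that replay trick using kafka-python (the topic name, retry limit, and `handle` are assumptions for illustration):

```python
# Replay-to-the-same-topic sketch: on failure, re-enqueue the message
# with a bumped retry count in a header. Names are assumptions.
from kafka import KafkaConsumer, KafkaProducer

TOPIC, MAX_RETRIES = "events", 5

consumer = KafkaConsumer(TOPIC, bootstrap_servers="localhost:9092",
                         group_id="workers")
producer = KafkaProducer(bootstrap_servers="localhost:9092")

def handle(value: bytes):
    ...  # your real processing; raises on failure

for msg in consumer:
    headers = dict(msg.headers or [])
    retries = int(headers.get("retries", b"0"))
    try:
        handle(msg.value)
    except Exception:
        if retries < MAX_RETRIES:
            # Re-enqueue to the SAME topic with the count bumped, so
            # monitoring can alert on messages with high retry counts.
            producer.send(TOPIC, msg.value,
                          headers=[("retries", str(retries + 1).encode())])
        else:
            ...  # give up: log loudly, alert, or drop
```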
Generally yes, but if you use e.g. the Parallel Consumer, you can potentially keep processing in that partition to avoid head-of-line blocking. There are downsides to having a very old unprocessed record: it won't let the consumer group's offset advance past that record, and the consumer instead keeps track of the individual offsets it has completed beyond it, so you don't want to be in that state indefinitely. But you hope your DLQ eventually succeeds.
But if your DLQ is overloaded, you probably want to slow down or stop, since sending a large fraction of your traffic to the DLQ is counterproductive. E.g. if you are sending 100% of messages to the DLQ due to a bug, you should stop processing, fix the bug, and then resume from your normal queue.
The point is to not take the whole server down with it. It keeps the other applications working.
Sure, but you still need to design around this problem. It’s going to be a happy accident that everything turns out fine if you don’t.
> FOR UPDATE SKIP LOCKED
Learned something new today. I knew what FOR UPDATE did, but somehow I've never RTFM'd hard enough to know about the SKIP LOCKED directive. That's pretty cool.
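For anyone else meeting it for the first time, here's the classic dequeue pattern sketched with psycopg2 against a hypothetical `jobs` table (schema and names are invented for illustration):

```python
# Two workers running this concurrently will never claim the same
# row: rows locked by another transaction are skipped, not waited on.
import psycopg2

conn = psycopg2.connect("dbname=app")

def claim_one_job():
    with conn, conn.cursor() as cur:
        cur.execute(
            """
            UPDATE jobs
               SET status = 'running', started_at = now()
             WHERE id = (
                   SELECT id FROM jobs
                    WHERE status = 'queued'
                    ORDER BY priority, id
                    LIMIT 1
                      FOR UPDATE SKIP LOCKED)
            RETURNING id, payload
            """
        )
        return cur.fetchone()  # None when the queue is empty
```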
I only learned about SKIP LOCKED because ChatGPT suggested it to solve a concurrency problem I had. Great tool for learning such things.
Great tool that wrote the blog post in the OP also, so it's quite versatile.
Why use ShedLock and SELECT FOR UPDATE SKIP LOCKED together? ShedLock stops things running in parallel (sort of), but the other makes parallel processing possible.
It would be interesting to see the numbers this system processes. My bet is that they're not that high.
Postgres is essentially a B-tree with a remote interface. Would you use a B-tree to store a dead letter queue? What is the big-O of insert and delete? What happens when it grows?
Postgres has a query interface, replication, backup and many other great utilities. And it’s well supported, so it will work for low-demand applications.
Regardless, you're using the wrong data structure with the wrong performance profile, and at the margins you will spend a lot more money and time than necessary running it. And service will suffer.
Another day, another “Using PostgreSQL for…” thing it wasn’t designed for. This isn’t a good idea. What happens when the queue goes down and all messages are dead lettered? What happens when you end up with competing messages? This is not the way.
The other system you're using that isn't Postgres can also go down.
Many developers overcomplicate systems. In the pursuit of 100% uptime, if you're not extremely careful, you remove more 9s with complexity than you add with redundancy. And although hyperscalers pride themselves on their uptime (Amazon even achieved three nines last year!), in reality most customers of most businesses are fine if your system is down for ten minutes a month. It's not ideal and you should probably fix it, but it's not catastrophic either.
What I’ve found is that, particularly with internal customers, they’re fine with an hour a month, possibly several, as long as not all of your eggs are in one basket.
The centralization push creates a situation where, if I have a task that needs three tools to accomplish and one of them goes down, they're all down. So all I can do is go for coffee or an early lunch, because I can't sub another task into that time slot. They're all blocked by The System being down, instead of a system being down.
If CI is borked, I can work on docs and catch up on emails. If the network is down, or the NAS is down and everything is on that NAS, then things are dire.
There are a ton of job/queue systems out there that are built on SQL databases. GoodJob and Supabase Queues are two examples.
It's not usable for high-scale processing, but most applications just need a simple queue with low depth and low complexity. If you're already managing Postgres and don't want to add more to your stack (and managed services aren't an option), this pattern works just fine. Go back 10-15 years and it was more common, especially in Ruby shops, since teams willing to adopt Kafka/Cassandra/etc. were rarer.
How so? There are queues that use SQL (or NoSQL) databases as the persistence layer. Your question is more about the implementation than about the database as a persistence layer itself. And there are ways to address it.
Criticism without a better solution is only so valuable.
How would you do this instead, and why?
You wouldn't ACK the message if you're not able to process it.
Segment uses MySQL as the queue itself, not just as a DLQ, and it works at their scale. So there are many (though not all) systems that can tolerate a database as the queue.
I have a simple flow: tasks on the order of thousands an hour. I just use PostgreSQL. High visibility, easy requeue, durable storage. With an appropriate index, it's perfectly fine. An LLM will write the SKIP LOCKED code right the first time. Easy local dev. I always reach for Postgres as the event bus in low-volume systems.
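As an illustration of the "easy requeue" point, a sketch against the same kind of hypothetical `jobs` table as above (again assuming psycopg2 and an invented schema):

```python
# "Easy requeue": flipping dead tasks back to 'queued' is one UPDATE
# instead of a broker-specific redrive tool. Schema is assumed.
import psycopg2

conn = psycopg2.connect("dbname=app")

with conn, conn.cursor() as cur:
    cur.execute(
        """
        UPDATE jobs
           SET status = 'queued', retries = retries + 1
         WHERE status = 'dead'
           AND failed_at > now() - interval '1 hour'
        """
    )
    print(f"requeued {cur.rowcount} tasks")
```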
lol a FOR UPDATE SKIP LOCKED post hits the HN homepage every few months it feels like
and another CTO will use this meme as a reason to "just use Postgres" for far longer than they should lmao
https://github.com/pgmq/pgmq