Durable execution, the hard way

(github.com)

41 points | by abelanger a day ago

3 comments

  • liampulles 7 minutes ago

    I think the 80/20 solution for reliable workflows is:

    - Ensure the workflow is idempotent: if it stops or fails at any point, you should be able to start it from scratch and skip or happily redo the steps that already ran.

    - Store the messages which trigger workflows.

    - Track failures (if your log aggregation is good, even that's enough to start).

    Then when the odd thing fails (or sometimes a bunch of things fail, e.g. because a core integration goes down), you can look up the messages and use a little script or tool to re-queue them. This is an easy starting point that can keep you going for a long time, until you really approach huge scale.
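
    A minimal sketch of the three points above, not taken from the linked repo. It assumes a local SQLite table stands in for the message store, and every name in it (open_store, store_message, run_workflow, requeue_failed, the messages and done_steps tables) is hypothetical:

```python
import json
import sqlite3


def open_store(path=":memory:"):
    """Open the (hypothetical) message store and create its tables."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS messages ("
        "id TEXT PRIMARY KEY, payload TEXT NOT NULL, "
        "status TEXT NOT NULL DEFAULT 'pending', error TEXT)"
    )
    db.execute(
        "CREATE TABLE IF NOT EXISTS done_steps ("
        "message_id TEXT, step TEXT, PRIMARY KEY (message_id, step))"
    )
    return db


def store_message(db, message_id, payload):
    """Store the message that triggers a workflow; duplicates are ignored."""
    db.execute(
        "INSERT OR IGNORE INTO messages (id, payload) VALUES (?, ?)",
        (message_id, json.dumps(payload)),
    )
    db.commit()


def run_step(db, message_id, step, fn, payload):
    """Run one step unless it is already recorded as done for this message."""
    done = db.execute(
        "SELECT 1 FROM done_steps WHERE message_id = ? AND step = ?",
        (message_id, step),
    ).fetchone()
    if done:
        return  # skip: this step completed on an earlier attempt
    # If the process dies after fn() but before the insert below, fn runs
    # again on retry, so fn itself should be safe to redo.
    fn(payload)
    db.execute("INSERT INTO done_steps VALUES (?, ?)", (message_id, step))
    db.commit()


def run_workflow(db, message_id, steps):
    """Run all steps from the top; return True on success, False on failure."""
    row = db.execute(
        "SELECT payload FROM messages WHERE id = ?", (message_id,)
    ).fetchone()
    payload = json.loads(row[0])
    try:
        for name, fn in steps:
            run_step(db, message_id, name, fn, payload)
    except Exception as exc:
        # Track the failure so it can be found and re-queued later.
        db.execute(
            "UPDATE messages SET status = 'failed', error = ? WHERE id = ?",
            (repr(exc), message_id),
        )
        db.commit()
        return False
    db.execute(
        "UPDATE messages SET status = 'done', error = NULL WHERE id = ?",
        (message_id,),
    )
    db.commit()
    return True


def requeue_failed(db, steps):
    """The 'little script': re-run every failed message, return recovered ids."""
    failed = [
        r[0]
        for r in db.execute(
            "SELECT id FROM messages WHERE status = 'failed'"
        ).fetchall()
    ]
    return [mid for mid in failed if run_workflow(db, mid, steps)]
```

    Recording completed steps per message is one way to get the "skip" behaviour; steps that are naturally idempotent could drop that table and simply be redone.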

  • pkaler 2 hours ago

    I found the accompanying blog post excellent. In my experience, systems go from a monolith to a distributed monolith to a reliable distributed system. A durable workflow engine is one of the pieces required to get to that target state.

    https://hatchet.run/blog/durable-execution