3 comments

  • rorymalcolm 8 hours ago

    (Full disclosure, I work at incident.io!)

    We recently released our On-call product, and as part of that, had to think a lot about redundancy and 'failing safety'.

    Here's how we achieve it, and how we're thinking about it. I'm interested in whether other examples of this exist in the wild; I'd love to know more about how, e.g., Datadog achieves this.

  • lawrjone 7 hours ago

    Author here!

    It’s a fun problem to solve, and one I’ve come across before when trying to alert on your monitoring tool being down, but it’s slightly different when it’s your own product.

    Hopefully interesting if you’ve hit similar puzzles before.

  • 8 hours ago
    [deleted]