I'm a tedious broken record about this (among many other things), but if you haven't read this Richard Cook piece, I strongly recommend you stop reading this postmortem and go read Cook's piece first. It won't take you long. It's the single best piece of writing about this topic I have ever read, and I think it's the piece of technical writing that has done the most to change my thinking:
https://how.complexsystems.fail/
You can literally check off the things from Cook's piece that apply directly here. Also: when I wrote this comment, most of the thread was about root-causing the DNS thing that happened, which I don't think is the big story behind this outage. (Cook rejects the whole idea of a "root cause", and I'm pretty sure he's dead-on right about why.)
Another great lens for this is "Normal Accidents" theory, which argues that the most dangerous systems are those whose components are tightly coupled, whose interactions are complex and hard to control, and whose failures have serious consequences.
https://en.wikipedia.org/wiki/Normal_Accidents
How does knowing this help you avoid these problems? It doesn't seem to provide any guidance on what to do in the face of complex systems.
He's literally writing about Three Mile Island. He doesn't have anything to tell you about what concurrency primitives to use for your distributed DNS management system.
But: given finite resources, should you respond to this incident by auditing your DNS management systems (or all your systems) for race conditions? Or should you instead figure out how to make the Droplet Manager survive (in some degraded state) a partition from DynamoDB without entering congestive collapse? Is the right response an identification of the "most faulty components" and a project plan to improve them? Or is it closing the human expertise/process gap that prevented them from throttling DWFM for 4.5 hours?
Cook isn't telling you how to solve problems; he's asking you to change how you think about problems, so you don't rathole in obvious local extrema instead of being guided by the bigger picture.
I appreciate the details this went through, especially laying out the exact timelines of operations and how overlaying those timelines produces unexpected effects. One of my all-time favourite bits about distributed systems comes from the (legendary) GDC talk I Shot You First[1], where the speaker describes drawing sequence diagrams with tilted arrows to represent the flow of time and asking "Where is the lag?". This method has saved me many times throughout my career, from making games, to livestream and VoD services, to now fintech. Always account for the flow of time when doing a distributed operation: time's arrow always marches forward; your systems might not.
But the stale read didn't scare me nearly as much as this quote:
> Since this situation had no established operational recovery procedure, engineers took care in attempting to resolve the issue with DWFM without causing further issues
Everyone can make a distributed systems mistake (these things are hard). But I did not expect something as core as the service managing the leases on the physical EC2 nodes to not have a recovery procedure. Maybe I am reading too much into it; maybe what they meant was that they didn't have a recovery procedure for "this exact" set of circumstances, but it is a little worrying even if that were the case. EC2 is one of the original services in AWS. At this point I expect it to be so battle-hardened that very few edge cases would not have been identified. It seems that the EC2 failure was more impactful in a way, as it cascaded to more and more services (like the NLB and Lambda) and took more time to fully recover. I'd be interested to know what gets put in place there to make it even more resilient.
[1] https://youtu.be/h47zZrqjgLc?t=1587
It shouldn't scare you. It should spark recognition. This meta-failure-mode exists in every complex technological system. You should be, like, "ah, of course, that makes sense now". Latent failures are fractally prevalent and have combinatoric potential to cause catastrophic failures. Yes, this is a runbook they need to have, but we should all understand there are an unbounded number of other runbooks they'll need and won't have, too!
The thing that scares me is that AI will never be able to diagnose an issue it has never seen before. If there are no runbooks, there is no pattern recognition. This is something I've been shouting about for two years now; hopefully this issue makes AWS leadership understand that current-gen AI can never replace human engineering.
I'm much less confident in that assertion. I'm not bullish on AI systems independently taking over operations from humans, but catastrophic outages are combinations of less-catastrophic outages which are themselves combinations of latent failures, and when the latent failures are easy to characterize (as is the case here!), LLMs actually do really interesting stuff working out the combinatorics.
I wouldn't want to, like, make a company out of it (I assume the foundation model companies will eat all these businesses), but you could probably do some really interesting stuff with an agent that consumes telemetry and failure-model information and uses it to surface hypotheses about what to look at or what interventions to consider.
All of this is beside my original point, though: I'm saying you can't runbook your way to having a system as complex as AWS run safely. Safety in a system like that is a much more complicated process, unavoidably. Like: I don't think an LLM can solve the "fractal runbook requirement" problem!
So the DNS records' if-stale-then-needs-update logic was basically a variation on one of the "2 Hard Things in Computer Science": cache invalidation. Excerpt from the giant paragraph:
>[...] Right before this event started, one DNS Enactor experienced unusually high delays needing to retry its update on several of the DNS endpoints. As it was slowly working through the endpoints, several other things were also happening. First, the DNS Planner continued to run and produced many newer generations of plans. Second, one of the other DNS Enactors then began applying one of the newer plans and rapidly progressed through all of the endpoints. The timing of these events triggered the latent race condition. When the second Enactor (applying the newest plan) completed its endpoint updates, it then invoked the plan clean-up process, which identifies plans that are significantly older than the one it just applied and deletes them. At the same time that this clean-up process was invoked, the first Enactor (which had been unusually delayed) applied its much older plan to the regional DDB endpoint, overwriting the newer plan. The check that was made at the start of the plan application process, which ensures that the plan is newer than the previously applied plan, was stale by this time due to the unusually high delays in Enactor processing. [...]
It outlines some of the mechanics, but some might think it still isn't a "Root Cause Analysis" because there's no satisfying explanation of _why_ there were "unusually high delays in Enactor processing". A hardware problem? A misconfiguration causing unintended delays in Enactor behavior? Either the sequence of events leading up to that is considered unimportant, or Amazon is still investigating what made the Enactor behave in an unpredictable way.
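To make the race concrete: a toy sketch (made-up names and numbers, nothing like AWS's actual code) of the check-then-act pattern the excerpt describes, where the freshness check happens once and is already stale by the time the delayed writer finally writes:

    import threading
    import time

    endpoint = {"plan": 100}        # toy model of the record at the regional endpoint
    applied = {"generation": 100}   # generation the system believes is live

    def enact(generation, delay):
        # Check once at the start (the stale check from the postmortem)...
        if generation <= applied["generation"]:
            return
        time.sleep(delay)                        # ...then get unusually delayed
        endpoint["plan"] = generation            # last write wins, even if stale by now
        applied["generation"] = max(applied["generation"], generation)

    slow = threading.Thread(target=enact, args=(110, 0.5))  # older plan, delayed Enactor
    slow.start()
    time.sleep(0.1)          # let the slow Enactor pass its freshness check first
    enact(140, 0.0)          # fast Enactor applies the newest plan immediately
    slow.join()
    print(endpoint, applied)  # {'plan': 110} {'generation': 140}: the old plan overwrote the new one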
This is public messaging to explain the problem at large. This isn't really a post-incident analysis.
Before the active incident is “resolved” there's an evaluation of probable/plausible recurrence. Usually we/they would have potential mitigations and recovery runbooks prepared as well, to quickly react to any recurrence. Any likely open risks are actively worked to mitigate before the immediate issue is considered resolved. That includes around-the-clock dev team work if it's the best known path to mitigation.
Next, any plausible paths to “risk of recurrence” would be top dev-team priority (business hours) until those action items are completed and in deployment. That might include other teams with similar DIY DNS management, other teams who had less impactful queue-depth problems, or other similar “near miss” findings. Service team tech & business owners (PE, Sr PE, GM, VP) would be tracking progress daily until resolved.
Then, in the next few weeks at org- and AWS-level “ops meetings”, there are going to be in-depth discussions of the incident, response, underlying problems, etc. The goal there is organizational learning and broader dissemination of lessons learned, action items, best practices, etc.
My takeaway was that the race condition was the root cause. Take away that bug, and suddenly there's no incident, regardless of any processing delays.
Right. Sounds like it's a case of “rolling your own distributed systems algorithm” without the up-front investment in implementing a truly robust distributed system.
Often network engineers are unaware of some of the tricky problems that DS research has addressed/solved in the last 50 years because the algorithms are arcane and heuristics often work pretty well, until they don’t. But my guess is that AWS will invest in some serious redesign of the system, hopefully with some rigorous algorithms underpinning the updates.
Consider this a nudge for all you engineers that are designing fault tolerant distributed systems at scale to investigate the problem spaces and know which algorithms solve what problems.
Further, please don’t stop at RAFT. RAFT is popular because it is easy to understand, not because it is the best way to do distributed consensus. It is non-deterministic (thus requiring odd numbers of electors), requires timeouts for liveness (thus latency can kill you), and isn’t all that good for general-purpose consensus, IMHO.
> some serious redesign of the system, hopefully with some rigorous algorithms underpinning the updates
Reading these words makes me break out in a cold sweat :-) I really hope they don't.
Certainly seems like misuse of DNS. It wasn't designed to be a rapidly updatable consistent distributed database.
Why is the "DNS Planner" and "DNS Enactor" separate? If it was one thing, wouldn't this race condition have been much more clear to the people working on it? Is this caused by the explosion of complexity due to the over use of the microservice architecture?
It probably was a single-threaded python script until somebody found a way to get a Promo out of it.
Pick your battles, I'd guess. Given how huge AWS is, if you have desired state vs. a reconciler, you probably have more resilient operations generally and an easier job of finding and isolating problems; the flip side of that is that if you screw up your error handling, you get this. That aside, it seems strange to me they didn't account for the fact that a stale plan could get picked up over a new one, so maybe I misunderstand the incident/architecture.
> Why is the "DNS Planner" and "DNS Enactor" separate?
for a large system, it's in practice very nice to split up things like that - you have one bit of software that just reads a bunch of data and then emits a plan, and then another thing that just gets given a plan and executes it.
this is easier to test (you're just dealing with producing one data structure and consuming one data structure, and the planner doesn't even try to mutate anything), it's easier to restrict permissions (one side only needs read access to the world!), it's easier to do upgrades (neither side depends on the other existing or even being in the same language), it's safer to operate (the planner is disposable; it can crash or be killed at any time with no problem except update latency), it's easier to comprehend (humans can examine the planner output, which contains the entire state of the plan), it's easier to recover from weird states (you can in extremis hack the plan), etc etc. these are all things you appreciate more and more as your system gets bigger and more complicated.
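roughly, the shape is something like this (toy sketch with made-up names, not AWS's implementation): the planner is a pure read-the-world-and-emit-a-plan function, and the enactor only ever consumes a plan it didn't compute.

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Plan:
        generation: int   # monotonically increasing plan version
        records: dict     # desired DNS state, e.g. {"db.region.example": ["10.0.0.1", ...]}

    def plan_world(generation, observed_ips):
        """Planner: read-only; turns health observations into a desired-state plan."""
        healthy = sorted(ip for ip, ok in observed_ips.items() if ok)
        return Plan(generation, {"db.region.example": healthy})   # made-up record name

    def enact(plan, dns_api):
        """Enactor: given a plan, push it out record by record."""
        for name, ips in plan.records.items():
            dns_api.upsert(name, ips)   # dns_api is whatever client actually mutates DNS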
> If it was one thing, wouldn't this race condition have been much more clear to the people working on it?
no
> Is this caused by the explosion of complexity due to the over use of the microservice architecture?
no
it's extremely easy to second-guess the way other people decompose their services since randoms online can't see any of the actual complexity or any of the details and so can easily suggest it would be better if it was different, without having to worry about any of the downsides of the imagined alternative solution.
Agreed, this is a common division of labor and simplifies things. It's not entirely clear in the postmortem but I speculate that the conflation of duties (i.e. the enactor also being responsible for janitor duty of stale plans) might have been a contributing factor.
The Oxide and Friends folks covered an update system they built that is similarly split and they cite a number of the same benefits as you: https://oxide-and-friends.transistor.fm/episodes/systems-sof...
I mean, any time a service goes down even 1/100 the size of AWS, you have people crawling out of the woodwork giving armchair advice while having no domain-relevant experience. It's barely even worth taking the time to respond. The people with opinions of value are already giving them internally.
> The people with opinions of value are already giving them internally.
interesting take, in light of all the brain drain that AWS has experienced over the last few years. some outside opinions might be useful - but perhaps the brain drain is so extreme that those remaining don't realize it's occurring?
> ...there's no satisfying explanation of _why_ there were "unusually high delays in Enactor processing". Hardware problem?
Can't speak for the current incident but a similar "slow machine" issue once bit our BigCloud service (not as big an incident, thankfully) due to loooong JVM GC pauses on failing hardware.
Also, I don't know if I missed it, but they don't establish anything to prevent an outage if there's an unusually high delay again?
It's at the end: they disabled the DDB DNS automations around this, to be fixed before they re-enable them.
Seems like the Enactor should be checking the version/generation of the current record before it applies the new value, to ensure it never applies an old plan on top of a record updated by a newer plan. It wouldn't be as efficient, but that's just how it is. It's a basic compare-and-swap operation, so it could be handled easily within DynamoDB itself, where these records are stored.
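A minimal sketch of that compare-and-swap as a DynamoDB conditional write (the table name and attributes are made up; this is not AWS's internal schema, just the general ConditionExpression pattern):

    import boto3
    from botocore.exceptions import ClientError

    ddb = boto3.client("dynamodb")

    def apply_plan_record(endpoint_name, generation, ips):
        """Write the plan for an endpoint only if it is strictly newer than what's stored."""
        try:
            ddb.put_item(
                TableName="dns-plans",                      # hypothetical table
                Item={
                    "endpoint": {"S": endpoint_name},
                    "generation": {"N": str(generation)},
                    "ips": {"SS": ips},
                },
                # Reject the write if an equal or newer generation is already present.
                ConditionExpression="attribute_not_exists(#g) OR #g < :g",
                ExpressionAttributeNames={"#g": "generation"},
                ExpressionAttributeValues={":g": {"N": str(generation)}},
            )
            return True
        except ClientError as e:
            if e.response["Error"]["Code"] == "ConditionalCheckFailedException":
                return False    # stale plan: drop it instead of overwriting the newer one
            raise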
>Services like DynamoDB maintain hundreds of thousands of DNS records to operate a very large heterogeneous fleet of load balancers in each Region
Does that mean a DNS query for dynamodb.us-east-1.amazonaws.com can resolve to one of a hundred thousand IP addresses?
That's insane!
And also well beyond the limits of Route 53.
I'm wondering if they're constantly updating Route 53 with a smaller subset of records and using a low TTL to somewhat work around this.
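Purely speculative, but a rotation like that would just be a periodic UPSERT with a short TTL, something along these lines (zone ID and names are invented; this is not how AWS says they do it):

    import boto3

    route53 = boto3.client("route53")

    def publish_subset(zone_id, name, ips, ttl=5):
        """Replace the answer set with a small rotating subset of the healthy fleet."""
        route53.change_resource_record_sets(
            HostedZoneId=zone_id,
            ChangeBatch={"Changes": [{
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": name,     # hypothetical, e.g. "dynamodb.us-east-1.example."
                    "Type": "A",
                    "TTL": ttl,       # short TTL so the subset can rotate quickly
                    "ResourceRecords": [{"Value": ip} for ip in ips],
                },
            }]},
        )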
DNS-based CDNs are also effectively this: collect metrics from a datastore regarding system usage metrics, packet loss, latency etc and compute a table of viewer networks and preferred PoPs.
Unfortunately, hard documentation is difficult to provide, but that's how a CDN worked at a place I used to work for. There's also another CDN[1] which talks about the same thing in fancier terms.
[1] https://bunny.net/network/smartedge/
From a meta-analysis level: bugs will always happen, formal verification is hard, and sometimes it just takes a number of years to have some bad luck (I have hit bugs that were over 10 years old but, due to the low probability of them occurring, didn't surface for a long time).
If we assume that the system will fail, the logical thing to think about is how to limit the effects of that failure. In practice this means cell-based architecture, phased rollouts, and isolated zones.
To my knowledge AWS does attempt to implement cell-based architecture, but there are some cross-region dependencies, specifically on us-east-1, due to legacy. The real long-term fix for this is designing regions to be independent of each other.
This is a hard thing to do, but it is possible. I have personally been involved in disaster testing where a region was purposely firewalled off from the rest of the infrastructure. You find out very quickly where those cross-region dependencies lie, and many of them are in unexpected places.
Usually this work is not done due to lack of upper-level VP support and funding; it is easier to stick your head in the sand and hope bad things don't happen. The strongest supporters of this work are going to be the shareholders who are in it for the long run. If the company goes poof due to improper disaster testing, the shareholders are going to be the main bag holders. Making the board aware of the risks and the estimated probability of fundamentally company-ending events can help get this work funded.
Sounds like they went with Availability over Correctness with this design, but the problem is that if your core foundational config is not correct, you get no availability either.
I believe a report with timestamps not in UTC is a crime.
I think it makes sense in this instance. Because this occurred in us-east-1, the vast majority of affected customers are US based. For most people, it's easier to do the timezone conversion from PT than UTC.
us-east-1 is an exceptional Amazon region; it hosts many global services as well as services which are not yet available in other regions. Most AWS customers worldwide probably have an indirect dependency on us-east-1.
An epoch fail?
My guess is that PT was chosen to highlight the fact that this happened in the middle of the night for most of the responding ops folks.
(I don't know anything here, just spitballing why that choice would be made)
Their headquarters is in Seattle (Pacific Time.) But yeah, I hate time zones.
I gather the root cause was a latent race condition in the DynamoDB DNS management system that allowed an outdated DNS plan to overwrite the current one, resulting in an empty DNS record for the regional endpoint.
Correct?
I think you have to be careful with ideas like "the root cause". They underwent a metastable congestive collapse. A large component of the outage was them not having a runbook to safely recover an adequately performing state for their droplet manager service.
The precipitating event was a race condition with the DynamoDB planner/enactor system.
https://how.complexsystems.fail/
Why can't a race condition bug be seen as the single root cause? Yes, there were other factors that accelerated collapse, but those are inherent to DNS, which is outside the scope of a summary.
Because the DNS race condition is just one flaw in the system. The more important latent flaw† is probably the metastable failure mode for the droplet manager, which, when it loses connectivity to Dynamo, gradually itself loses connectivity with the Droplets, until a critical mass is hit where the Droplet manager has to be throttled and manually recovered.
Importantly: the DNS problem was resolved (to degraded state) in 1hr15, and fully resolved in 2hr30. The Droplet Manager problem took much longer!
This is the point of complex failure analysis, and why that school of thought says "root causing" is counterproductive. There will always be other precipitating events!
† which itself could very well be a second-order effect of some even deeper and more latent issue that would be more useful to address!
https://en.wikipedia.org/wiki/Swiss_cheese_model
Two different questions here.
1. How did it break?
2. Why did it collapse?
A1: Race condition
A2: What you said.
What is the purpose of identifying "root causes" in this model? Is the root cause of a memory corruption vulnerability holding a stale pointer to a freed value, or is it the lack of memory safety? Where does AWS gain more advantage: in identifying and mitigating metastable failure modes in EC2, or in trying to identify every possible way DNS might take down DynamoDB? (The latter is actually not an easy question, but that's the point!)
Two things can be important for an audience. For most, it's the race condition lesson. Locks are there for a reason. For AWS, it's the stability lesson. DNS can and did take down the empire for several hours.
Did DNS take it down, or did a pattern of latent failures take it down? DNS was restored fairly quickly!
Nobody is saying that locks aren't interesting or important.
Had no idea Dynamo was so intertwined with the whole AWS stack.
Yeah, for better or worse, AWS is a huge dogfooder. It's nice to know they trust their stuff enough to depend on it themselves, but it's also scary to know that the blast radius of a failure in any particular service can be enormous
I was kinda surprised by the lack of CAS on a per-endpoint plan version, or of patterns like rejecting stale writes via 2PC or a single-writer lease per endpoint.
Definitely a painful one with good learnings and kudos to AWS for being so transparent and detailed :hugops:
See https://news.ycombinator.com/item?id=45681136. The actual DNS mutation API does, effectively, CAS. They had multiple unsynchronized writers who raced without logical constraints or ordering to the changes. Without thinking much, they _might_ have been able to implement something like a vector clock, either through updating the zone serial or through another "sentinel record" that was always used for ChangeRRSets affecting that label/zone, like a TXT record containing a serialized change-set number or a "checksum" of the old + new state.
I'm guessing the "plans" aspect skipped that and they were just applying intended state, without trying to serialize them. And last-write-wins, until it doesn't.
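A sketch of that sentinel-record idea (zone layout and record names are invented): a DELETE in ChangeResourceRecordSets must match the live record exactly and the whole batch is atomic, so a mismatched serial rejects the entire change, which gives you CAS on top of last-write-wins.

    import boto3

    route53 = boto3.client("route53")

    def apply_plan(zone_id, label, expected_serial, new_serial, new_ips):
        """All-or-nothing: fails if another writer already advanced the sentinel serial."""
        route53.change_resource_record_sets(
            HostedZoneId=zone_id,
            ChangeBatch={"Changes": [
                # DELETE must match the existing record exactly; if a concurrent writer
                # already bumped the serial, this whole batch is rejected.
                {"Action": "DELETE", "ResourceRecordSet": {
                    "Name": f"_serial.{label}", "Type": "TXT", "TTL": 60,   # hypothetical sentinel
                    "ResourceRecords": [{"Value": f'"{expected_serial}"'}]}},
                {"Action": "CREATE", "ResourceRecordSet": {
                    "Name": f"_serial.{label}", "Type": "TXT", "TTL": 60,
                    "ResourceRecords": [{"Value": f'"{new_serial}"'}]}},
                {"Action": "UPSERT", "ResourceRecordSet": {
                    "Name": label, "Type": "A", "TTL": 5,
                    "ResourceRecords": [{"Value": ip} for ip in new_ips]}},
            ]},
        )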
Interesting use of the phrase “Route53 transaction” for an operation that has no hard transactional guarantees. Especially given the lack of transactional updates are what caused the outage…
I think you misunderstand the failure case. The ChangeResourceRecordSets call is transactional (or was when I worked on the service): https://docs.aws.amazon.com/Route53/latest/APIReference/API_....
The fault was two different clients with divergent goal states:
- one ("old") DNS Enactor experienced unusually high delays needing to retry its update on several of the DNS endpoints
- the DNS Planner continued to run and produced many newer generations of plans [Ed: this is key: it's producing "plans" of desired state, which do not include a complete transaction like a log or chain with previous state + mutations]
- one of the other ("new") DNS Enactors then began applying one of the newer plans
- then ("new") invoked the plan clean-up process, which identifies plans that are significantly older than the one it just applied and deletes them [Ed: the key race is implied here. The "old" Enactor is reading _current state_, which was the output of "new", and applying its desired "old" state on top. The discrepency is because apparently Planer and Enactor aren't working with a chain/vector clock/serialized change set numbers/etc]
- At the same time the first ("old") Enactor ... applied its much older plan to the regional DDB endpoint, overwriting the newer plan. [Ed: and here is where "old" Enactor creates the valid ChangeRRSets call, replacing "new" with "old"]
- The check that was made at the start of the plan application process, which ensures that the plan is newer than the previously applied plan, was stale by this time [Ed: Whoops!]
- The second Enactor’s clean-up process then deleted this older plan because it was many generations older than the plan it had just applied.
Ironically, Route 53 does have strong transactions on API changes _and_ serializes them _and_ has closed-loop observers to validate change sets globally on every dataplane host. So do other AWS services. And there are even some internal primitives for building replication or change-set chains like this. But it's also a PITA and takes a bunch of work, and when it _does_ fail you end up with global deadlock and customers who are really grumpy that they don't see their DNS changes going into effect.
Not for nothing, there’s a support group for those of us who’ve been hurt by WHU sev2s…
This is unreadable and terribly formatted.
Sounds like DynamoDB is going to continue to be a hard dependency for EC2, etc. I at least appreciate the transparency and hearing about their internal systems names.
I think it's time for AWS to pull the curtain back a bit and release a JSON document that shows a list of all internal service dependencies for each AWS service.
Would it matter? Would you base decisions on whether or not to use one of their products based on the dependency graph?
Yes.
I mean, something has to be the baseline data storage layer. I’m more comfortable with it being DynamoDB than something else that isn’t pushed as hard by as many different customers.
So the root cause is basically a Race Conditions 101 stale read?
Race condition and bad data validation.
Does DynamoDB run on EC2? If I read it right, EC2 depends on DynamoDB.
There are circular dependencies within AWS, but also systems to account for this (especially for cold starting).
Also, there really is no one AWS; each region is its own (now more than ever before, though some systems weren't built to support this).
> Since this situation had no established operational recovery procedure, engineers took care in attempting to resolve the issue with DWFM without causing further issues.
interesting.
Is it the internal DynamoDB that other people use?
> as is the case with the recently launched IPv6 endpoint and the public regional endpoint
It isn't explicitly stated in the RCA, but it is likely that these new endpoints were the straw that broke the camel's back for the DynamoDB load balancer DNS automation.
Would a conditional read/write solve this? Looks like some kind of stale read.
TLDR: A DNS automation bug removed all the IP addresses for the regional endpoints. The tooling that was supposed to help with recovery depends on the system it needed to recover. That’s a classic “we deleted prod” failure mode at AWS scale.
The BIND name server required each zone to have an increasing serial number.
So if you made a change you had to increase that number, usually using a timestamp like 20250906114509, which would be older/lower-numbered than 20250906114702, making it easy to determine which zone file had the newest data.
Seems like they sort of had the same setup, but with less rigidity about refusing to load older files.
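The zone-serial rule is just a refuse-anything-not-newer check, something like this toy sketch (real DNS serials are 32-bit and use RFC 1982 wraparound arithmetic, which this ignores):

    def should_load(current_serial, candidate_serial):
        """Only accept zone data that is strictly newer than what's already live."""
        return candidate_serial > current_serial

    assert should_load(20250906114509, 20250906114702)       # newer zone file: load it
    assert not should_load(20250906114702, 20250906114509)   # older zone file: refuse it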