Based on the number of times I've seen these posted about, they seem quite frequent [0]. If I'm being honest, the entire BGP system seems to be very fragile, with a massive blast radius. I get that it's super 'core' so it's hard to fix, and that it comes from a time when the Internet was more 'cooperative' (in the protocol sense of the word), but are there any attempts at a successor, or is that fundamentally impossible?
Surely the notion of who owns an AS should be cryptographically held, so that an update has to be signed. Updates should be infrequent, so the cost is felt on the control plane, not on the data plane.
I'm sure there's a BGPsec or whatever, like all the other ${oldTech}Sec, but I don't know if there's a realistic solution here or if it's IPv6-style tech.
0: I looked it up before posting and it's 3000 leakers with 12 million leaks per quarter https://blog.qrator.net/en/q3-2022-ddos-attacks-and-bgp-inci...
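For what it's worth, the "cryptographically held ownership" part does exist today as RPKI: resource holders publish signed ROAs stating which AS may originate which prefixes, and routers can drop announcements that fail origin validation. It only covers the origin AS, not the whole path (that's what BGPsec tries to add), so it does not catch every kind of leak. A minimal sketch of the validation logic, with made-up ROA data:

    import ipaddress

    # Hypothetical ROAs: (authorized prefix, max prefix length, authorized origin ASN).
    ROAS = [
        (ipaddress.ip_network("203.0.113.0/24"), 24, 64500),
    ]

    def validate_origin(prefix: str, origin_asn: int) -> str:
        net = ipaddress.ip_network(prefix)
        covered = False
        for roa_net, max_len, roa_asn in ROAS:
            if net.version == roa_net.version and net.subnet_of(roa_net):
                covered = True  # some ROA covers this announcement
                if net.prefixlen <= max_len and origin_asn == roa_asn:
                    return "valid"
        # Covered by a ROA but no match -> invalid; no covering ROA -> unknown
        # (many networks still accept "unknown" routes today).
        return "invalid" if covered else "unknown"

    print(validate_origin("203.0.113.0/24", 64500))   # valid
    print(validate_origin("203.0.113.0/24", 64666))   # invalid
    print(validate_origin("198.51.100.0/24", 64500))  # unknown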
Globally, it is as you want it to be.
Locally, BGP is peer-to-peer — literally! — and no particular peer is forced to check everything, and nobody's even trying to make a single global routing table so local agreements can override anything at a higher level.
I see. That makes sense.
A route leak often goes like this: an ISP in Pakistan is ordered to censor YouTube, so they add an internal route for YouTube's IP addresses that sends traffic to their censoring machine, or to nowhere. Their edge routers are accidentally configured to pass this route on to all their connected networks instead of keeping it internal. Some of their peers accept it as the shortest route to YouTube and install it into their own networks. Others recognize it's not the real YouTube and ignore it. Transit providers check route authorization more thoroughly than peers, so none of them accept it and the route doesn't spread globally.
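To make the "shortest route" part concrete, here is a toy sketch of the tie-break that can let a leaked announcement win. Real BGP best-path selection has many more steps; this keeps only the two that matter here (local-preference, where peer routes are often preferred over transit, then AS-path length). The AS numbers and local-pref values are just illustrative.

    from dataclasses import dataclass

    @dataclass
    class Route:
        via: str          # where we learned it
        local_pref: int   # many networks give peer routes a higher local-pref than transit
        as_path: tuple    # AS numbers, nearest first

    def best(a: Route, b: Route) -> Route:
        if a.local_pref != b.local_pref:                 # higher local-pref wins
            return a if a.local_pref > b.local_pref else b
        return a if len(a.as_path) <= len(b.as_path) else b  # then shorter AS path

    legit = Route(via="transit", local_pref=100, as_path=(3356, 15169))
    leaked = Route(via="peer", local_pref=200, as_path=(17557,))
    print(best(legit, leaked).via)  # -> "peer": the leaked route wins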
That's like what, one major incident per month now, Nov 18, Dec 5, and now this one?
I'll bet JGC can write his own ticket by now, but unretiring would be really bad optics. He's on the board, though, and still keeping a watchful eye. But a couple more of these and CF's reputation will be in the gutter.
That’s what I also thought when I saw this incident. I wonder if there’s something up internally at Cloudflare or whether it was always like this.
I feel like something such as a route leak should not be something that happens to Cloudflare. I’m surprised they set their systems up to allow this human error.
John left in April last year, I think, so it probably isn't directly related; please take my comment in jest. Still, it is worrisome: CF is in many ways 'too big to fail', and if this really becomes a regular thing it is going to piss off a lot of people focused on their 'nines'.
I do appreciate these post-mortems from Cloudflare; however, I wish they would include the timestamps of their status page posts in their timelines.
In this case, the timeline states "IMPACT STOP" was at 20:50 UTC and the first post to their status page was 12 minutes later at 21:02 UTC:
"Cloudflare experienced a Network Route leak, impacting performance for some networks beginning 20:25 UTC. We are working to mitigate impact."
> we pushed a change via our policy automation platform to remove the BGP announcements from Miami
Is there any way to test these changes against a simulation of real world routes? Including to ensure that traffic that shouldn’t hit Cloudflare servers, continues to resolve routes that don’t hit Cloudflare?
I have to imagine there’s academic research on how to simulate a fork of global BGP state, no? Surely there’s a tensor representation of the BGP graph that can be simulated on GPU clusters?
If there’s a meta-rule I think of when these incidents occur, it’s that configuration rules need change management, and change management is only as good as the level of automated testing. Just because code hasn’t changed doesn’t mean you shouldn’t test the baseline system behavior. And here, that means testing that the Internet works.
I don't know why you would need a tensor whatever. Dump the state of the router (which peers are connected and for how long, what routes they are advertising and for how long) as well as the computed routing table and what routes are advertised to peers.
Set up a simulation router with the same state but the new config, and compute the routing table and what routes would be advertised to peers.
Confirm the diff in routing table and advertised routes is reasonable.
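A minimal sketch of that last step, assuming you can already dump the "what would I advertise to whom" set from both the real router and the simulated one (which is the hard part):

    def diff_advertisements(old: set, new: set) -> None:
        added = new - old
        withdrawn = old - new
        print(f"{len(added)} added, {len(withdrawn)} withdrawn")
        # A change meant only to stop announcing Bogota prefixes from Miami should
        # never *add* announcements; anything new is a red flag worth a human look.
        assert not added, f"unexpected new announcements: {sorted(added)[:5]}"

    # (prefix, peer) pairs dumped from the real router vs. the simulated one
    before = {("198.51.100.0/24", "peer-A"), ("203.0.113.0/24", "peer-A")}
    after = {("198.51.100.0/24", "peer-A")}
    diff_advertisements(before, after)  # one withdrawal, nothing added: looks reasonable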
This change seemed to mostly be about a single location. Other BGP config changes leading to problems are often global changes, but you can check diffs and apply the config change one host at a time. You can't really make a simultaneous change anyway. Maybe one host changing is ok, but the Nth one causes a problem... CF has a lot of BGP routers, so maybe checking every diff is too much, but at least check a few.
Is that something out of the box on routers? I don't know, people with BGP routers never let me play with them. But given the BGP haiku, I'd want something like that before I messed around with things. For the price you pay for these fancy routers, you should be able to buy an extra few to run sandboxed config testing on. You could also simulate with open source bgp software, but the proprietary BGP daemon on the router might not act like the open source one does.
> Is there any way to test these changes against a simulation of real world routes? Including to ensure that traffic that shouldn’t hit Cloudflare servers, continues to resolve routes that don’t hit Cloudflare?
You can get access to views of routes from different parts of the network, but you do not have access to those routers' policies, so no.
> I have to imagine there’s academic research on how to simulate a fork of global BGP state, no? Surely there’s a tensor representation of the BGP graph that can be simulated on GPU clusters?
Just simulating your peers, and maybe the layer after that, is most likely good enough. And you can probably do it with a bunch of cgroups and some actual routing software. There are also network sims like GNS3 that can even run real router images.
You can cross-reference RADB, the RIRs, and looking glass servers, and you'd find 3 different pictures of the internet.
I assume it's not possible unless you know the in-memory state of all the other gateway routers on the internet, no? You can know what they advertise, but that's not the same thing as a full description of their internal state and how they will choose to update if a route gets withdrawn.
I think you could know the state of the peers, simulate what they advertise and receive, and validate that. The test unit would need to be a simulated router that behaves exactly like the real one; I actually think it's technically doable with tight version control of the router software.
We already have the tools to stop this from happening today. The problem is not the technology but the fact that companies do not want to work together to fix it. It is sad that we let the internet break because people are too slow to use the safety features we have.
Their status pages were all green when we dealt with this.
Damn, I missed the fact Juniper was acquired by HPE, RIP
I’m a huge fan of flapping when it’s really hard to do progressive rollouts. What this would mean here is you switch advertising the old and new routes back and forth automatically and this happens let’s say for 1 minute max before the old config is restored. Then a human looks at various metrics before they push a button to really make the new config permanent. It gives you a cheap way to preflight what will happen when you make a globally impacting config change.
I’m not sure this would be a good idea in this kind of change.
Flapping is bad in the networking world.
Flapping BGP routes, specifically, is bad because it can stress all BGP routers involved to the point where they can “go crazy”. Routes are explicitly advertised, so if you keep changing the routes, you are tasking the router CPU to process new stuff, discard it and process new stuff. In fact, BGP route flaps are specifically the focus of an entire RFC: https://datatracker.ietf.org/doc/html/rfc2439
More in general, a flapping link (on/off/on/off) can really mess with TCP.
Flapping in the networking world is not something you want to do intentionally.
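For the curious, the RFC 2439 mechanism mentioned above boils down to a per-route penalty with exponential decay: each flap adds to the penalty, the route is suppressed while the penalty sits above a threshold, and it becomes usable again once the penalty decays below a lower threshold. A rough sketch, using typical vendor defaults rather than anything mandated by the RFC:

    import math

    PENALTY_PER_FLAP = 1000   # typical defaults on some vendors, illustrative only
    SUPPRESS_LIMIT = 2000
    REUSE_LIMIT = 750
    HALF_LIFE_S = 15 * 60     # 15-minute half-life

    def decay(penalty: float, elapsed_s: float) -> float:
        # exponential decay with a fixed half-life
        return penalty * math.exp(-elapsed_s * math.log(2) / HALF_LIFE_S)

    penalty, suppressed, last_t = 0.0, False, 0
    for t, event in [(0, "flap"), (60, "flap"), (120, "flap"), (3600, "check")]:
        penalty = decay(penalty, t - last_t)
        last_t = t
        if event == "flap":
            penalty += PENALTY_PER_FLAP
        if penalty > SUPPRESS_LIMIT:
            suppressed = True          # route withdrawn from consideration
        elif penalty < REUSE_LIMIT:
            suppressed = False         # penalty decayed enough to reuse the route
        print(f"t={t:>4}s penalty={penalty:7.1f} {'SUPPRESSED' if suppressed else 'ok'}")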
nice way to 100% the router CPUs for all your peers
Weak engineering, both from the Cloudflare side and from their peers.
I initially misread that as "Routine incident"
> and only affected IPv6 traffic
Why even bother to write an article about it then haha
The string of recent incidents don't really make the new CTO look good. Too much focus on shipping, not enough on shipping correctly.
Welcome to the age of AI-assisted coding.
I could have sworn "move fast and break things" existed before AI.
It did, but AI redefined the term "fast"
I've had to read the RCA a couple of times to (probably) get what happened, even though I'm reasonably familiar with BGP.
Basically, my understanding (simplified) is:
- they originally had a Miami router advertise Bogota prefixes (=subnets) to Cloudflare's peers. Essentially, Miami was handling Bogota's subnets. This is not an issue.
- because you don't normally advertise arbitrary prefixes via BGP, policies were used. These policies are essentially if/then statements, carrying out certain actions (advertise or not, add some tags or remove them,...) if some conditions are matched. This is completely normal.
- Juniper router configuration for this kind of policy is (simplifying):
    set <BGP POLICY NAME> from <CONDITION1>
    set <BGP POLICY NAME> from <CONDITION2>
    set <BGP POLICY NAME> then <ACTION1>
    set <BGP POLICY NAME> then <ACTION2>
    ...
- prior to the incident, CF changed its network so that Miami didn't have to handle Bogota subnets (maybe Bogota does it on its own, maybe there's another router somewhere else)
- the change aimed at removing the configurations on Miami which were advertising Bogota subnets
- the change implementation essentially removed, from every policy, all lines matching "from <prefix in the list of Bogota prefixes>". This is somewhat reasonable, because the same policy could handle both Bogota and, say, Quito prefixes, so you only want to remove the Bogota part.
HOWEVER, there was at least one policy like this:
(Before)

    set <BGP POLICY NAME> from is_internal(prefix) == True
    set <BGP POLICY NAME> from prefix in bogota_prefix_list
    set <BGP POLICY NAME> then advertise

(After)

    set <BGP POLICY NAME> from is_internal(prefix) == True
    set <BGP POLICY NAME> then advertise
Which basically means: if a prefix is internal, advertise it.
- an "internal prefix" is any prefix that was not received from another BGP entity (autonomous system)
- BGP routers in Cloudflare exchange routes with one another. This is again pretty normal.
- As a result of this change, all routes received by Miami from some other Cloudflare router were re-advertised by Miami
- the result is CF telling the Internet (more accurately, its peers) "hey, you know that subnet? Go ask my Miami router!"
- obviously, this increases bandwidth utilization and latency for traffic crossing the Miami router.
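A toy model of the failure mode in Python rather than Junos: the "from" conditions of a policy term are ANDed together, so deleting the Bogota prefix-list condition leaves a much broader term behind. The names and prefixes here are illustrative, not Cloudflare's actual configuration:

    BOGOTA_PREFIXES = {"203.0.113.0/24"}   # stand-in for the real prefix list

    def is_internal(route: dict) -> bool:
        # "internal" in the sense used above: not learned from another AS
        return route["learned_from"] == "cloudflare-internal"

    def matches_before(route: dict) -> bool:
        # both "from" conditions must hold (they are ANDed)
        return is_internal(route) and route["prefix"] in BOGOTA_PREFIXES

    def matches_after(route: dict) -> bool:
        # the prefix-list condition was deleted; only the broad condition remains
        return is_internal(route)

    routes = [
        {"prefix": "203.0.113.0/24", "learned_from": "cloudflare-internal"},
        {"prefix": "198.51.100.0/24", "learned_from": "cloudflare-internal"},
    ]
    print([r["prefix"] for r in routes if matches_before(r)])  # only the Bogota prefix
    print([r["prefix"] for r in routes if matches_after(r)])   # every internal route now matches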
I am not very familiar with Juniper config, but this phrase summarizes it well: "This means we (AS13335) took the prefix received from Meta (AS32934), our peer, and then advertised it toward Lumen (AS3356), one of our upstream transit providers." Basically, a prefix received over an eBGP session from a peer (a different AS) should not be re-advertised over another eBGP session toward a transit provider. As they mention in the next steps, good use of communities could help avoid this in the case of other misconfigurations.
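A hedged sketch of that community idea: tag every route at import with where it was learned, and have the export policy toward transit reject anything tagged as peer-learned, independent of whatever other terms exist. The community values below are made up, not Cloudflare's real scheme:

    # Illustrative community values only.
    FROM_PEER = "13335:20000"
    FROM_TRANSIT = "13335:30000"

    def import_route(route: dict, session: str) -> dict:
        tag = FROM_PEER if session == "peer" else FROM_TRANSIT
        communities = route.get("communities", set()) | {tag}
        return {**route, "communities": communities}

    def may_export_to_transit(route: dict) -> bool:
        # Hard rule, independent of any other policy term: never re-advertise a
        # peer-learned route toward a transit provider.
        return FROM_PEER not in route["communities"]

    peer_route = import_route({"prefix": "192.0.2.0/24"}, session="peer")  # stand-in for a peer-learned prefix
    own_route = {"prefix": "198.51.100.0/24", "communities": set()}        # a prefix originated internally
    print(may_export_to_transit(peer_route))  # False: the leak scenario gets blocked here
    print(may_export_to_transit(own_route))   # True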