It's not always DNS

(notes.pault.ag)

35 points | by todsacerdoti 8 hours ago

22 comments

The full maxim I was taught being, “it’s either DNS or permissions”.

The fatal design flaw for the Domain Name System was failure to learn from SCSI, viz. that it should always be possible to sacrifice a goat to whatever gods are necessary to receive a blessing of stability. It hardly remains to observe that animal sacrifice is non-normative for IETF standards-track documents and the consequences for distributed systems everywhere are plainly evident.

Goats notwithstanding, I think it is splitting hairs to suggest that the phrase “it’s always DNS” is erroneously reductive, merely because it does not explicitly convey that an adjacent control-plane mechanism updating the records may also be implicated. I don’t believe this aphorism drives a misconception that DNS itself is an inherently unreliable design. We’re not laughing it off to the extent of terminating further investigation, root-cause analysis, or subsequent reliability and consistency improvement.

More constructively, also observe that the industry standard joke book has another one covering us for this circumstance, viz. “There are only two hard problems in distributed systems: 2. Exactly-once delivery 1. Guaranteed order of processing 2. Exactly-once delivery”

ammmir an hour ago

what is the connection with SCSI?

whatever1 2 hours ago

Why do computer engineers refuse to talk with the manufacturing graybeards who operate critical systems at scale?

The design shit I am seeing would not pass even a preliminary review at a chemical plant.

prmoustache 4 hours ago

No, sometimes it is just Spanish football, as is the case for everything behind Cloudflare. That is why this blog is blocked right now and redirects to another page:

"Access to this IP address has been blocked in compliance with the Judgment of 18 December 2024, issued by Commercial Court No. 6 of Barcelona in ordinary proceedings (Commercial matters, art. 249.1.4)-1005/2024-H, brought by the Liga Nacional de Fútbol Profesional and by Telefónica Audiovisual Digital, S.L.U. https://www.laliga.com/noticias/nota-informativa-en-relacion..."

sodaclean 2 hours ago

It's intentional. If people can't use the internet, they're more likely to watch the "game." For once, management might have learned something from employees: take a dive, cry foul.

inlined an hour ago

Is this meant to be a defense of the DNS protocol? I’ve never assumed the meme was that the DNS protocol is flawed, but that these changes are particularly sensitive/dangerous.

At Google we noticed the main cause of outages is config changes. Does that mean external config is dangerous? Of course not! But it does remind you to be vigilant.

teddyh 4 hours ago

> a DNSSEC rollout bricking prod for hours

He links to the Slack incident. But that problem wasn’t caused by a DNSSEC rollout; the problem was entirely caused by a completely botched attempt to back out of DNSSEC, by doing it the worst way possible.

tptacek 34 minutes ago

What's your point?

Spooky23 4 hours ago

Paul Tagliamonte sounds like a nice guy who has thought about these issues at length. He's reached the second level of DNS enlightenment: "There's no way it's DNS".

Finality will arrive, and Paul will internalize the knowledge.

FuriouslyAdrift 4 hours ago

Well sure... it could be BGP

sshine 4 hours ago

I had the CEO and CTO of our ccTLD registry give a guest lecture to my CS students, and one question came up regarding the AWS incident.

Prior to the question, the CEO boasted a 100% uptime (not just five nines), and the CTO said “We’re basically 30 people maintaining a 1GB text file.”

So the question was, “How come 30 people can have 100% uptime, and the biggest cloud with all of its expertise can’t? Sure, it was DNS, but are you even doing the same thing?”

And the answer was, (paraphrasing) “No, what we do is simple. They use DNS to solve all sorts of distributed problems.”

As did the CTO with all of these new record types embedding authentication. But running CoreDNS in a Kubernetes megacluster is not “maintaining a 1GB text file”.

hdgvhicv 3 hours ago

Maintaining uptime on complex systems is hard.

That’s why the best systems have as little complexity as possible.

But that doesn’t help boost your resume or get a bonus.

jtbayly 5 hours ago

This is a beautifully designed page.

lucasban 4 hours ago

I wish it had a little bit more padding on mobile, but I agree otherwise

unilynx 4 hours ago

> but it is not the operational hazard it’s made out to be

Until you flip that DNSSEC toggle

ricudis 2 hours ago

It's always DNS, except when it's BGP.

kikoreis 5 hours ago

Resolver limitations, as opposed to server or protocol issues, are in my view the main reason why "it is always DNS".

sim7c00 8 hours ago

it could also be gamma rays or a variety of problems that seem to appear and disappear between chairs and keyboards.

memes are jokes. people taking jokes as something else is the problem.

bediger4000 5 hours ago

A lot of the time it's cabling.

ZebusJesus 4 hours ago

Tell that to AWS East 1

oliyoung 4 hours ago

Nope, the other times it's CORS

jongjong 2 hours ago

Though at least with CORS, once you actually get the damn thing working, it keeps working.