I like that we can advertise to our customers that over the last X years we have had better uptime than Amazon, Google, etc.
Just yesterday I saw another Hetzner thread where someone claimed AWS beats them in uptime and someone else blasted AWS for huge incidents. I bet his coffee tastes better this morning.
I honestly wonder if there is safety in the herd here. If you have a dedicated server in a rack somewhere that goes down and takes your site with it, or even if the whole data center has connectivity issues, then as far as the customer is concerned, you screwed up.
If you are on AWS and AWS goes down, that's covered in the news as a bunch of billion dollar companies were also down. Customer probably gives you a pass.
> If you are on AWS and AWS goes down, that's covered in the news as a bunch of billion dollar companies were also down. Customer probably gives you a pass.
Exactly - I've had clients say, "We'll pay for hot standbys in the same region, but not in another region. If an entire AWS region goes down, it'll be in the news, and our customers will understand, because we won't be their only service provider that goes down, and our clients might even be down themselves."
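For what it's worth, a minimal sketch (Python/boto3, hypothetical names like "web-asg") of what that "same region only" posture usually amounts to: an Auto Scaling group spread across several AZs of one region, with nothing standing by anywhere else.

  # Sketch only: multi-AZ redundancy inside a single region (us-east-1 here),
  # with no cross-region standby. Group and launch template names are made up.
  import boto3

  autoscaling = boto3.client("autoscaling", region_name="us-east-1")

  autoscaling.create_auto_scaling_group(
      AutoScalingGroupName="web-asg",
      LaunchTemplate={"LaunchTemplateName": "web-lt", "Version": "$Latest"},
      MinSize=2,
      MaxSize=6,
      DesiredCapacity=2,
      # Spreading across three AZs survives a single-AZ failure,
      # but an outage that takes out the whole region takes this with it.
      AvailabilityZones=["us-east-1a", "us-east-1b", "us-east-1c"],
  )

Going hot standby in a second region is the same idea repeated elsewhere, plus data replication and DNS failover to point at it, which is where the extra cost those clients balk at comes from.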
Show up at a meeting looking like you wet yourself, it’s all anyone will ever talk about.
Show up at a meeting where a whole bunch of people appear to have wet themselves, and we’ll all agree not to mention it ever again…
The Register calls it Microsoft 364, 363, ...
Reported uptimes are little more than fabricated bullshit.
They measure uptime using an average of "is any part of the chain even marginally working".
People, however, experience downtime as "is any part of the chain degraded".
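A toy illustration of that gap (Python, numbers invented): average each hop's own availability and the figure looks great; require every hop to be healthy at the same time, which is what the user actually experiences, and it drops fast.

  # Toy model: 1 = that hop fully working during an interval, 0 = degraded/down.
  # The numbers are made up purely to show how the two metrics diverge.
  chain = {
      "dns": [1, 1, 0, 1, 1, 1, 1, 1, 1, 1],
      "lb":  [1, 1, 1, 1, 0, 1, 1, 1, 1, 1],
      "app": [1, 0, 1, 1, 1, 1, 1, 1, 1, 1],
      "db":  [1, 1, 1, 1, 1, 1, 0, 1, 1, 1],
  }
  n = len(chain["dns"])

  # "Reported" uptime: each component's own availability, averaged.
  reported = sum(sum(series) / n for series in chain.values()) / len(chain)

  # "Experienced" uptime: the chain only counts as up when every hop is up at once.
  experienced = sum(all(step) for step in zip(*chain.values())) / n

  print(f"reported:    {reported:.0%}")     # 90%
  print(f"experienced: {experienced:.0%}")  # 60%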
Seems to have taken down my router's "smart wifi" login page, and there's no backup router-only login option! Brilliant work, Linksys...
WiFi login portal (Icomera) on the train I'm on doesn't work either.
Happened to lots of commercial routers too (free wifi with sign-in pages in stores for example) and that's way outside us-east-1
Was just on a Lufthansa and then United flight - both of which did not have WiFi. Was wondering if there was something going on at the infrastructure level.
Slack (Canvas and Huddles), CircleCI, and Bitbucket are also reporting issues due to this.
I'm getting rate limit issues on Reddit so it could be related.
When did Snapchat move out of GCP?
Since I'm 5+ years out from my NDA around this stuff, I'll give some high level details here.
Snapchat heavily used Google AppEngine to scale. This was basically a magical Java runtime that would 'hot path split' the monolithic service into lambda-like worker pools. Pretty crazy, but it worked well.
Snapchat leaned very heavily on this though and basically let Google build the tech that allowed them to scale up instead of dealing with that problem internally. At one point, Snap was >70% of all GCP usage. And this was almost all concentrated on ONE Java service. Nuts stuff.
Anyway, eventually Google was no longer happy with supporting this, and the corporate way of breaking up is "hey, we're gonna charge you 10x what you paid last year for this, kay?" (I don't know if it was actually 10x. It was just a LOT more.)
So began the migration towards Kubernetes and AWS EKS. Snap was one of the pilot customers for EKS before it was generally available, iirc. (I helped work on this migration in 2018/2019)
Now, 6+ years later, I don't think Snap heavily uses GCP for traffic unless they migrated back. And this outage basically confirms that :P
That's so interesting to me. I always assume companies like Google, which have "unlimited" dollars, will always be happy to eat the cost to keep customers, especially given that GCP usage outside Google's internal services is way smaller compared to Azure and AWS. Also interesting to see that Snapchat had a hacky solution with AppEngine.
The "unlimited dollars" come from somewhere after all.
GCP is behind in market share, but has the incredible cheat advantage of just not being Amazon. Most retailers won't touch Amazon services with a ten foot pole, so the choice is GCP or Azure. Azure is way more painful for FOSS stacks, so GCP has its own area with only limited competition.
GCP, as I understand it, is the e-commerce/retail choice for this reason, with not being Amazon as the main draw.
Honestly as a (very small) shareholder in Amazon, they should spin off AWS as a separate company. The Amazon brand is holding AWS back.
They might have an implicit dependency on AWS, even if they're not primarily hosted there.
The internal disruption reviews are going to be fun :)
The fun is really gonna start if the root cause of this somehow implicates an AI as a primary cause.
I haven't seen the "90% of our code is AI" nonsense from Amazon.
It’s gonna be DNS
Your remark made me laugh, but...:
"Oct 20 3:35 AM PDT The underlying DNS issue has been fully mitigated, and most AWS Service operations are succeeding normally now. Some requests may be throttled while we work toward full resolution."
https://health.aws.amazon.com/health/status
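If anyone is still hitting the throttling they mention, the usual client-side knob (a boto3 sketch, nothing specific to this incident; the table name is a placeholder) is to widen retries with backoff:

  # Sketch: widen client-side retries while the API is shedding load.
  # "adaptive" mode adds client-side rate limiting on top of exponential backoff.
  import boto3
  from botocore.config import Config

  retry_config = Config(retries={"max_attempts": 10, "mode": "adaptive"})

  dynamodb = boto3.client("dynamodb", region_name="us-east-1", config=retry_config)

  # Throttled/5xx responses are retried with backoff before an exception
  # ever reaches your code; "my-table" is a placeholder name.
  resp = dynamodb.describe_table(TableName="my-table")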
It’s always DNS! Except when it’s the firewall.
It's never an AI's fault since it's up to a human to implement the AI and put in a process that prevents this stuff from happening.
So blame humans even if an AI wrote some bad code.
I agree, but then again it's always a human's fault in the end, so the root cause will probably have a bit more nuance. I was more thinking of the possible headlines and how they would potentially affect the public AI debate, since this event is big enough to actually get the attention of, e.g., risk management at not-insignificant orgs.
> but then again it's always a human's fault in the end
Disagree: a human might be the cause/trigger, but the fault is pretty much always systemic. A whole lot of things have to happen for that last person to cause the problem.
Also agree. "What" built the system, though? (Humans)
Edit: and, more importantly, who governed the system, i.e. made decisions about maintenance, staffing, training, processes, and so on.
Atlassian cloud is also having issues. Closing in on the 3 hour mark.
So, uh, over the weekend I decided to use the fact that my company needs a status checker/page to try out Elixir + Phoenix LiveView, and just now I found out my region is down while tinkering with it and watching Final Destination. That’s a little too on the nose for my comfort.
Well at least you don't have to figure out how to test your setup locally.
Now, I may well be naive - but isn't the point of these systems that you fail over gracefully to another data centre and no-one notices?
It should be! When I was a complete newbie at AWS my first question was why do you have to pick a region, I thought the whole point was you didn't have to worry about that stuff
As far as I know, region selection is about regulation, privacy, and the guarantees around those.
The region labels found within the metadata are very, very powerful.
They make lawyers happy and they stop intelligence services from accessing the associated resources.
For example, no one would even consider accessing data from a European region without the right paperwork.
Because if they were caught, they'd have to pay _thousands_ of dollars in fines and get sternly talked to by high-ranking officials.
> another data centre
Yes, within the same region. Doing stuff cross-region takes a little bit more effort and cost, so many skip it.
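And the "little bit more effort" for cross-region is typically DNS-level failover, roughly like this Route 53 sketch (Python/boto3; zone ID, domain, and IPs are placeholders):

  # Sketch of Route 53 failover routing: a primary record in one region and a
  # secondary in another, with a health check gating the primary.
  import boto3

  route53 = boto3.client("route53")

  hc = route53.create_health_check(
      CallerReference="primary-hc-1",  # any unique string
      HealthCheckConfig={
          "Type": "HTTPS",
          "FullyQualifiedDomainName": "primary.example.com",
          "Port": 443,
          "ResourcePath": "/healthz",
          "RequestInterval": 30,
          "FailureThreshold": 3,
      },
  )

  route53.change_resource_record_sets(
      HostedZoneId="Z0000000000000",  # placeholder
      ChangeBatch={"Changes": [
          {"Action": "UPSERT", "ResourceRecordSet": {
              "Name": "app.example.com", "Type": "A", "TTL": 60,
              "SetIdentifier": "primary-us-east-1", "Failover": "PRIMARY",
              "HealthCheckId": hc["HealthCheck"]["Id"],
              "ResourceRecords": [{"Value": "203.0.113.10"}]}},
          {"Action": "UPSERT", "ResourceRecordSet": {
              "Name": "app.example.com", "Type": "A", "TTL": 60,
              "SetIdentifier": "secondary-eu-west-1", "Failover": "SECONDARY",
              "ResourceRecords": [{"Value": "198.51.100.20"}]}},
      ]},
  )

If the health check on the primary fails, Route 53 starts answering with the secondary region's record, which is about the cheapest way to get the "no-one notices" behaviour across regions.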
quay.io was down: https://status.redhat.com
quay.io is down
Now I know why the documents I was sending to my Kindle didn't go through.
Well, it looks like that takes down Docker Hub as well.
Yep, was just thinking the same when my Kubernetes cluster failed a HelmRelease due to a pull error…