I like that we can advertise to our customers that over the last X years we have had better uptime than Amazon, Google, etc.
Just yesterday I saw another Hetzner thread where someone claimed AWS beats them in uptime and someone else blasted AWS for huge incidents. I bet his coffee tastes better this morning.
I honestly wonder if there is safety in the herd here. If you have a dedicated server in a rack somewhere that goes down and takes your site with it, or even if the whole data center has connectivity issues, then as far as the customer is concerned, you screwed up.
If you are on AWS and AWS goes down, that's covered in the news as a bunch of billion dollar companies were also down. Customer probably gives you a pass.
> If you are on AWS and AWS goes down, that's covered in the news as a bunch of billion dollar companies were also down. Customer probably gives you a pass.
Exactly - I've had clients say, "We'll pay for hot standbys in the same region, but not in another region. If an entire AWS region goes down, it'll be in the news, and our customers will understand, because we won't be their only service provider that goes down, and our clients might even be down themselves."
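For what it's worth, a minimal sketch (Python/boto3, hypothetical names like "web-asg") of what that "same region only" posture usually amounts to: an Auto Scaling group spread across several AZs of one region, with nothing standing by anywhere else.

  # Sketch only: multi-AZ redundancy inside a single region (us-east-1 here),
  # with no cross-region standby. Group and launch template names are made up.
  import boto3

  autoscaling = boto3.client("autoscaling", region_name="us-east-1")

  autoscaling.create_auto_scaling_group(
      AutoScalingGroupName="web-asg",
      LaunchTemplate={"LaunchTemplateName": "web-lt", "Version": "$Latest"},
      MinSize=2,
      MaxSize=6,
      DesiredCapacity=2,
      # Spreading across three AZs survives a single-AZ failure,
      # but an outage that takes out the whole region takes this with it.
      AvailabilityZones=["us-east-1a", "us-east-1b", "us-east-1c"],
  )

Going hot standby in a second region is the same idea repeated elsewhere, plus data replication and DNS failover to point at it, which is where the extra cost those clients balk at comes from.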
Show up at a meeting looking like you wet yourself, it’s all anyone will ever talk about.
Show up at a meeting where a whole bunch of people appear to have wet themselves, and we’ll all agree not to mention it ever again…
The Register calls it Microsoft 364, 363, ...
Reported uptimes are little more than fabricated bullshit.
They measure uptime using an average of "is any part of the chain even marginally working".
People, however, experience downtime as "is any part of the chain degraded".
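A toy illustration of that gap (Python, numbers invented): average each hop's own availability and the figure looks great; require every hop to be healthy at the same time, which is what the user actually experiences, and it drops fast.

  # Toy model: 1 = that hop fully working during an interval, 0 = degraded/down.
  # The numbers are made up purely to show how the two metrics diverge.
  chain = {
      "dns": [1, 1, 0, 1, 1, 1, 1, 1, 1, 1],
      "lb":  [1, 1, 1, 1, 0, 1, 1, 1, 1, 1],
      "app": [1, 0, 1, 1, 1, 1, 1, 1, 1, 1],
      "db":  [1, 1, 1, 1, 1, 1, 0, 1, 1, 1],
  }
  n = len(chain["dns"])

  # "Reported" uptime: each component's own availability, averaged.
  reported = sum(sum(series) / n for series in chain.values()) / len(chain)

  # "Experienced" uptime: the chain only counts as up when every hop is up at once.
  experienced = sum(all(step) for step in zip(*chain.values())) / n

  print(f"reported:    {reported:.0%}")     # 90%
  print(f"experienced: {experienced:.0%}")  # 60%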
Seems to have taken down my router's "smart wifi" login page, and there's no backup router-only login option! Brilliant work, Linksys...
WiFi login portal (Icomera) on the train I'm on doesn't work either.
Happened to lots of commercial routers too (free wifi with sign-in pages in stores for example) and that's way outside us-east-1
Was just on a Lufthansa and then United flight - both of which did not have WiFi. Was wondering if there was something going on at the infrastructure level.
Slack (Canvas and Huddles), CircleCI, and Bitbucket are also reporting issues due to this.
I'm getting rate limit issues on Reddit so it could be related.
When did Snapchat move out of GCP?
Since I'm 5+ years out from my NDA around this stuff, I'll give some high level details here.
Snapchat heavily used Google AppEngine to scale. This was basically a magical Java runtime that would 'hot path split' the monolithic service into lambda-like worker pools. Pretty crazy, but it worked well.
Snapchat leaned very heavily on this though and basically let Google build the tech that allowed them to scale up instead of dealing with that problem internally. At one point, Snap was >70% of all GCP usage. And this was almost all concentrated on ONE Java service. Nuts stuff.
Anyway, eventually Google was no longer happy with supporting this, and the corporate way of breaking up is "hey, we're gonna charge you 10x what you paid last year for this, kay?" (I don't know if it was actually 10x. It was just a LOT more.)
So began the migration towards Kubernetes and AWS EKS. Snap was one of the pilot customers for EKS before it was generally available, iirc. (I helped work on this migration in 2018/2019)
Now, 6+ years later, I don't think Snap heavily uses GCP for traffic unless they migrated back. And this outage basically confirms that :P
That's so interesting to me. I always assume companies like Google, which have "unlimited" dollars, will always be happy to eat the cost to keep customers, especially given that GCP usage outside Google's internal services is way smaller compared to Azure and AWS. Also interesting to see that Snapchat had a hacky solution with AppEngine.
The "unlimited dollars" come from somewhere after all.
GCP is behind in market share, but has the incredible cheat advantage of just not being Amazon. Most retailers won't touch Amazon services with a ten foot pole, so the choice is GCP or Azure. Azure is way more painful for FOSS stacks, so GCP has its own area with only limited competition.
GCP, as I understand it, is the e-commerce/retail choice for this reason, with not being Amazon as the main draw.
Honestly as a (very small) shareholder in Amazon, they should spin off AWS as a separate company. The Amazon brand is holding AWS back.
They might have an implicit dependency on AWS, even if they're not primarily hosted there.
The internal disruption reviews are going to be fun :)
The fun is really gonna start if the root cause of this somehow implicates an AI as a primary cause.
I haven't seen the "90% of our code is AI" nonsense from Amazon.
It’s gonna be DNS
Your remark made me laugh, but...:
"Oct 20 3:35 AM PDT The underlying DNS issue has been fully mitigated, and most AWS Service operations are succeeding normally now. Some requests may be throttled while we work toward full resolution."
https://health.aws.amazon.com/health/status
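If anyone is still hitting the throttling they mention, the usual client-side knob (a boto3 sketch, nothing specific to this incident; the table name is a placeholder) is to widen retries with backoff:

  # Sketch: widen client-side retries while the API is shedding load.
  # "adaptive" mode adds client-side rate limiting on top of exponential backoff.
  import boto3
  from botocore.config import Config

  retry_config = Config(retries={"max_attempts": 10, "mode": "adaptive"})

  dynamodb = boto3.client("dynamodb", region_name="us-east-1", config=retry_config)

  # Throttled/5xx responses are retried with backoff before an exception
  # ever reaches your code; "my-table" is a placeholder name.
  resp = dynamodb.describe_table(TableName="my-table")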
It’s always DNS! Except when it’s the firewall.
It's never an AI's fault since it's up to a human to implement the AI and put in a process that prevents this stuff from happening.
So blame humans even if an AI wrote some bad code.
I agree, but then again it's always a human's fault in the end, so the root cause will probably have a bit more nuance. I was more thinking of the possible headlines and how they would potentially affect the public AI debate, since this event is big enough to actually get the attention of, e.g., risk management at not-insignificant orgs.
> but then again it's always a human's fault in the end
Disagree: a human might be the cause/trigger, but the fault is pretty much always systemic. A whole lot of things have to happen for that last person to cause the problem.
Also agree. "What" built the system, though? (Humans)
Edit: and, more importantly, who governed the system, i.e. made decisions about maintenance, staffing, training, processes, and so on.
Atlassian cloud is also having issues. Closing in on the 3 hour mark.
So, uh, over the weekend I decided to use the fact that my company needs a status checker/page to try out Elixir + Phoenix LiveView, and just now I found out my region is down while tinkering with it and watching Final Destination. That’s a little too on the nose for my comfort.
Well at least you don't have to figure out how to test your setup locally.
Now, I may well be naive - but isn't the point of these systems that you fail over gracefully to another data centre and no-one notices?
It should be! When I was a complete newbie at AWS my first question was why do you have to pick a region, I thought the whole point was you didn't have to worry about that stuff
As far as I know, region selection is about regulation, privacy, and the guarantees around those.
The region labels found within the metadata are very, very powerful.
They make lawyers happy and they stop intelligence services from accessing the associated resources.
For example, no one would even consider accessing data from a European region without the right paperwork.
Because if they were caught, they'd have to pay _thousands_ of dollars in fines and get sternly talked to by high-ranking officials.
> another data centre
Yes, within the same region. Doing stuff cross-region takes a little bit more effort and cost, so many skip it.
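And the "little bit more effort" for cross-region is typically DNS-level failover, roughly like this Route 53 sketch (Python/boto3; zone ID, domain, and IPs are placeholders):

  # Sketch of Route 53 failover routing: a primary record in one region and a
  # secondary in another, with a health check gating the primary.
  import boto3

  route53 = boto3.client("route53")

  hc = route53.create_health_check(
      CallerReference="primary-hc-1",  # any unique string
      HealthCheckConfig={
          "Type": "HTTPS",
          "FullyQualifiedDomainName": "primary.example.com",
          "Port": 443,
          "ResourcePath": "/healthz",
          "RequestInterval": 30,
          "FailureThreshold": 3,
      },
  )

  route53.change_resource_record_sets(
      HostedZoneId="Z0000000000000",  # placeholder
      ChangeBatch={"Changes": [
          {"Action": "UPSERT", "ResourceRecordSet": {
              "Name": "app.example.com", "Type": "A", "TTL": 60,
              "SetIdentifier": "primary-us-east-1", "Failover": "PRIMARY",
              "HealthCheckId": hc["HealthCheck"]["Id"],
              "ResourceRecords": [{"Value": "203.0.113.10"}]}},
          {"Action": "UPSERT", "ResourceRecordSet": {
              "Name": "app.example.com", "Type": "A", "TTL": 60,
              "SetIdentifier": "secondary-eu-west-1", "Failover": "SECONDARY",
              "ResourceRecords": [{"Value": "198.51.100.20"}]}},
      ]},
  )

If the health check on the primary fails, Route 53 starts answering with the secondary region's record, which is about the cheapest way to get the "no-one notices" behaviour across regions.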
quay.io was down: https://status.redhat.com
quay.io is down
Now I know why the documents I was sending to my Kindle didn't go through.
Well, it looks like that takes down Docker Hub as well.
Yep, was just thinking the same when my Kubernetes cluster failed a HelmRelease due to a pull error…