Every week or so we interview a company and ask them if they have a fall-back plan in case AWS goes down or their cloud account disappears. They always have this deer-in-the-headlights look. 'That can't happen, right?'
Now imagine for a bit that it will never come back up. See where that leads you. The internet got its main strengths from the fact that it was completely decentralized. We've been systematically eroding that strength.
Planning for an AWS outage is a complete waste of time and energy for most companies. Yes it does happen but very rarely to the tune of a few hours every 5-10 years. I can almost guarantee that whatever plans you have won’t get you fully operational faster than just waiting for AWS to fix it.
Without having a well-defined risk profile that they’re designing to satisfy, everyone’s just kind of shooting from the hip with their opinions on what’s too much or too little.
>Yes it does happen but very rarely to the tune of a few hours every 5-10 years.
It is rare, but it happens at LEAST 2-3x a year. AWS us-east-1 has a major incident affecting multiple services (that affect most downstream aws services) multiple times a year. Usually never the same root cause.
Not very many people realize that there are some services that still run only in us-east-1.
Call it the AWS holiday. Most other companies will be down anyway. It's very likely that your company can afford to be down for a few hours, too.
Imagine if the electricity supplier took that stance.
Isn't that basically Texas?
It happens 2-3x a year during peacetime. Tail events are not homogeneously distributed across time.
Well technically AWS has never failed in wartime.
I don't understand, peacetime?
Peacetime = When not actively under a sustained attack by a nation-state actor. The implication being, if you expect there to be a “wartime”, you should also expect AWS cloud outages to be more frequent during a wartime.
Don't forget stuff like natural disasters and power failures...or just a very adventurous squirrel.
AWS (over-)reliance is insane...
It makes a lot more sense if they had a typo of peak
This is planning the future based on the best of the past. Not completely irrational, and if you can't afford a plan B, okayish.
But thinking the AWS SLA is guaranteed forever, and that everyone should put all their eggs in it because "everyone does it", is neither wise nor safe. Those who can afford it, and there are many businesses like that out there, should have a plan B. And actually AWS should not necessarily be plan A.
Nothing is forever. Not the Roman empire, not the Inca empire, not the Chinese dynasties, not US geopolitical supremacy. That's not a question of if but when. It doesn't need to come with a lot of suffering, but if we don't systematically organise for a humanity which spreads well-being for everyone in a systemically resilient way, we will face a lot more tragic consequences when this or that single point of failure finally falls.
> I can almost guarantee that whatever plans you have won’t get you fully operational faster than just waiting for AWS to fix it.
Absurd claim.
Multi-region and multi-cloud are strategies that have existed almost as long as AWS. If your company is in anything finance-adjacent or critical infrastructure then you need to be able to cope with a single region of AWS failing.
> If your company is in anything finance-adjacent or critical infrastructure then you need to be able to cope with a single region of AWS failing
That still fits in with "almost guarantee". It's not as though it's true for everyone, e.g. people who might trigger DR after 10 minutes of downtime, and have it up and running within 30 more minutes.
But it is true for almost everyone, as most people will trigger it after 30 minutes or more, and that, plus the time to execute DR, often isn't going to beat the AWS resolution time.
Best of all would be just multi-everything services from the start, and us-east-1 is just another node, but that's expensive and tricky with state.
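A back-of-envelope sketch of that trade-off in Python; every number here is an assumption for illustration, not a figure from this thread:

    # Compare "wait for AWS" vs "trigger DR" for a single outage.
    # All durations below are assumed, in minutes.
    outage_min = 75            # how long AWS takes to recover
    dr_trigger_after_min = 30  # how long you wait before declaring DR
    dr_execution_min = 45      # how long the failover itself takes

    downtime_waiting = outage_min
    downtime_dr = min(outage_min, dr_trigger_after_min + dr_execution_min)

    print(f"waiting: {downtime_waiting} min, DR: {downtime_dr} min")
    # With these numbers both paths land around 75 min, i.e. DR bought nothing;
    # it only pays off when trigger + execution is clearly shorter than the outage.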
I thought we were talking about an AWS outage, not just the outage of a single region? A single region can go out for many reasons, including but not limited to war.
> If your company is in anything finance-adjacent or critical infrastructure
GP said:
> most companies
Most companies aren't finance-adjacent or critical infrastructure
It’s not absurd, I’ve seen it happen. Company executes on their DR plan due to AWS outage, AWS is back before DR is complete, DR has to be aborted, service is down longer than if they’d just waited.
Of course there are cases where multi-cloud makes sense, but they are in the minority. The absurd claim is that most companies should worry about cloud outages and plan for AWS going offline forever.
My website running on an old laptop in my cupboard is doing just fine.
When your laptop dies it's gonna be a pretty long outage too.
I have two old machines that also work as NAS systems at my brother's and my parents' house... it will just fail over.
I have this theory of something I call “importance radiation.”
An old mini PC running a home media server and a Minecraft server will run flawlessly forever. Put something of any business importance on it and it will fail tomorrow.
That's a great concept. It explains a lot, actually!
> to the tune of a few hours every 5-10 years
I presume this means you must not be working for a company running anything at scale on AWS.
That is the vast majority of customers on AWS.
Ha ha, fair, fair.
Depends on how serious you are with SLA's.
It seems like this can be mostly avoided by not using us-east-1.
Thank you for illustrating my point. You didn't even bother to read the second paragraph.
I get you. I am with you. But isn't money/resources always a constraint to have a solid backup solution?
I guess the reason why people are not doing it is because it hasn't been demonstrated it's worth it, yet!
I've got to admit though, whenever I hear about having a backup plan I think of having an apples-to-apples copy elsewhere, which is probably not wise/viable anyway. Perhaps having just enough to reach out to the service users/customers would suffice.
Also I must add I am heavily influenced by a comment by Adrian Cockcroft on why going multi-cloud isn't worth it. He worked for AWS (at the time at least) so I should probably have reached for the salt dispenser.
> Now imagine for a bit that it will never come back up. See where that leads you. The internet got its main strengths from the fact that it was completely decentralized. We've been systematically eroding that strength.
my business contingency plan for "AWS shuts down and never comes back up" is to go bankrupt
Is that also your contingency plan for 'user uploads objectionable content and alerts Amazon to get your account shut down'?
Make sure you let your investors know.
If your mitigation for that risk is to have an elaborate plan to move to a different cloud provider, where the same problem can just happen again, then you’re doing an awful job of risk management.
> If your mitigation for that risk is to have an elaborate plan to move to a different cloud provider, where the same problem can just happen again, then you’re doing an awful job of risk management.
Where did I say that? If I didn't say it: could you please argue in good faith. Thank you.
"Is that also your contingency plan if unrelated X happens", and "make sure your investors know" are also not exactly good faith or without snark, mind you.
I get your point, but most companies don't need Y nines of uptime, heck, many should probably not even use AWS, k8s, serverless or whatever complicated tech gives them all these problems at all, and could do with something far simpler.
We all read it... AWS not coming back up is your point on not having a backup plan?
You might as well say the entire NY + DC metro loses power and "never comes back up". What is the plan around that? The person replying is correct: most companies do not have an actionable plan for AWS never coming back up.
I worked at a medium-large company and was responsible for reviewing the infrastructure BCP plan. It stated that AWS going down was a risk, and if it happens we wait for it to come back up. (In a lot more words than that).
If AWS goes down unexpectedly and never comes back up it's much more likely that we're in the middle of some enormous global conflict where day to day survival takes priority over making your app work than AWS just deciding to abandon their cloud business on a whim.
Can also be much easier than that. Say you live in Mexico, hosting servers with AWS in the US because you have US customers. But suddenly the government decides to place sanctions on Mexico, and US entities are no longer allowed to do business with Mexicans, so all Mexican AWS accounts get shut down.
For you as a Mexican the end result is the same, AWS went away, and considering there already is a list of countries that cannot use AWS, GitHub and a bunch of other "essential" services, it's not hard to imagine that that list might grow in the future.
Or Trump decided your country does not deserve it.
The internet is a weak infrastructure, relying on a few big cables and data centers. And through AWS and Cloudflare it has become worse? Was it ever true that the internet is resilient? I doubt it.
Resilient systems work autonomously and can synchronize - but don't need to synchronize.
* Git is resilient.
* Native e-mail clients - with local storage enabled - are somewhat resilient.
* A local package repository is somewhat resilient.
* A local file-sharing app (not Warp/Magic-Wormhole -> needs a relay) is resilient if it uses only local WiFi or Bluetooth.
We're building weak infrastructure. A lot of stuff should work locally and only optionally use the internet.
If you take into account "the web" vs "the internet", as others have mentioned:
Yes the Internet has stayed stable.
The Web, as defined by a bunch of servers running complex software, probably much less so.
Just the fact that it must necessarily be more complex means that it has more failure modes...
The internet seems resilient enough for all intents and purposes, we haven't had a global internet-wide catastrophe impacting the entire internet as far as I know, but we have gotten close to it sometimes (thanks BGP).
But the web, that's the fragile, centralized and weak point currently, and seems to be what you're referring to rather.
Maybe nitpicky, but I feel like it's important to distinguish between "the web" and "the internet".
In the case of a customer of mine the AWS outage manifested itself as Twilio failing to deliver SMSes. The fallback plan has been disabling the rotation of our two SMS providers and sending all messages with the remaining one. But what if the other one had something on AWS too? Or maybe both of them have something else vital on Azure, or Google Cloud, which will fail next week and stop our service. Who knows?
For small and medium sized companies it's not easy to perform accurate due diligence.
Most companies just aren't important enough to worry about "AWS never come back up." Planning for this case is just like planning for a terrorist blowing up your entire office. If you're the Pentagon sure you'd better have a plan for that. But most companies are not the Pentagon.
> Most companies just aren't important enough to worry about "AWS never come back up."
But a large enough number of "not too big to fail" companies becomes a too-big-to-fail event. Too many medium sized companies have a total dependency on AWS or Google or Microsoft, if not several at the same time.
We live in an increasingly fragile society, one step closer to critical failure, because big tech is not regulated in the same way as other infrastructure.
Well I agree. I kinda think the AI apocalypse would not be like Skynet killing us, but malware patched onto all the Teslas that causes a million crashes tomorrow morning.
Battery fires.
> The internet got its main strengths from the fact that it was completely decentralized.
Decentralized in terms of many companies making up the internet. Yes, we've seen heavy consolidation, with fewer than 10 companies now making up the bulk of the internet.
The problem here isn't caused by companies choosing one cloud provider over the other. It's the economies of scale leading us to a few large companies in any sector.
Well, that is exactly what resilient distributed networks are about. Not so much the technical details we implement them through, but the social relationships and the balance of political decision power.
Be it a company or a state, concentration of power that exceeds by a large margin what its purpose requires is always a sure way to spread corruption, create feedback loops around single points of failure, and buy everyone a ticket to some dystopian reality, with a level of certainty that beats anything an SLA will ever give us.
> Decentralized in terms of many companies making up the internet
Not companies, the protocols are decentralized and at some point it was mostly non-companies. Anyone can hook up a computer and start serving requests which was/is a radical concept, we've lost a lot, unfortunately
No we've not lost that at all. Nobody prevents you from doing that.
We have put more and more services on fewer and fewer vendors. But that's the consolidation and cost point.
> No we've not lost that at all. Nobody prevents you from doing that.
May I introduce you to our Lord and Slavemaster CGNAT?
That depends on who your ISP is.
I think one reason is that people are just bad at statistics. Chance of materialization * impact = small. Sure. Over a short enough time that's true for any kind of risk. But companies tend to live for years, decades even and sometimes longer than that. If we're going to put all of those precious eggs in one basket, as long as the basket is substantially stronger than the eggs we're fine, right? Until the day someone drops the basket. And over a long enough time span all risks eventually materialize. So we're playing this game, and usually we come out ahead.
But trust me, 10 seconds after this outage is solved everybody will have forgotten about the possibility.
Absolutely, but the cost of perfection (100% uptime in this case) is infinite.
As long as the outages are rare enough and you automatically fail over to a different region, what's the problem?
Often simply the lack of a backup outside of the main cloud account.
Sure, but on a typical outage how likely is it that you'll have that all up and running before the outage is resolved?
And secondly, how often do you create that backup and are you willing to lose the writes since the last backup?
That backup is absolutely something people should have, but I doubt those are ever used to bring a service back up. That would be a monumental failure of your hosting provider (colo/cloud/whatever)
> Sure, but on a typical outage how likely is it that you'll have that all up and running before the outage is resolved?
Not, but if some Amazon flunky decides to kill your account to protect the Amazon brand then you will at least survive, even if you'll lose some data.
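A rough sketch of what "a backup outside the main cloud account" can look like, assuming any S3-compatible target at a second provider; the endpoint, bucket and credentials below are hypothetical, and rclone or plain rsync to a box in a closet works just as well:

    import boto3

    src = boto3.client("s3")  # primary AWS account (normal credential chain)
    dst = boto3.client(
        "s3",
        endpoint_url="https://s3.other-provider.example",  # hypothetical S3-compatible target
        aws_access_key_id="OFFSITE_KEY",
        aws_secret_access_key="OFFSITE_SECRET",
    )

    def copy_object(bucket, key, offsite_bucket="offsite-backup"):
        """Copy one object out of the primary account to the offsite bucket."""
        body = src.get_object(Bucket=bucket, Key=key)["Body"].read()
        dst.put_object(Bucket=offsite_bucket, Key=f"{bucket}/{key}", Body=body)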
Additionally I find that most hyperscalers are trying to lock you in, by tailoring services which are industry standard with custom features which end up putting down roots and making a multi-vendor or lift-and-shift approach problematic.
Need to keep eyes peeled at all levels of the organization, as many of these enter through the day-to-day…
Yes, they're really good at that. This is just 'embrace and extend'. We all know the third.
What if the fall-back also never comes back up?
First, planning for an AWS outage is pointless. Unless you provide a service of national security or something, your customers are going to understand that when there's a global internet outage, your service doesn't work either. The cost of maintaining a working failover across multiple cloud providers is just too high compared to the potential benefits. It's astonishing that so few engineers understand that maintaining a technically beautiful solution costs time and money, which might not make a justified business case.
Second, preparing for the disappearance of AWS is even sillier. The chance that it will happen is orders of magnitude smaller than the chance that the cost of preparing for such an event will kill your business.
Let me ask you: how do you prepare your website for the complete collapse of western society? Will you be able to adapt your business model to a post-apocalyptic world where there are only cockroaches?
> Let me ask you: how do you prepare your website for the complete collapse of western society?
How did we go from 'you could lose your AWS account' to 'complete collapse of western society'? Do websites even matter in that context?
> Second, preparing for the disappearance of AWS is even more silly.
What's silly is not thinking ahead.
> Now imagine for a bit that it will never come back up.
Given the current geopolitical circumstances, that's not a far fetched scenario. Especially for us-east-1; or anything in the D.C. metro area.
I realize that my basement servers have better uptime than AWS this year!
I think most sysadmins don't plan for an AWS outage. And economically that makes sense.
But it makes me wonder, is sysadmin a lost art?
One main problem that we observed was that big parts of their IAM / auth setup were overloaded / down, which led to all kinds of cascading problems. It sounds as if Dynamo was reported to be a root cause, so is IAM dependent on Dynamo internally?
Of course, such a large control-plane system has all kinds of complex dependency chains. Auth/IAM seems like such a potential (global) SPOF that you'd like to reduce its dependencies to an absolute minimum. On the other hand, it's also the place that needs really good scalability, consistency, etc., so you'd probably like to use the battle-proven DB infrastructure you already have in place. Does that mean you end up with a complex cyclic dependency that needs complex bootstrapping when it goes down? Or how is that handled?
I think Amazon uses an internal platform called Dynamo as a KV store, which is different from DynamoDB, so I'm thinking the outage could be either a DNS routing issue or some kind of node deployment problem.
Both of which seem to crop up in post-mortems for these widespread outages.
Many AWS customers have bad retry policies that will overload other systems as part of their retries. DynamoDB being down will cause them to overload IAM.
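A minimal sketch of the client-side fix, assuming a generic Python caller (boto3's built-in "standard"/"adaptive" retry modes do something similar): capped exponential backoff with jitter, so retries don't pile onto an already-degraded dependency.

    import random
    import time

    def call_with_backoff(fn, max_attempts=5, base=0.5, cap=30.0):
        """Retry fn() with capped exponential backoff and full jitter,
        so an outage doesn't turn every client into a retry storm."""
        for attempt in range(max_attempts):
            try:
                return fn()
            except Exception:
                if attempt == max_attempts - 1:
                    raise
                # sleep a random amount up to the capped backoff for this attempt
                time.sleep(random.uniform(0, min(cap, base * (2 ** attempt))))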
LOL: make one DB service a central point of failure, charge gold for small compute instances. Rage about needing Multi-AZ, push the costs onto the developer/organization. But now fail at a region level, so are we going to need a multi-country setup for simple small applications?
According to their status page the fault was in DNS lookup of the Dynamo services.
Everything depends on DNS....
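Since the failure surfaced at the DNS layer, a cheap probe for this particular failure mode is just checking whether the endpoint still resolves; a Python sketch, using the hostname from the status update:

    import socket

    def endpoint_resolves(host="dynamodb.us-east-1.amazonaws.com", port=443):
        """Return True if the service endpoint still resolves locally."""
        try:
            socket.getaddrinfo(host, port)
            return True
        except socket.gaierror:
            return False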
Dynamo had an outage last year if I recall correctly.
We maybe distributed, but we die united...
Divided we stand,
United we fall.
AWS Communist Cloud
>circa 2005: Score:5, Funny on Slashdot
>circa 2025: grayed out on Hacker News
I thought it was a pretty well-known issue that the rest of AWS depends on us-east-1 working. Basically any other AWS region can get hit by a meteor without bringing down everything else – except us-east-1.
But it seems like only us-east-1 is down today, is that right?
Some global services have their control plane located only in `us-east-1`, without which they become read-only at best, or even fail outright.
https://docs.aws.amazon.com/whitepapers/latest/aws-fault-iso...
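One small mitigation on the caller side: newer SDKs can be pointed at regional STS endpoints instead of the legacy global one (sts.amazonaws.com, which is hosted in us-east-1), so token vending doesn't ride through a single region. A boto3 sketch, with the region chosen arbitrarily:

    import boto3

    # The legacy global endpoint resolves to us-east-1; a regional endpoint
    # keeps credential vending within the region you actually use.
    sts = boto3.client(
        "sts",
        region_name="eu-west-1",
        endpoint_url="https://sts.eu-west-1.amazonaws.com",
    )
    print(sts.get_caller_identity()["Account"])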
Just don't buy it if you don't want it. No one is forced to buy this stuff.
> No one is forced to buy this stuff.
Actually, many companies are de facto forced to do that, for various reasons.
How so?
Certification, for one. Governments will mandate 'x, y and/or z' and only the big providers are able to deliver.
That is not the same as mandating AWS, it just means certain levels of redundancy. There are no requirements to be in the cloud.
No, that's not what it means.
It means that in order to be certified you have to use providers that are in turn certified, or you will have to prove that you have all of your ducks in a row, and that goes way beyond certain levels of redundancy, to the point that most companies just give up and use a cloud solution because they have enough headaches just getting their internal processes aligned with various certification requirements.
Medical, banking and insurance, to name just a few, are heavily regulated, and to suggest that it 'just means certain levels of redundancy' is a very uninformed take.
Security/compliance theater for one
That's not a company being forced to, though?
It is if they want to win contracts
I don't think that's true. I think a company can choose to outsource that stuff to a cloud provider or not, but they can still choose.
Looks like they’re nearly done fixing it.
> Oct 20 3:35 AM PDT
> The underlying DNS issue has been fully mitigated, and most AWS Service operations are succeeding normally now. Some requests may be throttled while we work toward full resolution. Additionally, some services are continuing to work through a backlog of events such as Cloudtrail and Lambda. While most operations are recovered, requests to launch new EC2 instances (or services that launch EC2 instances such as ECS) in the US-EAST-1 Region are still experiencing increased error rates. We continue to work toward full resolution. If you are still experiencing an issue resolving the DynamoDB service endpoints in US-EAST-1, we recommend flushing your DNS caches. We will provide an update by 4:15 AM, or sooner if we have additional information to share.
It's a bit funny that they say "most service operations are succeeding normally now" when, in fact, you cannot yet launch or terminate EC2 instances, which is basically the defining feature of the cloud...
In that region, other regions are able to launch EC2s and ECS/EKS without a problem.
Llama-5-beelzebub has escaped containment. A special task force has been deployed to the Virginia data center to pacify it.
It's fun watching their list of "Affected Services" grow literally in front of your eyes as they figure out how many things have this dependency.
It's still missing the one that earned me a phone call from a client.
It's seemingly everything. SES was the first one that I noticed, but from what I can tell, all services are impacted.
In AWS, if you take out one of dynamo db, S3 or lambda you're going to be in a world of pain. Any architecture will likely use those somewhere including all the other services on top.
If the storage service in your own datacenter goes down, how much remains running?
When these major issues come up, all they have is symptoms and not causes. Maybe not until the Dynamo on-call comes on and says it's down; then everyone at least knows the reason for their team's outage.
The scale here is so large that they don't know the complete dependency tree until teams check in on what is out or not, growing this list. Of course most of it is automated, but getting onto 'Affected Services' is not.
Is this why reddit is down? (https://www.redditstatus.com/ still says it is up but with degraded infrastructure)
Shameless from them to make it look like it's a user problem. It was loading fine for me one hour ago, now I refresh the page and their message states I'm doing too many requests and should chill out (1 request per hour is too many for you?)
Never ascribe to malice that which is adequately explained by incompetence.
It’s likely that, like many organizations, this scenario isn’t something Reddit are well prepared for in terms of correct error messaging.
I remember that I made a website and then I got a report that it doesn't work on the newest Safari. Obviously, Safari would crash with a message blaming the website. Bro, no website should ever make your shitty browser outright crash.
I got a rate limit error which didn't make sense since it was my first time opening reddit in hours.
My minor 2000 users web app hosted on Hetzner works fyi. :-P
Right up until the DNS fails
I am using ClouDNS. That is an AnycastDNS provider. My hopes are that they are more reliable. But yeah, it is still DNS and it will fail. ;-)
But how are you going to web scale it!? /s
Web scale? It is a _web_ app, so it is already web scaled, hehe.
Seriously, this thing already runs on 3 servers. A primary + backup, and a secondary in another datacenter/provider at Netcup. DNS with another AnycastDNS provider called ClouDNS. Everything still way cheaper than AWS. The database is already replicated for reads. And I could switch to sharding if necessary. I can easily scale to 5, 7, whatever dedicated servers. But I do not have to right now. The primary is at 1% (sic!) load.
There really is no magic behind this. And you have to write your application in a distributable way anyway; you need to understand the concepts of stateless, write-locking, etc. with AWS too.
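For what it's worth, the read-replication part of a setup like this doesn't need much machinery; a sketch of read/write routing, assuming PostgreSQL and hypothetical hostnames:

    import psycopg2

    PRIMARY_DSN = "host=primary.example dbname=app"   # hypothetical hosts
    REPLICA_DSN = "host=replica.example dbname=app"

    def get_conn(readonly=False):
        """Send reads to the replica, writes to the primary.
        If the replica is down, reads fall back to the primary."""
        if readonly:
            try:
                return psycopg2.connect(REPLICA_DSN)
            except psycopg2.OperationalError:
                pass
        return psycopg2.connect(PRIMARY_DSN)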
funny that even if we have our app running fine in AWS europe, we are affected as developers because of npm/docker/etc being down. oh well.
AWS has made the internet into a single point of failure.
What's the point of all the auto-healing node-graph systems that were designed in the 70s and refined over decades: if we're just going to do mainframe development anyway?
To be fair, there is another point of failure, Cloudflare. It seems like half the internet goes down when Cloudflare has one of their moments
It looks like DNS has been restored: dynamodb.us-east-1.amazonaws.com. 5 IN A 3.218.182.189
I wonder if the new endpoint was affected as well.
dynamodb.us-east-1.api.aws
Seems like we need more anti-trust cases against AWS, or we need to break it up; it is becoming too big. Services used in the rest of the world get impacted by issues in one region.
But they aren't abusing their market power, are they? I mean, they are too big and should definitely be regulated, but I don't think you can argue they are much of a monopoly when others, at the very least Google, Microsoft, Oracle, Cloudflare (depending on the specific services you want) and smaller providers, can offer you the same service, many times with better pricing. Same way we need to regulate companies like Cloudflare for essentially being a MITM for ~20% of internet websites, per their 2024 report.
Twilio is down worldwide: https://status.twilio.com/
Seems to be really only in us-east-1, DynamoDB is performing fine in production on eu-central-1.
It’s that period of the year when we discover AWS clients that don’t have fallback plans
Related thread: https://news.ycombinator.com/item?id=45640772
I cannot log in to my AWS account. And the "my account" page on the regular Amazon website is blank on Firefox, but opens in Chrome.
Edit: I can log in to one of the AWS accounts (I have a few different ones for different companies), but my personal one, which has a ".edu" email, is not logging in.
https://news.ycombinator.com/item?id=45640754
Yeah, noticed from Zoom: https://www.zoomstatus.com/incidents/yy70hmbp61r9
Lots of outages happening in Norway, too. So I'm guessing it is a global thing.
This will always be a risk when sharecropping.
I cannot pull images from docker hub.
canva.com was down until a few minutes ago.
Coinbase down as well
Signal is down for me
Yes. https://status.signal.org/
Edit: Up and running again. SES and Signal seem to work again.
Major us-east-1 outages happened in 2011, 2015, 2017, 2020, 2021, 2023, and now again. I understand that us-east-1, N. VA, was the first DC, but for fuck's sake, they've had HOW LONG to finish AWS and make us-east-1 not be tied to keeping AWS up?
First, not all outages are created equal, so you cannot compare them like that.
I believe the 2021 one was especially horrific because it affected their DNS service (Route 53) and made writes to that service impossible. This made failovers not work, etc., so their prescribed multi-region setups didn't work.
But in the end, some things will have to synchronize their writes somewhere, right? So for DNS I could see how that ends up in a single region.
AWS is bound by the same rules as everyone else in the end... The only thing they have going for them is that they have a lot of money to make certain services resilient, but I'm not aware of a single system that's resilient to everything.
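One nuance there, if I recall the incident right: Route 53's DNS answering and health checks are data plane and kept serving; it was control-plane writes (changing records) that were blocked. Failover records that flip on a health check don't need a write at incident time, so they can still help. A boto3 sketch of creating such a record; the zone ID, health check ID and address are hypothetical:

    import boto3

    route53 = boto3.client("route53")
    route53.change_resource_record_sets(
        HostedZoneId="Z123EXAMPLE",          # hypothetical hosted zone
        ChangeBatch={"Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": "app.example.com.",
                "Type": "A",
                "SetIdentifier": "primary-us-east-1",
                "Failover": "PRIMARY",        # a matching SECONDARY record points elsewhere
                "TTL": 60,
                "ResourceRecords": [{"Value": "192.0.2.10"}],
                "HealthCheckId": "11111111-2222-3333-4444-555555555555",
            },
        }]},
    )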
If AWS fully decentralized its control planes, they'd essentially be duplicating the cost structure of running multiple independent clouds, and I understand that is why they don't. However, as long as AWS relies on us-east-1 to function, they have not achieved what they claim, to me. A single point of failure for IAM? Nah, no thanks.
Every AWS "global" service, be it IAM, STS, CloudFormation, CloudFront, Route 53, or Organizations, has deep ties to control systems originally built only in us-east-1/N. Virginia.
That's poor design, after all these years. They've had time to fix this.
Until AWS fully decouples the control plane from us-east-1, the entire platform has a global dependency. Even if your data plane is fine, you still rely on IAM and STS for authentication, and maybe Route 53 for DNS or failover, CloudFormation or ECS for orchestration...
If any of those choke because us-east-1's internal control systems are degraded, you're fucked. That's not true regional independence.
You can only decentralize your control plane if you don't have conflicting requirements?
Assuming you cannot alter requirements or SLAs, I could see how their technical options are limited. It's possible, just not without breaking their promises. At that point it's no longer a technical problem.
In the narrow distributed-systems sense? Yes, however those requirements are self-imposed. AWS chose strong global consistency for IAM and billing... they could loosen it at enormous expense.
The control plane must know the truth about your account and that truth must be globally consistent. That’s where the trouble starts I guess.
I think my old-school sysadmin ethos is just different from theirs. It's not about who's wrong or right, just a difference in opinions on how it should be done, I guess.
The ISP I work for requires us to design in a way that no single DC will be a point of failure; it's just a difference in design methods, and I have to remember that the DC I work in is used completely differently than AWS.
In the end, however, I know solutions for this exist (federated ledgers, CRDT-based control planes, regional autonomy), but they're expensive and they don't look good on quarterly slides; it just takes the almighty dollar to implement, and that goes against big business. If it "works", it works, I guess.
AWS's model scales to millions of accounts because it hides complexity, sure, but the same philosophy that enables that scale prevents true decentralization. That is shit. I guess people can architect as if us-east-1 can disappear so that things can continue on, but then that's AWS pushing complexity into your code. They are just shifting who shoulders that little-known issue.
npm registry also down
Docker is also down.
Also:
Snapchat, Ring, Roblox, Fortnite and more go down in huge internet outage: Latest updates https://www.the-independent.com/tech/snapchat-roblox-duoling...
To see more (from the first link): https://downdetector.com
this is why you avoid us-east-1
idiocracy_window_view.jpg
Can't check out on Amazon.com.au, gives error page
This link works fine from Australia for me.
But but this is a cloud, it should exist in the cloud.