AWS outage shows internet users 'at mercy' of too few providers, experts say

(theguardian.com)

233 points | by evolve2k 5 hours ago ago

162 comments

dang 2 hours ago

Related ongoing thread:

AWS Multiple Services Down in us-east-1 - https://news.ycombinator.com/item?id=45640838 -(1650 comments so far)

The 'experts' also made similar criticisms with the Fastly outage in 2021 and did anything obvious change as a result? In a week's time no national newspapers will be talking about this.

Meanwhile, everyone that spends actual time in these areas:

- Knows that running an operation at AWS scale is difficult and any armchair critism from 'experts' is exactly that. Actions speak louder than words.

- Understands that the cost of actually accounting for this kind of scenarios is incredibly high for the benefit in most cases

- Knows that genuinely 'critical' services (i.e. health) should be designed to account for this, and every other 'serious' issue such as 'I can't log in to Fortnite' just shows what the price and effort of actually making that work is versus how much it costs affected companies when it happens

- Knows how much time national newspapers spend actually talking about the importance of multi-region/multi-cloud redundancy, that is, it's zero until the one day where it happens and then it's old news

- Is just curious as to just what exactly happened from a technical perspective

This isn't to say that good blameless post-mortem shouldn't happen to figure out process and technical issues, but the armchair criticism with no actual followup? All noise, no signal.

[-]

imgabe 3 hours ago

The "experts" in this case are

> Dr Corinne Cath-Speth, the head of digital at human rights organisation Article 19

Dr. Cath-Speth has a PhD in cultural anthropology

> Cori Crider, the executive director of the Future of Technology Institute

A lawyer

> Madeline Carr, professor of global politics and cybersecurity at University College London

A professor. Her bio doesn't say what her degree is in, but she mostly seems to publish in political science and international relations

So, not a single technical expert. Not anyone who has ever run a hosting service before or even worked for one. Just people who write papers and sit around waiting for journalists to call them for quotes.

[-]

kopecs 3 hours ago

Do you not think it a bit too hyperbolic to throw scare quotes around experts and imply the only people who can have opinions on systemic risk are software engineers? I don't think it is unreasonable for people who haven't run or worked for a hosting service to have opinions on the policy aspect or economic impact of hyperscalers.

[-]

sunrunner 3 hours ago

> I don't think it is unreasonable for people who haven't run or worked for a hosting service to have opinions on the policy aspect or economic impact of hyperscalers.

Yeah, that's completely fair. My angle was more that firstly this doesn't come across as an opinion that needs the expert in question, and secondly this is yet another case of 'Talk is cheap, show me the code', particularly when quotes in the article include "We urgently need diversification in cloud computing."

I feel like the 'We' is doing an awful lot of heavy lifting and there's no mention of the costs of taking on such a task.

Additionally, and awkwardly, it's possible to be both a monopoly in the space but also technically a more stable solution, making the cost for competitors or people willing to use competitors doubly high.

Edit: Realised afer the fact I'm GP to your post, assumed it was mine, keeping the words anyway.

imgabe 3 hours ago

Anyone can have an opinion, I never said or implied otherwise. Having an opinion does not make one an expert, hence the scare quotes.

The headline is misleading because when there is news about experts saying something about technology, one would naturally think that they are at least somewhat technical experts. Instead the "expert" is the director of the "Big Tech is Bad Institute" who says that "Big Tech is Bad". And their qualification of being an expert is solely that they are director of the "Big Tech is Bad Institute".

[-]

ghaff 2 hours ago

And one would hope that the stats being quoted about desktop share were from someone who has been at that research firm in the last 20 years or so. I'm not sure how active he is at all at this point. I have a feeling someone looking for some stats found something old that may or may not have actually had a date on it.

(If I'm wrong mea culpa but I'm pretty sure.)

ufmace 2 hours ago

No, it's 100% appropriate. Anyone can have opinions on anything, but frankly, most of them have little relevance to reality. Their use of the word "expert" is supposed to mean the person has knowledge or expertise that renders their opinion on a subject substantially more valid and relevant than any regular person. That clearly is not the case here. If I wanted to know what a random person on the street thought about a subject, I could go ask one myself. The purpose of news organizations was supposed to be to better-inform people by getting opinions from actual relevant experts in a subject.

These people don't seem to have much ability to discuss relevant subjects like what the actual reliability of lower-tier hosting providers is, the value-add to business and iteration speed of having a variety of extra services (SQS, DynamoDB, VPC, RDS, managed K8s, etc) available, etc.

jimbokun 3 hours ago

I don’t think it’s useful at all.

What are they going to say that’s useful for making concrete technical decisions?

They can advise on how to write contracts for dealing with these situations after the fact, I suppose.

mhb 3 hours ago

Experts said that cloth masks would protect you from a deadly virus.

[-]

Waterluvian 2 hours ago

Right?! Same with seatbelts. I don’t wear mine because there’s obviously still automobile deaths. Experts said seatbelts would protect us from deadly accidents. What else are they wrong about?!

[-]

mhb 2 hours ago

That counterargument might make sense if seat belts were not generally protective in accidents or if experts were telling you to wear crepe paper seat belts instead of nylon ones because the nylon ones were needed elsewhere.

wagwang 3 hours ago

Opinions are valid but also worthless. Just give me a funny tweet to digest the situation.

zenoprax 3 hours ago

I think your third point is what I've had to attune to when criticizing cloud dependence. I think if your entire source of revenue is dependent on AWS then you should be prepared for 16+ hours of downtime per year. Individuals notice it more when something is down for hours but with good observability I am guessing the business notices it more when performance drags for the other 8742 hours of the year. Bursts of downtime per day can still be attributed to the device, wifi, ISP, or some other intermediary's DNS/BGP.

If your margins are so tight that 16 hours of downtime will bankrupt you then I think either: a) I have no idea how to run a business; or b) you have no idea how to run a business. I'm also biased because I love highly fault-tolerant, geo-redundant, durable systems much more than "good enough for this KPI".

[-]

sunrunner 2 hours ago

> but with good observability I am guessing the business notices it more when performance drags for the other 8742 hours of the year

This is really good point that aligns with my experience. Today's event was LOUD and (compared to other incidents) long, but perhaps not really that long compared to the situation you describe that for most businesses is going to be more pernicious.

Business intelligence and analytics-type folks at $DAYJOB are _very_ watchful for the year-on-year deviations and even periods where the prediction lines didn't match up for even just a few hours.

BrenBarn 3 hours ago

I think all of that is mostly irrelevant. You don't need to pay a huge cost to avoid the small benefit, you don't need every service to be resilient to this, or any of that. You just need multiple different providers so that not everyone gets screwed at once.

[-]

sunrunner 3 hours ago

But that would require companies to actually spend time and money testing and working with either a cross-provider multi-master-type system (with all the associated consistency headaches) or regularly test a functioning disaster-recovery/fallback system.

The time spent on that (let alone cost, for companies with large amounts of data) far outweighs the cost when a single region has an issue of today's scope. And you said it yourself, it's a 'small benefit'. Small benefits sound like exactly the things not worth spending time or money on.

For as much as many companies have had issues today, the daily reality is that these same companies haven't been having issues all the rest of the time (or this wouldn't have felt so shocking) and are likely to be okay with an outage of this scope (plus, everyone's too busy making noise about the issues to be working normally).

bamboozled 3 hours ago

Yes but we live in a highly anti-competitive monopolized world now. With more to come under the new admin.

[-]

hdgvhicv an hour ago

There’s two or three gartner approved ways of doing things for fortune 500 ctos, and f500 wannabes.

It’s not a monopoly but it’s close.

jdminhbg 3 hours ago

It’s hard to think of anything less monopolized than cloud hosting. There are hundreds of providers.

[-]

bamboozled 2 hours ago

Yeah right, and how many of them have any substantial customer base compared to AWS and Azure?

estimator7292 3 hours ago

For any business that matters, your choices are amazon, google, Microsoft, and that's about it.

I couldn't even name another provider except maybe Hetzner

[-]

bamboozled 2 hours ago

The three you mentioned have over 60% market share which is why this article exists at all. Knowing what I know about cloud ifnra, anyone who is actually anyone is hosting on the big three. So it's not just a market share, it's market share + impact / importance.

You could also argue that YT is on GCP (to some level) and that would probably bump that number up much higher.

The vast majority of people hosting things on the internet are on these providers. But you get downvoted for pointing that out now.

alecco 3 hours ago

> - Knows that running an operation at AWS scale is difficult and any armchair critism from 'experts' is exactly that. Actions speak louder than words.

NO. From their own reports, clearly AWS is too centralized and dependent on a specific region (us-east-1) and a specific service (DynamoDB). This has been observed for well over 10 years. Why do they stay in this centralized architecture? Cloud services need much higher standards than the average corporation. Just look how they took down 2000+ services for many hours.

[1] https://health.aws.amazon.com/health/status

[-]

inopinatus 3 hours ago

Even wearing my ex-AWS hat and understanding to some degree the internal complexity of these services, I too am boggled that foundational stuff is still out of Virginia and not a separately operated global region for the subset of control-plane dependencies that can’t be refactored into tolerating eventual consistency (such as parts of IAM).

We always used to talk a lot about minimising blast radius and there’s been enough time, and enough scale, to fix it.

Nevertheless the Guardian’s choice to label self-promoting policy wonks as “experts” is a cringe-inducing reminder that journalists don’t know anything about anything.

sunrunner 3 hours ago

I don't deny that an incident of this scope should prompt a serious technical and process review (and as you describe it, it sounds like this is long overdue), however how often does this kind of thing not affect 2000+ services? Companies should be tracking the time they don't have issues as much as the time they do in order to actually understand if they'd be better off elsewhere.

And to be clear, I'm not at all arguing for the monopolisation of cloud providers, only stating that it's easy to point from far away and say 'This is bad' while simultaneously not doing anything to understand the cost and make that change that you say is important, because it's actually costly (in many dimensions) to do.

A4ET8a8uTh0_v2 2 hours ago

Um.. you don't need to be an expert in security, comp.science or economics to know that putting all eggs in one basket may not be a great idea as introduces one giant systemic target. If anything, regular people here are uniquely qualified to say something along the lines of:

Oi, this is ridiculous. Maybe more things should be ran locally..

FWIW, it was instructive to me as to which companies were not able to function today.

hippo77 3 hours ago

These are Guardian 'experts' so can be safely ignored.

sysguest 2 hours ago

> - Knows that genuinely 'critical' services (i.e. health) should be designed to account for this

yeah but aws advertises as "trust me bro I won't go down for 99.99999%"

I've seen a lot of gov proposals using aws to 'get away with downtime management'

gnerd00 3 hours ago

maybe your VC overlords need a reality check?

free_bip 3 hours ago

Because the experts have no say in policy. The only people who have a say are the people bribing (sorry I mean "lobbying") Congress. And even they have very little say because Congress is currently on a hot streak of doing absolutely nothing.

labrador 4 hours ago

Kieran Healy @kjhealy@mastodon.social

Always worth taking sentences that use “the Cloud” or “the Internet” and try replacing those phrases with “A shed in Virginia” to see how they hold up. “Our service is fully based in a shed in Virginia”; “All my files are in a shed in Virginia”; “A shed in Virginia was designed to survive a nuclear war”, etc.

https://mastodon.social/@kjhealy/115407725852594322

[-]

SpicyLemonZest 4 hours ago

Sounds like a pretty good shed! Like a lot of pithy commentary on the cloud, this ignores the fact the practical alternative to a shed in Virginia for most businesses is a shelf in the supply closet. "Oops, Jim Bob tripped over the power cord, guess we won't get any emails until the IT guy shows up" - this used to be a routine experience.

[-]

gspencley 3 hours ago

> "Oops, Jim Bob tripped over the power cord, guess we won't get any emails until the IT guy shows up" - this used to be a routine experience.

You're not entirely wrong, but you're being hyperbolic too. I'm actually curious how old you are / how long you've worked in tech, because I started out pre-cloud and things weren't nearly as bad or as limited as you suggest.

First, on-prem servers are not the only alternative to "cloud." Many businesses, including the ones I worked for, did co-location. The companies owned their own bare metal servers, but would rent a rack in a data centre, and certain things - like the network admin - was entirely outsourced to the data centre / hosting company.

You could also rent managed bare metal servers (you still can). This means that you can pretty much outsource your entire IT department, but you're still not doing cloud services. Meaning you've got bare metal servers, someone you're paying at the hosting company is handling security updates and troubleshooting. You don't get things like auto-scaling or serverless or other cloud features, but you also don't have to worry about Jim tripping over the power cable either.

There's also still virtual servers. Which is basically a VM running on a server that hosts multiple clients.

All of this is to say that the alternative is not "cloud" or "box in a closet." The alternative is "cloud" and a ton of different server options: owned, rented, co-located, on-prem, dedicated, virtual, managed v un-managed (outsource IT vs admin your own) and the list goes on and on.

[-]

maccard 3 hours ago

We run a subset of our CI workload on on-prem workstations because the cost/performance ratio of consumer hardware is so much higher than servers. 1TB NVMe drive, with a 7950x/i9, 64GB RAM and gigabit networking is < $1000. It actually completes our CI job faster than AWS restarts a gpu instance.

100% of our failure rates with this machine have been "carpet cleaners unplugged the machine" in 2 years. Last year we had nobody in the office (due to carpet cleaning). This year we sent someone in straight after the cleaning to fix it.

SpicyLemonZest 2 hours ago

I've never managed IT professionally myself (pre-cloud or otherwise), so a lot of my information comes from family members who do, but my impression is that bare metal rental and colo centers weren't realistic options for any but the most technically sophisticated organizations. I know schools, stores, even research centers who went straight from on-prem to managed cloud with no real consideration for anything in between.

Spivak 2 hours ago

But is the distinction meaningful? The alternative to a shed in Virginia is a different shed in Montana? I mean sure there are a lot of different sheds out there but they're all still sheds. They're all shared responsibility models where the line is drawn in different areas, some outages will be because of your fuckup, some will be theirs.

Not saying as an industry we shouldn't diversify a little but it doesn't fundamentally change the relationship each company has to their hosting provider.

noir_lord 3 hours ago

Once had a site wide outage (biggish manufacturing company) of the internet and backup servers because one of the women wanted to plug her hair straighteners in for the xmas party.

In a surprise to literally no one that happening on the last friday before xmas break got my "We need to secure the main comms cabinet" (which had the backup server and main ingress for WAN and was in a separate building on other side of site) item that I'd been asking about for months to the top of the list.

Still one of my favourite "outages" because I got to my desk, turned PC on, no network, walked across the landing into the main office, opened comms cabinet, plugged it back in and was "resolved" before the MD got to my desk.

darkwater 3 hours ago

With a gazillion of shelves, closets, Jims and cables. So if Fortnite's Jim trips on a wire, Canva's Jim is quitely sipping coffee at his desk.

franz_vlkshp 3 hours ago

on the other hand, that's a small price to pay to having total control and physical access to your own infrastructure. if the sysadmin did his job properly, an incident like that shouldn't require anything else but to plug the server back in and hit the power switch. but then if he did his job properly, no one but IT should be tripping on power cables to begin with.

[-]

hdgvhicv an hour ago

My home infrastructure is immune from the “unplug” problem by being hosted on two different 10 year old £30 raspberry pis in different rooms.

But apparently that’s too hard for the average.

jacobsenscott 2 hours ago

Largely mitigated by twist lock sockets.

impure 5 hours ago

We already have diversification. You can rent a VPS from hundreds of possible companies. And people are very happy with them, it seems every month or two there’s a post here about how some company slashed their cloud bill by switching to a VPS. What we have here is a lock-in and marketing problem.

[-]

jasode 4 hours ago

>You can rent a VPS from hundreds of possible companies. And people are very happy with them, it seems every month or two there’s a post here about how some company slashed their cloud bill by switching to a VPS.

Companies are using higher-level "PaaS" suite of services from AWS such as DynamoDB, RedShift, etc and not just the lower-level "IaaS" such as basic EC2 instances or pure containers. Same "lock-in" situation with using the higher-level services from MS Azure and Google Cloud.

For those dependent on high-level services, migrating to a VPS like Hetzner or self-hosting is not possible unless they re-invent the AWS stack by installing/babysitting a bunch of open-source software. It's going to be a lot more involved than just installing a PostgreSQL db instance on a VPS.

[-]

SoftTalker 4 hours ago

> It's going to be a lot more involved

Yes, and you can't escape that by outsourcing it. The complexity is still there, and it will still bite you when your outsourcer fails to manage it.

[-]

candiddevmike 4 hours ago

Same thing applies to AWS...

[-]

throwaway894345 4 hours ago

I’m not really making a point here as much as an observation, but if my stack that I manage atop VMs in a data center goes down, my customers are pissed at me. If AWS goes down along with half the Internet, my customers are completely sympathetic.

[-]

candiddevmike 3 hours ago

Maybe just for you and after they realize it's part of the ongoing AWS outages, but for most folks, an outage is still their problem, and their SLA, regardless of if it's upstream from them.

TZubiri 4 hours ago

Amazon offers VPS as well, EC2 instances, were those affected? I think they weren't.

[-]

swiftcoder 4 hours ago

Our actual running instances were pretty much fine throughout, as was the RDS cluster, but we had no way to launch new instances (or auto-scale), and no way to invoke any of the other AWS services (IAM, SQS, Lambda, etc). Also no cloud watch logs/metrics for the duration, so limited visibility.

Overall not that bad for us, but if you had more high-level service dependencies, there would have been impact.

TYPE_FASTER 4 hours ago

> While most operations are recovered, requests to launch new EC2 instances (or services that launch EC2 instances such as ECS) in the US-EAST-1 Region are still experiencing increased error rates.

> We continue to investigate the root cause for the network connectivity issues that are impacting AWS services such as DynamoDB, SQS, and Amazon Connect in the US-EAST-1 Region. We have identified that the issue originated from within the EC2 internal network.

So, kinda? Some global services depend on us-east-1...

> Global services or features that rely on US-EAST-1 endpoints such as IAM updates and DynamoDB Global tables may also be experiencing issues.

Basically, you know it's going to be a bumpy day when us-east-1 has an issue because your ability to run across regions depends on what the issue is what the impact is.

morshu9001 5 hours ago

The expert opinions are more about geopolitics, like maybe don't have all your country's systems realtime depend on a foreign company.

If you are just one company whose goal is to maximize uptime without bringing in the complexity of multi-cloud, relying on AWS is reasonable. You probably won't get better uptime using something else, you'll only be down at different times than most others, which in most cases is actually worse.

[-]

kristianc 5 hours ago

For the kind of person being quoted, the stock in trade is not actually doing anything to fix it, it's in being the person quoted when something goes wrong.

dynamite-ready 5 hours ago

The whole industry walked straight into the cloud service lock-in trap. How would we begin to wind back? I also think Docker is as much to blame as the bigger cloud vendors.

[-]

spjt 3 hours ago

I don't think it wants to. Ask any on-call engineer or support tech how they felt when, after having their phone blow up at 1am because everything is falling apart, they found out that this was an AWS-wide outage.

Jcowell 5 hours ago

Why is docker to blame?

[-]

dynamite-ready 4 hours ago

It's subjective I guess, but I feel as though containerisation has greatly supported the large Cloud vendor's desire to subvert the more common model of computing... Like, before, your server was a computer, much like your desktop machine, and you programmed it much like your desktop machine.

But now, people are quite happy to put their app in a Docker container and outsource all design and architecture decisions pertaining to data storage and performance.

And with that, the likes of ECS, Dynamo, RedShift, etc, are a somewhat reasonable answer to that. It's much easier to offer a distinct proposition around that state of affairs, than say a market that was solely based on EC2-esque VMs.

What I did not like, but absolutely expected, was this lurch towards near enough standardising one specific vendor's model. We're in quite a strange place atm, where AWS specific knowledge might actually have a slightly higher value than traditional DevOps skills for many organisations.

Felt like this all happened both at the speed of light, and in slow motion, at the same time.

[-]

godelski 3 hours ago

Containers let me essentially build those machines but at the actual requirements I need for a particular system. So instead of 10 machines I can build 1. I then don't need to upgrade that machine if my service changes.

Its also more resilient because I can trash a container and load up a new one with low overhead. I can't really do that with a full machine. It also gives some more security by sandboxing.

This does lead to laziness by programmers accelerated by myopic management. "It works" except when it doesn't. Easy to say you just need to restart the container then to figure out the actual issue.

But I'm not sure what that has to do with cloud. You'd do the same thing self hosting. Probably save money too. Though I'm frequently confused why people don't do both. Self host and host in the cloud. That's how you create resilience. Though you also need to fix problems rather than restart to be resilient too.

I feel like our industry wants to move fast but without direction. It's like we know velocity matters but since it's easier to read the speedometer we pretend they're the same thing. So fast and slow makes sense. Fast by magnitude of the vector. Slow if you're measuring how fast we make progress in the intended direction.

pythonaut_16 3 hours ago

I don't see how Docker makes that worse.

Before Docker you had things like Heroku and Amazon Elastic Beanstalk with a much greater degree of lock in than Docker.

ECS and its analogues on the other cloud providers have very little lock in. You should be able to deploy your container to any provider or your own VM. I don't see what Dynamo and data storage have to do with that. If we were all on EC2s with no other services you'd still have to figure out how to move your data somewhere else?

Like I truly don't understand your argument here.

throwaway894345 3 hours ago

Containers have nothing to do with storage. They are completely orthogonal to storage (you can use Dynamo or RedShift from EC2), and many people run Docker directly on VMs. Plenty of us still spend lots of time thinking about storage and state even with containers.

Containers allow me to outsource host management. I gladly spend far less time troubleshooting cloud-init, SSH, process managers, and logging/metrics agents.

[-]

dynamite-ready 3 hours ago

> Containers have nothing to do with storage. They are completely orthogonal to storage

Exactly.

And sure, you can use S3/Dynamo/Aurora from an EC2 box, but what would be the point of that? Just get the app running in a container, and we can look into infrastructure later.

It's a very common refrain. That's why I believe Docker is strongly to linked the development of these proprietary, cloud based models of computing, that place containerisation at the heart of an ecosystem that bastardises the classic idea of a 'server'.

The existence of S3 is one good result of this. IAM, on the other hand, can die in dumpster fire. Though it won't...

ryandvm 4 hours ago

Man, I did not have "AWS us-east-1 will only have TWO 9s this year" on my bingo card.

[-]

aurumque 4 hours ago

For those of us who have been using AWS for almost 20 years now, I can't imagine why anyone would willingly choose us-east-1 for anything. It is the oldest, highest traffic, most critical path region and is subject to turbulence.

[-]

tlogan 4 hours ago

I think it is a little complicated. For example, your service might be using full failover but you use API from other service which are down.

Or you might use BART to come to work and you got stuck: https://www.kqed.org/news/12060687/bart-resumes-service-but-...

[-]

dingnuts 4 hours ago

ha! I saw another comment on here talking about how ec2 doesn't need to be held to the same standard as the power company because it's not as important as real infrastructure.

wish I'd already had this link in my back pocket. our industry needs to take its job, as a whole, much more seriously.

captainkrtek 3 hours ago

“Global” and “edge” services such as IAM, Route53, CloudFront and so on have dependencies on us-east-1, so even if you don’t think you do, you probably do.

interroboink 4 hours ago

By some logic, that would mean it is the most battle-tested and highest-stakes (and therefore most carefully-managed) choice. I.e. reasons in favor.

Not that I disagree with you, but maybe not for the reasons you say (:

[-]

swiftcoder 4 hours ago

> By some logic, that would mean it is the most battle-tested and highest-stakes (and therefore most carefully-managed) choice

As someone who used to work on the inside, us-east-1 has the biggest pile of legacy workarounds for internal AWS issues, it has a variety of legacy API behaviours that don't exist in other regions, and because everyone picks it as the default, it has significantly more pressure on contested resources (i.e. things like spot instance pools).

Plus since it's the default in all the tooling, if you ever decide to go multi-region, you'll find tons of things break right away.

morshu9001 4 hours ago

It can make sense to depend on the thing that will attract massive worldwide attention if/when it goes down. Or, more likely, it's just a default people don't change.

bongodongobob 3 hours ago

Well, we didn't, but some of our third party softwares did. Hard to avoid.

TZubiri 4 hours ago

Wait, was the whole region affected? Like even if you had an EC2 instance?

[-]

mads_quist 4 hours ago

No, we run on US East 1 but only EC2. Everything was running smoothly!

[-]

mads_quist 4 hours ago

Our strategy has always been to use as little higher abstractions from cloud providers as possible. Glad we went this way, saved us quite a bunch of SLA breaches today! I am confident to say that it's "best of both worlds". We get great availability zone redundancy by AWS without having to rely on and pay for all those PaaS stuff the cloud giants offer. Also, we can "fairly easy" migrate to any other cloud provider because we only need Debian instances running.

bigstrat2003 3 hours ago

Yes, it was. We have EC2 instances that we turn on as-needed, and at times were unable to start said instances.

neom 5 hours ago

Been a while since I worked in cloud but at least when I got out of it, the primitives where all shoring up to be generally very similar.

Did multi cloud redundancy end up being too expensive? Tech didn't line up enough? No good business case?

The elastic cloud story that never was? https://www.slideshare.net/slideshow/pets-vs-cattle-the-elas...

What happened?

[-]

LaurensBER 5 hours ago

The (cognitive) overhead of managing and deploying to multiple clouds usually isn't worth it for most teams. Hiring experts and maintaining knowledge about the ins and outs of two (or more) clouds is less feasible for small, fast moving teams.

Simplicity is linked to uptime and having a single cloud solution is a simpeler solution.

For large companies, its mostly cost savings. Easier to negotiate a good discount at N million versus N/2 million.

Besides that no-one ever got fired for picking AWS ;)

tadfisher 5 hours ago

Not a justifiable expense when no one else is resilient against their AWS region going down either. Also cross-cloud orchestration is quite dead because every provider is still 100% proprietary bullshit and the control plane is... kubernetes. We settled for kubernetes.

[-]

morshu9001 4 hours ago

Also if you can't even do cross region, cross cloud won't happen

[-]

dylan604 4 hours ago

Cross region isn't simple when you have terabytes of storage in buckets in a region. Building services in other regions without that data doesn't really do any good. Maintaining instances in various regions is easy, but it's that data that complicates everything. If you need to use the instances in a different region because your main region is down, you still can't do anything because those cross region instances can't access the necessary data.

[-]

Veserv 2 hours ago

Entire terabytes?! My god, I can only barely fit that onto a single SD card the size of my pinky nail.

It is quite bizarre that such paltry amounts of data and problems with such tiny scale seem to pose challenging problems when done in the cloud.

[-]

dylan604 2 hours ago

Such a sophomoric response. It does not matter how large your storage use is exactly. The point is that nobody is going to pay to replicate that data in multiple clouds or within multiple regions of the same cloud provider.

Btw, I'd love to have a link to where I could buy an SD card the size of a pinky nail that holds terabytes of data.

[-]

Veserv an hour ago

It absolutely matters how large your storage use is. Terabytes of storage is easily manageable on even basic consumer hardware. Terabytes of storage costs just hundreds of dollars if you are not paying the cloud tax.

If you got resiliency and uptime for a extra hundred dollars a year, that would be a no-brainer for any commercial operation. The byzantine kafkaesque horror of the cloud results in trivial problems and costs ballooning into nearly insurmountable and cost-ineffective obstacles.

These are not hard or costly problems or difficult scales. They have been made hard and costly and difficult.

BenjiWiebe 23 minutes ago

Yes they exaggerated, it takes several pinky nail sized cards to store several TB. Only 1TB per microSD.

jimbokun 2 hours ago

Data has a lot of gravity.

TZubiri 4 hours ago

Bottomline is that AWS gives you the tools to survive this outage within their own ecosystem.

If there's an issue with relying only on AWS it has not been expressed in this outage.

[-]

dylan604 4 hours ago

exactly what tools helps make your large volume of data stored in a down region available to other regions without duplicating the monthly storage fees?

[-]

morshu9001 4 hours ago

You duplicate the fees. But it's the same or worse trying to do multi cloud.

[-]

dylan604 3 hours ago

Which is precisely why it's not done

[-]

neom 3 hours ago

I seems to recall it was fairly common to have a read only versions of sites when there was a major outage - we did that a lot with deviantart in the early 2000s, did that fall out of favour or too complex with modern stacks or?

[-]

dylan604 2 hours ago

If only everything was a simple website. You're totally ignoring other types of workflows that would be impossible to use a read-only fall back. Not just impossible, but pointless.

toast0 4 hours ago

It seems that clouds balance their budget on egress charges... which leads to cross cloud communication being too expensive to setup multi cloud redundancy. Cross region redundancy is often too expensive too. Even cross availability zones is too expensive for some clouds and applications. (Cross region redundancy in a single cloud doesn't always work out, if the cloud has an outage on a global subsystem, or the broken subsystem gets pushed to multiple regions before exhibiting symptoms)

Additionally, moving your load to a different cloud can be challenging while one is down. It ends up being a lot of work that pays off for a few hours a year. For a lot of applications, it's better to just suffer the downtime and spend money on other things.

dylan604 4 hours ago

If you're a company providing services to people that already have data stored in VendorA's cloud, being on a different cloud would be expensive and prevent you from winning much work. If it turns out that VendorA happens to be the vendor for your clients, you build your services to run on VendorA's cloud too.

This is the situation for my company that started with the intent of being platform agnostic, but it quickly became much less complex as all of the potential client pool was using the same cloud. People with buckets with large amounts of data are not going to be able to convince the bean counters that it would be worth it to have that storage bill from multiple vendors.

[-]

conductr 3 hours ago

> are not going to be able to convince the bean counters that it would be worth it to have that storage bill from multiple vendors

Because it rarely is. Occasional downtime is just a cost of doing business. It is, or should be, rare enough that you just take it as it comes instead of trying to have a redundancy. We don't build tunnels everywhere as a backup for surface roads on snowy days. We just cancel school and work for the day and make up for it later. Do some important things get impacted? Sure, but most things are as mission critical as we make them out to be. The press coverage of an AWS outage makes it so easy to shrug it off and point fingers.

justapassenger 4 hours ago

There’s a huge difference between “similar” and “works and is ROI positive for my business across the whole lifecycle”.

Multi cloud redundancy is like Java being a solution to platform independency.

Analemma_ 5 hours ago

All the cloud providers have cheap compute but ludicrously expensive network egress. Trying to multicloud will stick you with a massive traffic bill, which is probably not a coincidence.

[-]

starman55 35 minutes ago

It really depends on how you will built it. You can architect it for multi cloud from top down where the client/browser talk to one region, With DNS with health check, and replication happens at the DB layer. Your services don't talk cross region at service level, so avoiding a lot of cross region/cloud communication. Most use cases can be addressed this way.

jamesblonde 4 hours ago

It's a market regulation failure. Which results in a failed market, with the cloud infra provider also providing data services. 20 years ago, there were 20+ widely used operational databases. Now, it's like DynamoDB with like half the market.

[-]

conductr 3 hours ago

How should this have played out in a regulated market? DynamoDB gets released, then what? Has limits on the market share it's allowed to steal?

Should we similarly cap say Front End frameworks on market penetration / growth? Is react too big to fail? Do we need to force some of it's users to use something else?

jimbokun 2 hours ago

What would these regulations say, exactly?

sumtechguy 5 hours ago

Many companies idea of a disaster plan is to make it after the disaster.

You have to build it in. That takes time money and training. Do you do failovers? Do they work? What is your backup situation? What is your list of work items to do during the failover? How long does it take? Do you even HAVE a failover plan? Can your services handle being in 'split brain'? Do you have specialty services that can only run in one place?

The unfortunate reality is this planning happens many times too late.

TZubiri 4 hours ago

It feels like a hat on a hat, cloud systems are already designed for redundancy, adding a redundant layer on top of that is like a double condom, or invesisting in multiple investment funds.

rubiquity 5 hours ago

Networking leaving the cloud provider (or even just to another zone on the same cloud) is $0.02 GB. That adds up real fast.

cpncrunch 4 hours ago

"The root cause is an underlying internal subsystem responsible for monitoring the health of our network load balancers."

https://health.aws.amazon.com/health/status?path=service-his...

dijit 5 hours ago

And we lean into it by saying "Well, if everyone else is down, I get a free pass".

(which, is not true in reality if you have ordinary customers).

KronisLV 3 hours ago

So, how many people will actually switch their setups to multi-cloud as a consequence of this? How many will move over to self-hosting? Or will they just do a post-incident report, wave hands around and do nothing?

Because I think it's very much the same way as it is with Cloudflare - while the large vendors aren't always openly hostile, we can just smile and hope that they don't get too keen on reminding us that they're holding us hostage.

I don't see that changing anytime soon. I've personally also used Hetzner, Contabo, Scaleway, Vultr, DigitalOcean, Time4VPS and some other platforms, but when people couple their setups to CF/AWS/GCP/Azure, typically that coupling is hard to get rid of and doing so is hard to justify.

[-]

SkyPuncher 3 hours ago

For most companies, I suspect this will actually re-affirm _not_ switching to multi-cloud.

Lots of businesses who will be completely forgotten as having an outage today because all of their customers were dealing with their own outages and outages in dozens of other providers.

Obviously, that doesn't fly for everyone.

1970-01-01 3 hours ago

GCP and Azure should be running a 10% sale/discount (Coupon code: RAINYDAY) for new accounts during the week of an AWS outage. The bean counters would take note.

jimbokun 2 hours ago

Nobody ever got fired for buying IBM…

…no, Microsoft…

…no, AWS.

xp84 3 hours ago

In 2011 there was some kind of big outage at some major AWS US-east pop. I started a job at a company (very boring B2C startup) which had taken the lesson from that, that "cloud anything is dangerous."

They went and bought a bunch of literal servers and installed them in a datacenter, 90 miles away from our offices, and this is where all our applications ran for the remainder of that company's existence (about 6 more years). For the whole time I was at that company, we had somewhat more, and usually more lengthy, outages than the average startup. The only difference is that when some piece of networking gear took a crap, or a disk failed, or whatever, our guys had to diagnose and resolve it (Their karma, I guess, since this was their idea).

Anyway, I do think it would be good if at least so-calld 'tech companies' had a little less obsession to outsource everything -- even easy things -- to AWS, GCP, and Azure. I feel that way mainly for cost reasons as many of these services are wildly overpriced. But also we shouldn't kid ourselves by ignoring the advantages of operating at the scale those guys do. They can afford to have multiple absolute wizards available around the clock who make sure that when a problem happens, it's not the kind of "S-show" we had at my old company where we're all on a slack room or zoom or whatever and just guessing at to try for half an hour before we can figure out what the actual issue is.

[-]

robomc 2 hours ago

This. And when a service goes down it's a lot easier to explain to your client/boss that "half the internet is down" than "our boutique solution is broken so it's just us actually".

999900000999 3 hours ago

I largely agree with you. When AWS goes down, for most situations I can just go outside and smoke a cigarette and not worry about it.

It's someone else's problem.

jimmar 4 hours ago

My company has been ahead of all of this by causing outages in our own data center without waiting for the cloud to do it for them.

On a serious note, resiliency takes effort and investment no matter where you host your content.

Aeolun 4 hours ago

It’s only a single region. If anything it shows how many people just double down on the default without any redundancy.

[-]

arbll 4 hours ago

A single region that is a SPOF for global AWS services*

[-]

starman55 31 minutes ago

Is us-east-2 services impacted today? which ones?

stronglikedan 4 hours ago

> It’s only a single region

Which was effectively the only region

binary132 5 hours ago

Wow, thanks experts! I never could have figured this out without you :)))

[-]

patrickmcnamara 4 hours ago

This article isn't written for you. It's written for my mom, etc.

[-]

hippo77 2 hours ago

Surprised to see an article like that even getting shared here. The Guardian seems to be wrong on almost every tech issue.

binary132 2 hours ago

Does your mother frequent hackernews?

esafak 4 hours ago

There has to be an Onion article for this.

[-]

01HNNWZ0MV43FF 4 hours ago

"No way to prevent this, says only region where this regularly happens"

JadoJodo 3 hours ago

> "Also in the UK, Ring users complained on social media that their doorbells were not working."

I sincerely hope that the base functionality of these doorbells (i.e., triggering the ringing of the bell within the home) is preserved in the event of an internet outage.

dabinat 3 hours ago

This is coming right after we switched back to AWS after trying to switch storage to Cloudflare R2. Even with this outage, I still consider AWS more reliable than Cloudflare.

physicsguy 5 hours ago

We don’t use AWS at work but we still experienced disruption because lots of our customers do, and use it to transfer data to us. That means we then saw an uplift in data transfers as their systems came back online.

There is no panacea. The reason many people use these is because it’s easy and hard to find people that know other clouds and their quirks.

racl101 5 hours ago

I find it weird many people are just realizing this. I've had this conversation with regards to talking about what should happen if a couple of bad earth quakes, not even "the big one", were to occur.

But on the other hand, maybe I hang around too many tech people to not empathically understand the other point of view.

[-]

morshu9001 5 hours ago

We've seen big outages already but nothing that lasts too long. If an outage became prolonged enough, people would find solutions. We don't know what this massive outage would even look like, so whatever preparation you do, it might still break.

Also there are some outages that affect real life like airlines, but tech news overstates some like Facebook. It turns out that FB and IG can be totally broken for a whole day, the world will keep spinning, and they won't even lose users.

jraph 4 hours ago

I think many (most?) non tech people don't even know that Amazon is first and foremost a cloud provider (and one of the biggest at that, if not the biggest) and that its market thing is almost a side activity at this point.

bee_rider 5 hours ago

US east is pretty geologically stable I think.

bix6 5 hours ago

Can someone educate me on the solution to this?

I assume most organizations, both small and large, just host on whatever provider they know or that costs them the least. If you have budget maybe you deploy to multiple providers for redundancy? But that increases cost and complexity.

Who’s going to bother with colo given the cost / complexity? Who’s going to run a server from their office given ISP restrictions and downtime fears?

What is the realistic antidote here?

[-]

98codes 5 hours ago

Companies can architect their backends to be able to fail back to another region in case of outage, and either don't test it or don't bother to have it in place because they can just blame Amazon, and don't otherwise have an SLA for their service.

To fix it, test your failback procedures. For everything else, there's nothing to fix, it's working by design.

[-]

maccard 5 hours ago

> Companies can architect their backends to be able to fail back to another region in case of outage, and either don't test it or don't bother to have it in place because they can just blame Amazon, and don't otherwise have an SLA for their service.

My CI was down for 2 hours this morning, despite not even being on AWS. We have a set of credentials on that host that we call assumeRole with and push to an S3 bucket, which has a lambda that duplicates to buckets in other regions. All our IAM calls were failing due to this outage, and we have 0 items deployed in us-east-1 (we're european)

[-]

cyberax 4 hours ago

You likely used a us-east-1 IAM endpoint instead of a regionalized one ( https://aws.amazon.com/blogs/security/how-to-use-regional-aw... ). We've been using it, and we're not experiencing any issues whatsoever in us-east-2.

One thing that AWS should do is provide an easier way to detect these hidden dependencies. You can do that with CloudTrail if you know how to do it (filter operations by region and check that none are in us-east-1), but a more explicit service would be nice.

ruuda 4 hours ago

Rent servers from a local provider. It's cheaper, you get more control over the hardware, but most of all, it avoids correlated failures.

[-]

jimbokun 2 hours ago

That only helps if their uptime is better than AWS.

cheeze 4 hours ago

On the flipside, then you have to maintain instances of everything.

For most of what I run these days, I'd rather just have someone else run and administer my database. Same with my load balancers. And my Kubernetes cluster. I don't really care if there is an outage every 2 years.

gytisgreitai 5 hours ago

What cost? Complexity - yes, to some extent.

tayo42 4 hours ago

If the cost is worth the complexity then you just do it. Otherwise you don't. How much did a company lose today compared to how much it costs to set it up

And colo and datacenters aren't immune to going down

0xbadcafebee 5 hours ago

This is what I call "fool's availability": reducing single points of failure (one cloud provider) without adding any actual redundancy.

If you removed AWS/GCP/Azure/etc and just had 100 small providers scattered all over, the result would be hundreds of outages throughout the year, as opposed to one big outage every other year [in one region]. AWS is already way more reliable than any other provider.

The real problem here is that companies that use AWS are morons who don't know how to architect/build infrastructure properly.

If it's important, it should be built right, regardless of who the provider is. A software building code would mandate how companies could use infrastructure (AWS or any provider) so that important services would not go down when one service or region goes down.

This is the basic concept behind things like the electrical code. It doesn't matter how great a public utility is; if your business is wired up so badly that a stiff breeze sets it on fire, just switching utilities isn't gonna help. And some utilities do occasionally have problems that persist down their lines to the customers, so customers need to set up equipment to protect against those failures. Whole-house surge protectors, lightning arresters, EMP shields, etc are necessary so that a rare event doesn't fry expensive customer equipment.

[-]

morshu9001 4 hours ago

Yes but most of those companies aren't morons, they're just taking an acceptable risk. Multi-region or multi-cloud setup is nontrivial.

dudeinjapan 5 hours ago

Its probably worse—a given stack using multiple of these small providers will probably have more “single points of failure” (providers used in series rather than parallel.)

(If most companies liked using cloud providers in parallel, they’d already be doing it today between AWS, Azure, and GCP.)

greenavocado 4 hours ago

The only reason we can't leave AWS is because we have 500 terabytes of data in S3

[-]

jewel 4 hours ago

Talk to the other vendors. I know of a place that had about that same amount and decided to have a redundant copy of all of their data in another vendor's S3-compatible product. That vendor paid for all of their egress fees as long as they signed a 12-month contract and used their tool for the migration.

[-]

coredog64 3 hours ago

AWS will credit your egress fees if you incur them via leaving.

https://aws.amazon.com/blogs/aws/free-data-transfer-out-to-i...

ovaistariq 3 hours ago

What other AWS services do you depend on?

[-]

greenavocado 3 hours ago

Mostly EC2 for data mining terabytes of historical data stored in S3. Production usage is fairly lightweight compared to the EC2 and S3 stuff. We did cut our bill a lot by moving to single AZ redundancy.

labrador 5 hours ago

This new post is interesting: https://news.ycombinator.com/item?id=45646777

"October 17, 2025, was my last day at Amazon Web Services... CloudFront is a CDN, a content delivery network, or, simply put, a large distributed cache for your cat photos. And a very successful one. Something like 30% of all internet traffic goes through CloudFront in one way or another. Pretty cool, huh? In practice, this means that with any change, you have a chance of crashing 30% of the internet."

[-]

Yokolos 5 hours ago

Ngl, that sounds like my dream job.

ta1243 4 hours ago

> Something like 30% of all internet traffic goes through CloudFront in one way or another. Pretty cool, huh?

No. No its not. But tech enthusiasts on HN and Reddit love it.

(Another 30% runs through cloudflare)

halis 5 hours ago

Just need to retire the us-east-1 region, it's becoming a meme at this point.

boznz 4 hours ago

I've really got to get me one of these 'expert' job gigs!

saltysalt 3 hours ago

AWS is this generation's mainframe. /joking

bitpatch 3 hours ago

The "experts" should lay out a good alternative in that case. Smaller providers also run into outages.

[-]

dexterdog 19 minutes ago

And they all get to claim that they have better uptime to potential customers because nobody other than their current customers remembers their outages.

spullara 2 hours ago

providers should stop using just us-east-1 like idiots.

midtake 3 hours ago

There are many public clouds and VPS providers out there. Who the fuck are these experts?

The real issue is that business pricks will cut costs and single-homing in a single availability zone will be the only workable solution.

On top of that, infrastructure ops are seen as a nuisance who get in the way of the sexy stuff like shipping your latest code changes now. If you complicate the ops pipeline that gets in the way of sexy dev work. So fuck that just ship lol!

heavyset_go 4 hours ago

It makes us vulnerable to a centrality attack either foreign or domestic. If someone wants to fuck society up, only a handful of data centers, routers, networking junctions, etc could do it.

Jzush 5 hours ago

If only there was a system of computers on the Internet that was distributed across the world where we could host things instead of all in one location. We could call it the "cloud".

[-]

ta1243 4 hours ago

We could connect distributed computers on distributed networks together using some form of internetworking protocol.

judahmeek 2 hours ago

I recall reading that when the costs of distribution (but not the costs of discoverability) are low, generally you end up with a power law sort of distribution of consumers to providers, where provider #1 has exponentially more market share than provider #2 and provider #2 has exponentially more market share than provider #3, #4, etc.

Examples of this are Windows/Mac, McDonalds/Burger King, Playstation/Xbox, Nvidia/?, AWS/Azure?, Android/iPhone, etc...

Basically, the majority of users all using the same dependency/platform/product is basic economics.

ChrisArchitect 5 hours ago

More discussion: https://news.ycombinator.com/item?id=45640838

shadowgovt 4 hours ago

Sure. Are the "experts" going to pony up the cash to build in redundancy, or change the market fundamentals that make it make more sense for a startup to rush to product on a shoestring and then keep adding features instead of building against not-yet-happened failure modes?

If not, I look forward to the next single-point-of-failure outage. And the next. And the next.