But the software engineering profession as a whole would benefit from negotiating concessions for oncall. We have normalized work interfering with life so the company can squeeze a couple extra millions from ads. And for what?
Nontrivial amount of ad revenue lost? Not my problem if the company can't pay me to mitigate.
> Nontrivial amount of ad revenue lost? Not my problem if the company can't pay me to mitigate.
Interestingly, when I worked on analytics around bugs we found that often (in the ads space), there actually wasn't an impact when advertisers were unable to create ads, as they just created all of them when the interface started working again.
Now, if it had been the ad serving or pacing mechanisms then it would've been a lot of money, but not all outages are created equal.
Not all websites are for shitposting. I can’t talk to my clients for whom I am on call because Signal is down. I also can’t communicate with my immediate family. There are tons of systems positively critical to society downstream from these services.
Trouble is, one can't fully escape us-east-1. Many services are centralized there: S3, Organizations, Route 53, CloudFront, etc. It is THE main region, hence it suffers the most outages and, more importantly, the most troubling ones.
We're mostly deployed on eu-west-1 but still seeing weird STS and IAM failures, likely due to internal AWS dependencies.
Also we use Docker Hub, NPM and a bunch of other services that are hosted by their vendors on us-east-1 so even non AWS customers often can't avoid the blast radius of us-east-1 (though the NPM issue mostly affects devs updating/adding dependencies, our CI builds use our internal mirror)
FYI:
1. AWS IAM mutations all go through us-east-1 before being replicated to other public/commercial regions. Read/List operations should use local regional stacks. I expect you'll see a concept of "home region" give you flexibility on the write path in the future.
2. STS has both global and regional endpoints. Make sure you're set up to use regional endpoints in your clients: https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credenti...
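For point 2, a minimal sketch of what that looks like with boto3, assuming the documented AWS_STS_REGIONAL_ENDPOINTS SDK setting; the region is a placeholder for wherever you actually run:

    import os
    import boto3

    # Prefer regional STS endpoints over the legacy global endpoint
    # (sts.amazonaws.com), which is homed in us-east-1. This can also go
    # in ~/.aws/config as "sts_regional_endpoints = regional".
    os.environ["AWS_STS_REGIONAL_ENDPOINTS"] = "regional"

    # Pin the client to the region you run in (placeholder below).
    sts = boto3.client("sts", region_name="eu-west-1")

    # Sanity check: this should now point at sts.eu-west-1.amazonaws.com,
    # and the call (needs credentials) should succeed without us-east-1.
    print(sts.meta.endpoint_url)
    print(sts.get_caller_identity()["Arn"])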
us-east-1 was, probably still is, AWS' most massive deployment. Huge percentage of traffic goes through that region. Also, lots of services backhaul to that region, especially S3 and CloudFront. So even if your compute is in a different region (at Tower.dev we use eu-central-1 mostly), outages in us-east-1 can have some halo effect.
This outage seems really to be DynamoDB related, so the blast radius in services affected is going to be big. Seems they're still triaging.
I recommend my clients not use us-east-1. It's the oldest region and the most prone to outages. I usually recommend us-east-2 (Ohio) unless they require the West Coast.
>Oct 20 12:51 AM PDT We can confirm increased error rates and latencies for multiple AWS Services in the US-EAST-1 Region. This issue may also be affecting Case Creation through the AWS Support Center or the Support API. We are actively engaged and working to both mitigate the issue and understand root cause. We will provide an update in 45 minutes, or sooner if we have additional information to share.
Weird that case creation depends on the same region as the one you'd like to create a case about.
This is from Amazon's latest earnings call, when Andy Jassy was asked why they aren't growing as much as their competitors:
"I think if you look at what matters to customers, what they care they care a lot about what the operational performance is, you know, what the availability is, what the durability is, what the latency and throughput is of of the various services. And I think we have a pretty significant advantage in that area."
also
"And, yeah, you could just you just look at what's happened the last couple months. You can just see kind of adventures at some of these players almost every month. And so very big difference, I think, in security."
In moments like this I think devs should invest in vendor independence if they can. While I'm not at that stage yet (Cloudflare dependence), using open technologies like Docker (or Kubernetes) and Traefik instead of managed services can help in disaster situations like this, by letting you switch to a different provider much faster than rebuilding from zero.
As a disclosure, I'm still not at that point with my own infrastructure, but I'm slowly trying to define one for myself.
"Oct 20 2:01 AM PDT We have identified a potential root cause for error rates for the DynamoDB APIs in the US-EAST-1 Region. Based on our investigation, the issue appears to be related to DNS resolution of the DynamoDB API endpoint in US-EAST-1..."
Maybe unrelated, but yesterday I went to pick up my package from an Amazon Locker in Germany, and the display said "Service unavailable". I'll wait until later today before I go and try again.
You would think that after the previous big us-east-1 outages (to be fair there have been like 3 of them in the past decade, but still, that's plenty), companies would have started to move to other AWS regions and/or to spread workloads between them.
I seem to recall other issues around this time in previous years. I wonder if this is some change getting shoe-horned in ahead of some reinvent release deadline...
I get the impression that this has been thought about to some extent, but it's a constantly changing architecture with new layers and new ideas being added, so for every bit of progress there's the chance of new single points of failure being added. This time it seems to be a DNS problem with DynamoDB.
Happened to be updating a bunch of NPM dependencies and then saw `npm i` freeze, and I'm like... ugh, what did I do? Then npm login wasn't working, and I started searching here for an outage, and voilà.
Displaying and propagating accurate error messages is an entire science unto itself... I can see why it's sometimes sensible to invest resources elsewhere and fall back to 'something'.
Maybe actually making the interviews less of a hazing ritual would help.
Hell, maybe making today's tech workplace more about getting work done instead of the series of ritualistic performances that the average tech workday has degenerated to might help too.
Ergo, your conclusion doesn't follow from your initial statements, because interviews and workplaces are both far more broken than most people, even people in the tech industry, would think.
Well, it looks like if companies and startups did a better job of hiring for actual distributed-systems skills, rather than hazing candidates over the wrong ones, we wouldn't be in this outage mess.
Many companies on Vercel don't seem to think they need a strategy for being resilient to these outages.
I rarely see Google, Ably, and others who are serious about distributed systems go down.
It's fun to see SREs jumping left and right when they can do basically nothing at all.
"Do we enable DR? Yes/No". That's all you can do. If you do, it's a whole machinery starting, which might take longer than the outage itself.
They can't even use Slack to communicate - messages are being dropped/not sent.
And then we laugh at the South Koreans for not having backed up their hard drives (which got burnt by an actual fire, a statistically far rarer event than an AWS outage). OK, that's a huge screw-up, but hey, this is not insignificant either.
What will happen now? Nothing, like nothing happened after Crowdstrike's bug last year.
AWS CodeArtifact can act as a proxy and fetch new packages from npm when needed. A bit late for that now, but sharing in case you want to future-proof against the yearly us-east-1 outage.
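If you go that route, here is a rough sketch of wiring npm to a CodeArtifact repository with boto3. The domain/repository names and region are placeholders, and the repository is assumed to have an external connection to npmjs so it acts as a pull-through cache; `aws codeartifact login --tool npm` does roughly the same thing in one CLI call.

    import boto3

    # Placeholders: substitute your own CodeArtifact domain/repository.
    DOMAIN = "my-domain"
    REPO = "npm-mirror"

    ca = boto3.client("codeartifact", region_name="eu-central-1")

    endpoint = ca.get_repository_endpoint(
        domain=DOMAIN, repository=REPO, format="npm"
    )["repositoryEndpoint"]

    token = ca.get_authorization_token(
        domain=DOMAIN, durationSeconds=43200
    )["authorizationToken"]

    # Print the npm config commands instead of editing ~/.npmrc directly.
    registry_path = endpoint.removeprefix("https://")
    print(f"npm config set registry {endpoint}")
    print(f"npm config set //{registry_path}:_authToken {token}")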
1) GDPR is never enforced other than with token fines based on technicalities. The vast majority of the cookie banners you see around are not compliant, so if the regulation were actually enforced, those would be the first to go... and it would be much easier to go after them (they are visible) than to audit every company's internal codebases to check whether they're sending data to a US-based provider.
2) you could technically build a service that relies on a US-based provider while not sending them any personal data or data that can be correlated with personal data.
Let's be nice. I'm sure devs and ops are on fire right now, trying to fix the problems. Given the audience of HN, most of us could have been (have already been?) in that position.
Affecting Coinbase[1] as well, which is ridiculous. Can't access the web UI at all. At their scale and importance they should be multi-region if not multi-cloud.
Seems the underlying issue is with DynamoDB, according to the status page, which will have a big blast radius in other services. AWS' services form a really complicated graph and there's likely some dependency, potentially hidden, on us-east-1 in there.
It might be an interesting exercise to map how many of our services depend on us-east-1 in one way or another. One can only hope that somebody would do something with that intel, even though it's not a feature that brings money in (at least from a business perspective).
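A rough, hedged way to start that mapping: ask the SDK which endpoint each client would actually call from your primary region and flag the partition-global ones, which are typically homed in us-east-1. The service list and region below are placeholders for whatever you actually use.

    import boto3

    SERVICES = ["iam", "route53", "cloudfront", "organizations",
                "s3", "dynamodb", "sts"]
    REGION = "eu-west-1"  # placeholder: your primary region

    session = boto3.Session(region_name=REGION)
    for svc in SERVICES:
        # Resolve the endpoint each client would call; no API call is made.
        endpoint = session.client(svc).meta.endpoint_url
        regional = REGION in endpoint
        label = "regional" if regional else "GLOBAL (likely us-east-1)"
        print(f"{svc:15} {endpoint:60} {label}")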
It's weird that we're living in a time where this could be a taste of a prolonged future global internet blackout by adversarial nations. Get used to this feeling I guess :)
This reminds me of the twitter-based detector we had at Facebook that looked for spikes in "Facebook down" messages.
When Facebook went public, the detector became useless because it fired anytime someone wrote about the Facebook stock being down and people retweeted or shared the article.
I invested just enough time in it to decide it was better to turn it off.
Not having control or not being responsible are perhaps major selling points of cloud solutions. To each their own, I also rather have control than having to deal with a cloud provider support as a tiny insignificant customer. But in this case, we can take a break and come back once it's fixed without stressing.
Just a couple of days ago, in this HN thread [0], there were quite a few users claiming Hetzner is not an option because its uptime isn't as good as AWS's, hence the higher AWS pricing is worth the investment. Oh, the irony.
[0]: https://news.ycombinator.com/item?id=45614922
When AWS is down, everybody knows it. People don’t really question your hosting choice. It’s the IBM of cloud era.
On the other side of that coin, I am excited to be up and running while everyone else is down!
That is 100% true. You can't be fired for picking AWS... But I doubt it's the best choice for most people. Sad but true.
You can't be fired, but you burn through your runway quicker. No matter which option you choose, there is some exothermic oxidative process involved.
AWS is smart enough to throw you a few mill credits to get you started.
MILL?!
I only got €100.000, limited to a year, then a 20% discount on spend in the next year.
(I say "only" because a few mill certainly would be a sweeter pill; €100.000 in "free" credits is already enough to get you hooked, because you can really feel the free-ness in the moment.)
Schrödinger's user:
Simultaneously too confused to be able to make their own UX choices, but smart enough to understand the backend of your infrastructure enough to know why it doesn't work and excuses you for it.
The morning national TV news (BBC) was interrupted with this as breaking news, about how many services (specifically Snapchat, for some reason) are down because of problems with "Amazon's Web Services, reported on DownDetector".
I liked your point though!
Well, at that level of user they just know "the internet is acting up this morning"
I thought we didn't like when things were "too big to fail" (like the banks being bailed out because if we didn't the entire fabric of our economy would collapse; which emboldens them to take more risks and do it again).
Usually, 2 founders creating a startup can't fire each other anyway so a bad decision can still be very bad for lots of people in this forum
That depends on the service. Far from everyone is on their PC or smartphone all day, and even fewer care about these kinds of news.
To back up this point, currently BBC News have it as their most significant story, with "live" reporting: https://www.bbc.co.uk/news/live/c5y8k7k6v1rt
This is alongside "live" reporting on the Israel/Gaza conflict as well as news about Epstein and the Louvre heist.
This is mainstream news.
I like how their headline starts with Snapchat and Roblox being affected.
The journalist found out about it from their tween.
100%. When AWS was down, we'd say "AWS is down!", and our customers would get it. Saying "Hetzner is down!" raises all sorts of questions your customers aren't interested in.
I've run a production application off Hetzner for a client for almost a decade, and I don't think I have ever had to tell them "Hetzner is down", apart from planned maintenance windows.
Most people don't even know AWS exists.
And yet they still all activate their on-call people (wait, why do we have them if we are on the cloud?) to do... nothing at all.
As a data point, I've been running stuff at Hetzner for 10 years now, in two datacenters (physical servers). There were brief network outages when they replaced networking equipment, and exactly ONE outage for hardware replacement, scheduled weeks in advance, with a 4-hour window and around 1-2h duration.
It's just a single data point, but for me that's a pretty good record.
It's not because Hetzner is miraculously better at infrastructure, it's because physical servers are way simpler than the extremely complex software and networking systems that AWS provides.
You can argue about Hetzner's uptime, but you can't argue about Hetzner's pricing, which is hands down the best there is. I'd rather go with Hetzner and cobble together some failover than pay AWS extortion.
For the price of AWS you could run Hetzner, plus a second provider for resiliency, and still make a large saving.
Your margin is my opportunity indeed.
I switched to netcup for even cheaper private vps for personal noncritical hosting. I'd heard of netcup being less reliable but so far 4 months+ uptime and no problems. Europe region.
Hetzner has the better web interface and supposedly better uptime, but I've had no problems with either. Web interface not necessary at all either when using only ssh and paying directly.
I've been running my self-hosting stuff on Netcup for 5+ years and I don't remember any outages. There probably were some, but they were not significant enough for me to remember.
I am on Hetzner with a primary + backup server and on Netcup (Vienna) with a secondary. For DNS I am using ClouDNS.
I think I am more distributed than most of the AWS folks, and it still is way cheaper.
Exactly. Hetzner is the equivalent of the original Raspberry Pi. It might not have all fancy features but it delivers and for the price that essentially unblocks you and allows you to do things you wouldn't be able to do otherwise.
They've been working pretty hard on those extra features. Their load balancing across locations is pretty decent for example.
> I'd rather go with Hetzner and cobble together some failover than pay AWS extortion.
Comments like this are so exaggerated that they risk moving the goodwill needle back to where it was before. Hetzner offers no service that is similar to DynamoDB, IAM or Lambda. If you are going to praise Hetzner as a valid alternative during a DynamoDB outage caused by DNS configuration, you would need to a) argue that Hetzner is a better option regarding DNS outages, and b) argue that Hetzner is a preferable option for those who use serverless offerings.
I say this as a long-time Hetzner user. Hetzner is indeed cheaper, but don't pretend that Hetzner lets you click your way into a highly-available NoSQL data store. You need a non-trivial amount of your own work to develop, deploy, and maintain such a service.
> but don't pretend that Hetzner lets you click your way into a highly-available NoSQL data store.
The idea you can click your way to a highly available, production configured anything in AWS - especially involving Dynamo, IAM and Lambda - is something I've only heard from people who've done AWS quickstarts but never run anything at scale in AWS.
Of course nobody else offers AWS products, but people use AWS for their solutions to compute problems and it can be easy to forget virtually all other providers offer solutions to all the same problems.
Are you Netflix? Because if not, there's a 99% probability you don't need any of those AWS services and just have a severe case of shiny-object syndrome in your organisation.
Plenty of heavy-traffic, high-redundancy applications exist without the need for AWS's (or any other cloud provider's) overpriced "bespoke" systems.
To be honest, I don't trust myself to run an HA PostgreSQL setup with correct backups without spending an exorbitant effort investigating everything (weeks/months) - do you? I'm not even sure what effort that would take. I can't remember the last time I worked with an unmanaged DB in prod where I did not have a dedicated DBA/sysadmin. And I've been doing this for 15 years now. AFAIK Hetzner offers no managed database solution. I know they offer some load balancer, so there's that at least.
At some point in the scaling journey bare metal might be the right choice, but I get the feeling a lot of people here trivialize it.
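To make the "non-trivial work" point concrete, here is one tiny slice of the plumbing you end up owning when self-managing HA Postgres: a replication/lag probe, sketched with psycopg2. Hostnames and credentials are placeholders; a real setup would also need backups, failover orchestration, fencing, and so on.

    import psycopg2

    # Placeholders: your primary and replica hosts.
    NODES = ["pg-primary.internal", "pg-replica.internal"]

    for host in NODES:
        conn = psycopg2.connect(host=host, dbname="app", user="monitor",
                                password="...", connect_timeout=3)
        with conn, conn.cursor() as cur:
            # True on a streaming replica, False on the primary.
            cur.execute("SELECT pg_is_in_recovery()")
            is_replica = cur.fetchone()[0]
            if is_replica:
                # Approximate replay lag in seconds.
                cur.execute(
                    "SELECT EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp())"
                )
                lag = cur.fetchone()[0] or 0.0
                print(f"{host}: replica, replay lag ~{lag:.1f}s")
            else:
                print(f"{host}: primary")
        conn.close()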
> Plenty of heavy-traffic, high-redundancy applications exist without the need for AWS's (or any other cloud provider's) overpriced "bespoke" systems.
And almost all of them need a database, a load balancer, maybe some sort of cache. AWS has got you covered.
Maybe some of them need some async periodic reporting tasks. Or to store massive files or datasets and do analysis on them. Or transcode video. Or transform images. Or run another type of database for a third party piece of software. Or run a queue for something. Or capture logs or metrics.
And on and on and on. AWS has got you covered.
This is Excel all over again. "Excel is too complex and has too many features, nobody needs more than 20% of Excel. It's just that everyone needs a different 20%".
If you need the absolutely stupid scale DynamoDB enables, what is the difference compared to running, for example, FoundationDB on your own on Hetzner?
You will in both cases need specialized people.
> Hetzner offers no service that is similar to DynamoDB, IAM or Lambda.
The key thing you should ask yourself: do you need DynamoDB or Lambda? Like "need need" or "my resume needs Lambda".
Well, Lambda scales down to 0 so I don't have to pay for the expensive EC2 instan... oh, wait!
TBH, in my last 3 years with Hetzner, I never saw any downtime on my servers other than my own routine maintenance for OS updates. Location: Falkenstein.
You really need your backup procedures and failover procedures though, a friend bought a used server and the disk died fairly quickly leaving him sour.
THE disk?
It's a server! What in the world is your friend doing running a single disk???
At a bare minimum they should have been running a mirror.
And I have seen them delete my entire environment including my backups due to them not following their own procedures.
Sure, if you configure offsite backups you can guard against this stuff, but with anything in life, you get what you pay for.
We've been running our services on Hetzner for 10 years, never experienced any significant outages.
That might be datacenter dependent of course, since our root servers and cloud services are all hosted in Europe, but I really never understood why Hetzner is said to be less reliable.
I work at a small / medium company with about ~20 dedicated servers and ~30 cloud servers at Hetzner. Outages have happened, but we were lucky that the few times it did happen, it was never a problem / actual downtime.
One thing to note is that there are some scheduled maintenances where we needed to react.
I don't have an opinion either way, but for now, this is just anecdotal evidence.
Looks fine for pointing out an irony.
In some ways yes. But in some ways this is like saying it's more likely to rain on your wedding day.
My recommendation is to use AWS, but not the US-EAST-1 region. That way you get the benefits of AWS without the instability.
AWS has internal dependencies on US-EAST-1.
Admittedly they're getting fewer and fewer, but they exist.
The same is also true in GCP, so as much as I prefer GCP from a technical standpoint: the truth is, if you can't see it, it doesn't mean it goes away.
We have nothing deployed in us east 1, yet all of our CI was failing due to IAM errors this morning.
I'm not affiliated and won't be compensated in any way for saying this: Hetzner are the best business partners ever. Their service is rock solid, their pricing is fair, their support is kind and helpful.
Going forward I expect American companies to follow this European vibe; it's like the opposite of enshittification.
Stop making things up. As someone who commented on the thread in favour of AWS, there is almost no mention of better uptime in any comment I could find.
I could find one or two downvoted or heavily criticized comments, but I can find more people mentioning the opposite.
Finally IT managers will start understanding that the cloud is no different from Hetzner.
Well, we have a naming issue (Hetzner also has Hetzner Cloud; it looks like people still equate "cloud" with the three biggest public cloud providers).
In any case, for this to happen, someone would have to collect reliable data (not all big cloud providers like to publish precise data; they usually downplay outages and use weasel words like "some customers... in some regions... might have experienced" just to avoid admitting they had an outage) and present stats comparing the availability of Hetzner Cloud vs the big three.
When things go wrong, you can point at a news article and say it's not just us that has been affected.
I tried that but Slack is broken and the message hasn't got through yet...
I got a downvote already for pointing this out :’)
Unfortunately, HN is full of company people, you can't talk anything against Google, Meta, Amazon, Microsoft without being downvoted to death.
Isn't it just ads?
AWS and Cloudflare are HN darlings. Go so far as to even suggest a random personal blog doesn't need Cloudflare and you get downvoted with inane comments such as "but what about DDoS protection?!"
The truth is no one under the age of 35 is able to configure a webserver any more, apparently. Especially now that static site generators are in vogue and you don't even need to worry about php-fpm.
Can't fully agree. People genuinely detest Microsoft on HN and all over the globe. My Microsoft-related rants are always upvoted to the skies.
“Based on our investigation, the issue appears to be related to DNS resolution of the DynamoDB API endpoint in US-EAST-1. We are working on multiple parallel paths to accelerate recovery.”
It’s always DNS.
I wonder how much of this is "DNS resolution" vs "underlying config/datastore of the DNS server is broken". I'd expect the latter.
... wonders if the dns config store is in fact dynamodb ...
I don’t think it is DNS. The DNS A records were 2h before they announced it was DNS but _after_ reporting it was a DNS issue.
Or expired domains which I suppose is related?
Someone probably failed to lint the zone file.
DNS strikes me as the kind of solution someone designed thinking “eh, this is good enough for now. We can work out some of the clunkiness when more organizations start using the Internet.” But it just ended up being pretty much the best approach indefinitely.
Even when it's not DNS, it's DNS.
It's always US-EAST-1 :)
Might just be BGP dressed as DNS
Oh no... maybe LaLiga found out pirates are hosting on AWS?
This is how I discover that it's not just Serie A doing these shenanigans. I'm not really surprised.
All the big leagues take "piracy" very seriously and constantly try to clamp down on it.
TV rights are one of their main revenue sources, and they're expected to always go up, so they see "piracy" as a fundamental threat. IMO, it's a fundamental misunderstanding on their side, because people "pirating" usually don't have a choice - either there is no option for them to pay for the content (e.g. the UK's 3pm blackout), or it's too expensive and/or too spread out. People in the UK have to pay for 3-4 different subscriptions to access all local games.
The best solution, by far, is what France's Ligue 1 just did (out of necessity, though; nobody was paying them what they wanted for the rights after the previous debacles): the Ligue 1+ streaming service, owned and operated by them, which you can get access to in a variety of ways (regular old paid TV channel, on Amazon Prime, on DAZN, via beIN Sports), whichever suits you best. Same acceptable price for all games.
As this incident unfolds, what’s the best way to estimate how many additional hours it’s likely to last? My intuition is that the expected remaining duration increases the longer the outage persists, but that would ultimately depend on the historical distribution of similar incidents. Is that kind of data available anywhere?
To my understanding the main problem is DynamoDB being down, and DynamoDB is what a lot of AWS services use for their eventing systems behind the scenes. So there's probably like 500 billion unprocessed events that'll need to get processed even when they get everything back online. It's gonna be a long one.
500 billion events. It always blows my mind how many people use AWS.
I know nothing. But I'd imagine the number of 'events' generated during this period of downtime will eclipse that number every minute.
I wonder how many companies have properly designed their clients, so that the delay before a re-attempt is randomised and the retry interval grows with each attempt.
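For reference, a minimal sketch of the usual pattern (capped exponential backoff with full jitter), which is roughly what the comment describes; the wrapped call is a placeholder:

    import random
    import time

    def call_with_backoff(fn, max_attempts=8, base=0.5, cap=60.0):
        """Retry fn() with capped exponential backoff and full jitter,
        so clients don't all hammer the service in lockstep when it
        comes back."""
        for attempt in range(max_attempts):
            try:
                return fn()
            except Exception:
                if attempt == max_attempts - 1:
                    raise
                # Delay grows exponentially with the attempt number (capped);
                # the actual sleep is drawn uniformly from [0, delay].
                delay = min(cap, base * (2 ** attempt))
                time.sleep(random.uniform(0, delay))

    # Usage with a placeholder callable:
    # result = call_with_backoff(lambda: my_dynamodb_client.get_item(...))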
"I felt a great disturbance in us-east-1, as if millions of outage events suddenly cried out in terror and were suddenly silenced"
(Be interesting to see how many events currently going to DynamoDB are actually outage information.)
Yes, with no prior knowledge the mathematically correct estimate is:
time left = time so far
But as you note prior knowledge will enable a better guess.
Yeah, the Copernican Principle.
> I visited the Berlin Wall. People at the time wondered how long the Wall might last. Was it a temporary aberration, or a permanent fixture of modern Europe? Standing at the Wall in 1969, I made the following argument, using the Copernican principle. I said, Well, there’s nothing special about the timing of my visit. I’m just travelling—you know, Europe on five dollars a day—and I’m observing the Wall because it happens to be here. My visit is random in time. So if I divide the Wall’s total history, from the beginning to the end, into four quarters, and I’m located randomly somewhere in there, there’s a fifty-percent chance that I’m in the middle two quarters—that means, not in the first quarter and not in the fourth quarter.
> Let’s suppose that I’m at the beginning of that middle fifty percent. In that case, one-quarter of the Wall’s ultimate history has passed, and there are three-quarters left in the future. In that case, the future’s three times as long as the past. On the other hand, if I’m at the other end, then three-quarters have happened already, and there’s one-quarter left in the future. In that case, the future is one-third as long as the past.
https://www.newyorker.com/magazine/1999/07/12/how-to-predict...
> So if I divide the Wall’s total history, from the beginning to the end, into four quarters, and I’m located randomly somewhere in there, there’s a fifty-percent chance that I’m in the middle two quarters
How come?
Note that this is equivalent to saying "there's no way to know". This guess doesn't give any insight, it's just the function that happens to minimize the total expected error for an unknowable duration.
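For what it's worth, the interval version of the same argument is easy to compute: assuming the observation point is uniformly distributed over the outage's total lifetime (as in the quote above), the remaining duration falls in a range determined by the chosen confidence. A small sketch, with a made-up three-hour example:

    def remaining_interval(elapsed_hours, confidence=0.5):
        """Copernican / delta-t estimate: if we observe at a uniformly random
        point of the total lifetime, then with the given confidence the
        elapsed fraction f lies in [a, 1-a] with a = (1 - confidence) / 2,
        so the remaining time lies in [elapsed*a/(1-a), elapsed*(1-a)/a]."""
        a = (1 - confidence) / 2
        return elapsed_hours * a / (1 - a), elapsed_hours * (1 - a) / a

    # An outage that has lasted 3 hours so far:
    print(remaining_interval(3.0, 0.50))  # (1.0, 9.0): between 1h and 9h left
    print(remaining_interval(3.0, 0.95))  # (~0.077, 117): ~5 min to ~117h left

The point estimate "time left = time so far" is just the median of that distribution.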
Their status page (https://health.aws.amazon.com/health/status) says the only disrupted service is DynamoDB, but it's impacting 37 other services. It is amazing to see how big a blast radius a single service can have.
It's not surprising that it's impacting other services in the region because DynamoDB is one of those things that lots of other services build on top of. It is a little bit surprising that the blast radius seems to extend beyond us-east-1, mind.
In the coming hours/days we'll find out if AWS still have significant single points of failure in that region, or if _so many companies_ are just not bothering to build in redundancy to mitigate regional outages.
I'm looking forward to the RCA!
I'm real curious how much of AWS GovCloud has continued through this actually. But even if it's fine, from a strategic perspective how much damage did we just discover you could do with a targeted disruption at the right time?
AWS engineers are trained to use their internal services for each new system. They seem to like using DynamoDB. Dependencies like this should be made transparent.
Not sure why this is downvoted - this is absolutely correct.
A lot of AWS services under the hood depend on others, and especially us-east-1 is often used for things that require strong consistency like AWS console logins/etc (where you absolutely don't want a changed password or revoked session to remain valid in other regions because of eventual consistency).
Not "like using", they are mandated from the top to use DynamoDB for any storage. At my org in the retail page, you needed director approval if you wanted to use a relational DB for a production service.
It's now listing 58 impacted services, so the blast radius is growing it seems
The same page now says 58 services - just 23 minutes after your post. Seems this is becoming a larger issue.
When I first visited the page it said like 23 services, now it says 65
Have a meeting today with our AWS account team about how we're no longer going to be "all in on AWS" as we diversify workloads away. It was mostly about the pace of innovation on core services slowing and AWS being too far behind on AI services, so we're buying those from elsewhere.
The AWS team keeps touting the rock solid reliability of AWS as a reason why we shouldn’t diversify our cloud. Should be a fun meeting!
Looks like it affected Vercel, too. https://www.vercel-status.com/
My website is down :(
(EDIT: website is back up, hooray)
Static content resolves correctly but data fetching is still not functional.
Have you done anything for it to be back up? Looks like mines are still down.
Looks as if they are rerouting to a different region.
mines are generally down
Service that runs on aws is down when aws is down. Who knew.
We just had a power outage in Ashburn starting at 10 pm Sunday night. It was restored at around 3:40 am, and I know datacenters have redundant power sources, but the timing is very suspicious. The AWS outage supposedly started at midnight.
Even with redundancy, the response time between NYC and Amazon East in Ashburn is something like 10 ms. The impedance mismatch, dropped packets, and increased latency would doom most organizations' craplications.
Their latest update on the status page says it's a Dynamodb DNS issue
but the cause of that could be anything, including some kind of config getting wiped due to a temporary power outage
Careful: NPM _says_ they're up (https://status.npmjs.org/) but I am seeing a lot of packages not updating and npm install taking forever or never finishing. So hold off deploying now if you're dependent on that.
They've acknowledged an issue now on the status page. For me at least, it's completely down, package installation straight up doesn't work. Thankfully current work project uses a pull-through mirror that allows us to continue working.
"Thankfully current work project uses a pull-through mirror that allows us to continue working."
so there is no free coffee time???? lmao
Yep. It's the auditing part that is broken. As a (dangerous) workaround use --no-audit
Also npm audit times out.
Robinhood's completely down. Even their main website: https://robinhood.com/
Amazing, I wonder what their interview process is like, probably whiteboarding a next-gen LLM in WASM, meanwhile, their entire website goes down with us-east-1... I mean.
AWS truly does stand for "All Web Sites".
Internet, out.
Very big day for an engineering team indeed. Can't vibe code your way out of this issue...
Easiest day for engineers on-call everywhere except AWS staff. There’s nothing you can do except wait for AWS to come back online.
Pour one out for the customer service teams of affected businesses instead
Well, but tomorrow there will be CTOs asking for a contingency plan for when AWS goes down, even if planning, preparing, executing, and keeping it up to date as the infra evolves will cost more than the X hours of AWS outage.
There are certainly organizations for which that cost is lower than the overall damage of their services being down due to an AWS fault, but tomorrow we will hear from CTOs of smaller orgs as well.
They’ll ask, in a week they’ll have other priorities and in a month they’ll have forgotten about it.
This will hold until the next time AWS had a major outage, rinse and repeat.
It's so true it hurts. If you are new in any infra/platform management position you will be scared as hell this week. Then you will just learn that feeling will just disappear by itself in a few days.
Lots of NextJS CTOs are gonna need to think about it for the first time too
Not really true for large systems. We are doing things like deploying mitigations to avoid scale-in (e.g. services not receiving traffic incorrectly autoscaling down), preparing services for the inevitable storm, managing various circuit breakers, changing service configurations to ease the flow of traffic through the system, etc. We currently have 64 engineers in our on-call room managing this. There's plenty of work to do.
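As one hedged illustration of the scale-in mitigation mentioned above, assuming an ECS service scaled via Application Auto Scaling (the cluster and service names are placeholders):

    import boto3

    # Temporarily suspend dynamic scale-in on an ECS service so a lull in
    # (broken) upstream traffic doesn't scale capacity away right before
    # the recovery storm. Scale-out stays enabled.
    client = boto3.client("application-autoscaling", region_name="us-east-1")

    client.register_scalable_target(
        ServiceNamespace="ecs",
        ResourceId="service/prod-cluster/checkout-api",   # placeholder
        ScalableDimension="ecs:service:DesiredCount",
        SuspendedState={
            "DynamicScalingInSuspended": True,    # stop scale-in
            "DynamicScalingOutSuspended": False,  # still allow scale-out
            "ScheduledScalingSuspended": False,
        },
    )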
Can confirm, pretty chill we can blame our current issues on AWS.
and by one I trust you mean a bottle.
>Can't vibe code your way out of this issue...
I feel bad for the people impacted by the outage. But at the same time there's a part of me that says we need a cataclysmic event to shake the C-Suite out of their current mindset of laying off all of their workers to replace them with AI, the cheapest people they can find in India, or in some cases with nothing at all, in order to maximize current quarter EPS.
I expect it's their SREs who are dealing with this mess.
Pour one out for everyone on-call right now.
After some thankless years preventing outages for a big tech company, I will never take an oncall position again in my life.
Most miserable working years I have had. It's wild how normalized working on weekends and evenings becomes in teams with oncall.
But it's not normal. Our users not being able to shitpost is simply not worth my weekend or evening.
And outside of Google you don't even get paid for oncall at most big tech companies! Company losing millions of dollars an hour, but somehow not willing to pay me a dime to jump in at 3AM? Looks like it's not my problem!
> And outside of Google you don't even get paid for oncall at most big tech companies.
What the redacted?
When I used to be on call for Cisco WebEx services, I got paid extra and got extra time off, even if nothing happened. In addition, we had enough people on the rotation, so I didn't have to do it that often.
I believe the rules varied based on jurisdiction, and I think some had worse deals, and some even better. But I was happy with our setup in Norway.
Tbh I do not think we would have had what we had if it wasn't for the local laws and regulations. Sometimes worker-friendly laws can be nice.
It's also unnecessary at large companies, since there will likely be enough offices globally to have a follow-the-sun model.
Follow the sun does not happen by itself. Very few if any engineering teams are equally split across thirds of the globe in such a way that (say) Asia can cover if both EMEA and the Americas are offline.
Having two sites cover the pager is common, but even then you only have 16 working hours at best and somebody has to take the pager early/late.
"Your shitposting is very important to us, please stay on the site"
> But this is not normal. Our users not being able to shitpost is simply not worth my weekend or evening.
It is completely normal for staff to have to work 24/7 for critical services.
Plumbing, HVAC, power plant engineers, doctors, nurses, hospital support staff, taxi drivers, system and network engineers - these people keep our modern world alive, all day, every day. Weekends, midnights, holidays, every hour of every day someone is AT WORK to make sure our society functions.
Not only is it normal, it is essential and required.
It’s ok that you don’t like having to work nights or weekends or holidays. But some people absolutely have to. Be thankful there are EMTs and surgeons and power and network engineers working instead of being with their families on holidays or in the wee hours of the night.
You know, there's this thing called shifts. You should look it up.
Nice try at guilt-tripping people doing on-call, and doing it for free.
But to parent's points: if you call a plumber or HVAC tech at 3am, you'll pay for the privilege.
And doctors and nurses have shifts/rotas. At some tech places, you are expected to do your day job plus on-call. For no overtime pay. "Salaried" in the US or something like that.
And these companies often say "it's baked into your comp!" But you can typically get the same exact comp working an adjacent role with no oncall.
Then do that instead. What’s the problem with simply saying “no”?
Yup, that is precisely what I did and what I'm encouraging others to do as well.
Edit: On-call is not always disclosed. When it is, it's often understated. And finally, you can never predict being re-orged into a team with oncall.
I agree employees should still have the balls to say "no" but to imply there's no wrongdoing here on companies' parts and that it's totally okay for them to take advantage of employees like this is a bit strange.
Especially for employees that don't know to ask this question (new grads) or can't say "no" as easily (new grads or H1Bs.)
Guilt tripping? Quite the opposite.
If you or anyone else are doing on-call for no additional pay, precisely nobody is forcing you to do that. Renegotiate, or switch jobs. It was either disclosed up front or you missed your chance to say “sorry, no” when asked to do additional work without additional pay. This is not a problem with on call but a problem with spineless people-pleasers.
Every business will ask you for a better deal for them. If you say “sure” to everything you’re naturally going to lose out. It’s a mistake to do so, obviously.
An employee’s lack of boundaries is not an employer’s fault.
First, you try to normalise it:
> It is completely normal for staff to have to work 24/7 for critical services.
> Not only is it normal, it is essential and required.
Now you come with the weak "you don't have to take the job" and this gem:
> An employee’s lack of boundaries is not an employer’s fault.
As if there isn't a power imbalance, or as if employers always disclose everything and never change their minds. But of course, let's blame those entitled employees!
No one dies if our users can't shitpost until tomorrow morning.
I'm glad there are people willing to do oncall. Especially for critical services.
But the software engineering profession as a whole would benefit from negotiating concessions for oncall. We have normalized work interfering with life so the company can squeeze a couple extra millions from ads. And for what?
Nontrivial amount of ad revenue lost? Not my problem if the company can't pay me to mitigate.
> Nontrivial amount of ad revenue lost? Not my problem if the company can't pay me to mitigate.
Interestingly, when I worked on analytics around bugs we found that often (in the ads space), there actually wasn't an impact when advertisers were unable to create ads, as they just created all of them when the interface started working again.
Now, if it had been the ad serving or pacing mechanisms then it would've been a lot of money, but not all outages are created equal.
Not all websites are for shitposting. I can’t talk to my clients for whom I am on call because Signal is down. I also can’t communicate with my immediate family. There are tons of systems positively critical to society downstream from these services.
Some can tolerate downtime. Many can’t.
You could give them a phone call, you know. Pretty reliable technology.
> Can't vibe code your way out of this issue...
Exactly. This time, some LLM providers are also down and can't help vibe coders on this issue.
Qwen3 on lm-studio running fine on my work Mac M3, what's wrong with yours?
US-East-1 and its consistent problems are literally the Achilles Heel of the Internet.
r/aws not found
There aren't any communities on Reddit with that name. Double-check the community name or start a new community.
A lot of status pages hosted by Atlassian Statuspage are down! The irony…
Wonder if this is related
https://www.dockerstatus.com/pages/533c6539221ae15e3f000031
Yup
> We have identified the underlying issue with one of our cloud service providers.
Our Alexas stopped responding and my girl couldn't log in to MyFitnessPal anymore... Let me check HN for a major outage and here we are :^)
At least when us-east is down, everything is down.
When I follow the link, I arrive on a "You broke reddit" page :-o
Signal is also down for me.
My messages are not getting through, but status page seems ok.
Is there any data on which AWS regions are most reliable? I feel like every time I hear about an AWS outage it's in us-east-1.
Trouble is, one can't fully escape us-east-1. Many services are centralized there: S3, Organizations, Route53, CloudFront, etc. It is THE main region, hence it suffers the most outages and, more importantly, the most troubling outages.
We're mostly deployed on eu-west-1 but still seeing weird STS and IAM failures, likely due to internal AWS dependencies.
Also, we use Docker Hub, NPM and a bunch of other services that are hosted by their vendors on us-east-1, so even non-AWS customers often can't avoid the blast radius of us-east-1 (though the NPM issue mostly affects devs updating/adding dependencies; our CI builds use our internal mirror).
FYI: 1. AWS IAM mutations all go through us-east-1 before being replicated to other public/commercial regions. Read/List operations should use local regional stacks. I expect you'll see a concept of "home region" give you flexibility on the write path in the future. 2. STS has both global and regional endpoints. Make sure you're set up to use regional endpoints in your clients: https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credenti...
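To make point 2 concrete, a minimal sketch assuming boto3 (the truncated AWS docs link above remains the authoritative reference) of steering STS calls to a regional endpoint instead of the global one:

```python
import boto3

# Option A: pin an explicit regional endpoint so calls never depend on the
# global sts.amazonaws.com endpoint.
sts = boto3.client(
    "sts",
    region_name="eu-west-1",
    endpoint_url="https://sts.eu-west-1.amazonaws.com",
)
print(sts.get_caller_identity()["Arn"])

# Option B: rely on configuration instead, e.g. the environment variable
#   AWS_STS_REGIONAL_ENDPOINTS=regional
# or `sts_regional_endpoints = regional` in ~/.aws/config, so that a plain
# boto3.client("sts", region_name="eu-west-1") resolves regionally.
```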
Anywhere other than us-east-1 in my experience is rock solid.
us-east-1 was, probably still is, AWS' most massive deployment. Huge percentage of traffic goes through that region. Also, lots of services backhaul to that region, especially S3 and CloudFront. So even if your compute is in a different region (at Tower.dev we use eu-central-1 mostly), outages in us-east-1 can have some halo effect.
This outage seems really to be DynamoDB related, so the blast radius in services affected is going to be big. Seems they're still triaging.
Your website loads for a second and then suddenly goes blank. There is one fatal error from Framer in the console.
It is dark. You are likely to be eaten by a grue.
If you're using AWS then you are most likely using us-east-1; there is no escape. When big problems happen in us-east-1, they affect most AWS services.
I don't recommend my clients use us-east-1. It's the oldest and most prone to outages. I usually recommend us-east-2 (Ohio) unless they require the West Coast.
I'm so happy we chose Hetzner instead but unfortunately we also use Supabase (dashboard affected) and Resend (dashboard and email sending affected).
Probably makes sense to add "relies on AWS" to the criteria we're using to evaluate 3rd-party services.
Here's the AWS status page: https://health.aws.amazon.com/health/status?ts=20251020
Isn't there a better source of information than Reddit?
Probably not. The sysadmin sub is usually the first place stuff like this shows up because there’s a bunch of oncall guys there
Maybe the mods can change it to https://health.aws.amazon.com/health/status
amazon's health page is widely enjoyed as a work of fiction. community reports on places like reddit are, actually, more reliable.
Especially since it's down as well.
Lots of outage in Norway, started approximately 1 hour ago for me.
Reddit seems to be having issues too:
"upstream connect error or disconnect/reset before headers. retried and the latest reset reason: connection timeout"
>Oct 20 12:51 AM PDT We can confirm increased error rates and latencies for multiple AWS Services in the US-EAST-1 Region. This issue may also be affecting Case Creation through the AWS Support Center or the Support API. We are actively engaged and working to both mitigate the issue and understand root cause. We will provide an update in 45 minutes, or sooner if we have additional information to share.
Weird that case creation uses the same region as the case you'd like to create for.
Various AI services (e.g. Perplexity) are down as well
Just tried Perplexity and it has no answer.
Damn, this is really bad.
Looking forward to the postmortem.
Related thread: https://news.ycombinator.com/item?id=45640838
DynamoDB is performing fine in production in eu-central-1.
Seems to be really limited to us-east-1 (https://health.aws.amazon.com/health/status). I think they host a lot of console and backend stuff there.
Yet. Everything goes down the ... Bach ;)
Slack is down. Is that related? Probably is.
02:34 Pacific: Things seem to be recovering.
Bitbucket seems affected too [1]. Not sure if this status page is regional though.
[1] https://bitbucket.status.atlassian.com/incidents/p20f40pt1rg...
This is from Amazon's latest earnings call, when Andy Jassy was asked why they aren't growing as much as their competitors:
"I think if you look at what matters to customers, what they care they care a lot about what the operational performance is, you know, what the availability is, what the durability is, what the latency and throughput is of of the various services. And I think we have a pretty significant advantage in that area." also "And, yeah, you could just you just look at what's happened the last couple months. You can just see kind of adventures at some of these players almost every month. And so very big difference, I think, in security."
https://news.ycombinator.com/item?id=45640754
And https://news.ycombinator.com/item?id=45640993
In moments like this I think devs should invest in vendor independence if they can. While I'm not at that stage yet (Cloudflare dependence), using open technologies like Docker (or Kubernetes) and Traefik instead of managed services can help in these disaster situations by letting you switch to a different provider much faster than rebuilding from zero. As a disclosure, I'm still not at that point with my own infrastructure, but I'm trying to slowly define one for myself.
Of course this happens when I take a day off from work lol
Came here after the Internet felt oddly "ill" and even got issues using Medium, and sure enough https://status.medium.com
https://status.tailscale.com/ clients' auth down :( what a day
That just says the homepage and knowledge base are down and that admin access specifically isn't affected.
yep, admin panel works, but in practice my devices are logged out and there is no way to re-authorize them.
I can authenticate my devices just fine.
AWS has been the backbone of the internet. It is a single point of failure for most websites.
Other hosting services like Vercel, package managers like npm, and even the Docker registries are down because of it.
One of the radio stations I listen to is just dead air tonight. I assume this is the cause.
Amazon itself appears to be out for some products. I get a "Sorry, we couldn't find that page" when clicking on products.
I'm thinking about that one guy who clicked on "OK" or hit return.
Somebody, somewhere tried to rollback something and it failed
My Alexa is hit or miss at responding to queries right now at 5:30 AM EST. Was wondering why it wasn't answering when I woke up.
"Oct 20 2:01 AM PDT We have identified a potential root cause for error rates for the DynamoDB APIs in the US-EAST-1 Region. Based on our investigation, the issue appears to be related to DNS resolution of the DynamoDB API endpoint in US-EAST-1..."
It's always DNS...
It's not DNS
There's no way it's DNS
It was DNS
That or a Windows update.
Maybe unrelated, but yesterday I went to pick up my package from an Amazon Locker in Germany, and the display said "Service unavailable". I'll wait until later today before I go and try again.
That strange feeling of the world getting cleaner for a while without all these dependent services.
can't log into https://amazon.com either after logging out; so many downstream issues
US-East-1 is literally the Achilles Heel of the Internet.
You would think that after the previous big us-east-1 outages (to be fair there have been like 3 of them in the past decade, but still, that's plenty), companies would have started to move to other AWS regions and/or to spread workloads between them.
Exactly
Looks like a DNS issue - dynamodb.us-east-1.amazonaws.com is failing to resolve.
"Based on our investigation, the issue appears to be related to DNS resolution of the DynamoDB API endpoint in US-EAST-1."
it seems they found your comment
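For anyone who wants to verify the reported symptom themselves, a quick stdlib sketch that resolves the endpoint named in the status update (the comparison region is just an illustration):

```python
import socket

# Resolve the endpoint named in the status update, plus another region
# for comparison; a gaierror here matches the reported DNS failure.
for host in (
    "dynamodb.us-east-1.amazonaws.com",
    "dynamodb.eu-west-1.amazonaws.com",
):
    try:
        addrs = {info[4][0] for info in socket.getaddrinfo(host, 443)}
        print(host, "->", sorted(addrs))
    except socket.gaierror as exc:
        print(host, "-> resolution failed:", exc)
```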
Coinbase down as well: https://status.coinbase.com/
Best option for a whale to manipulate the price again.
I seem to recall other issues around this time in previous years. I wonder if this is some change getting shoe-horned in ahead of some reinvent release deadline...
Can't login to Jira/Confluence either.
Seems to work fine for me. I'm in Europe so maybe connecting to some deployment over here.
You are already logged in. If you try to access your account settings, for example, you will be disappointed...
The Ring (Doorbell) app isn't working, nor are any of the MBTA (Transit) status pages/apps.
My apartment uses “SmartRent” for access controls and temps in our unit. It’s down…
AWS's own management console sign-in isn't even working. This is a huge one. :(
Now, I may well be naive - but isn't the point of these systems that you fail over gracefully to another data centre and no-one notices?
I get the impression that this has been thought about to some extent, but it's a constantly changing architecture with new layers and new ideas being added, so for every bit of progress there's the chance of new single points of failure being added. This time it seems to be a DNS problem with DynamoDB.
Slack, Jira and Zoom are all sluggish for me in the UK
I wonder if that's not due to dependencies on AWS but all-hands-on-deck causing far more traffic than usual
Happened to be updating a bunch of NPM dependencies and then saw `npm i` freeze and I'm like... ugh what did I do. Then npm login wasn't working and started searching here for an outage, and wala.
voila
As of 4:26am Central Time in the USA, it's back up for one of my services.
Nowadays when this happens it's always something. "Something went wrong."
Even the error message itself is wrong whenever that one appears.
Displaying and propagating accurate error messages is an entire science unto itself... I can see why it's sometimes sensible to invest resources elsewhere and fall back to "something".
I use the term “unexpected error” because if the code got to this alert it wasn’t caught by any traps I’d made for the “expected” errors.
IMHO if error handling is rocket science, the error is you
Perhaps you're not handling enough errors ;-)
Reddit shows:
"Too many requests. Your request has been rate limited, please take a break for a couple minutes and try again."
Can't update my selfhosted HomeAssistant because HAOS depends on dockerhub which seems to be still down.
Reddit itself is breaking down and errors appear. Does Reddit itself depend on this?
My website on the cupboard laptop is fine.
Appears to have also disabled that bot on HN that would be frantically posting [dupe] in all the other AWS outage threads right about now.
Slack and Zoom working intermittently for me
It seems that all the sites that ask about distributed systems in their interviews and have their websites down right now wouldn't even pass their own interview.
This is why distributed systems is an extremely important discipline.
Maybe actually making the interviews less of a hazing ritual would help.
Hell, maybe making today's tech workplace more about getting work done instead of the series of ritualistic performances that the average tech workday has degenerated to might help too.
Ergo, your conclusion doesn't follow from your initial statements, because interviews and workplaces are both far more broken than most people, even people in the tech industry, would think.
Well, it looks like if companies and startups did their job of hiring for proper distributed systems skills, rather than hazing for the wrong skills, we wouldn't be in this outage mess.
Many companies on Vercel don't seem to have a strategy for being resilient to these outages.
I rarely see Google, Ably and others that are serious about distributed systems being down.
There was a huuuge GCP outage just a few months back: https://news.ycombinator.com/item?id=44260810
> Many companies on Vercel don't seem to have a strategy for being resilient to these outages.
But that's the job of Vercel and it looks like they did a pretty good job. They rerouted away from the broken region.
distributed systems != continuous uptime
Statuspage.io seems to load (but is slow), but what's the point if you can't post an incident because the Atlassian ID service is down.
Presumably the root cause of the major Vercel outage too: https://www.vercel-status.com/
No wonder, when I opened Vercel it showed a 502 error.
Airtable is down as well.
A lot of businesses have all their workflows depending on their data in Airtable.
Strangely, some of our services are scaling up in us-east-1, and there is a downtick on downdetector.com, so the issue might be resolving.
Both Intercom and Twilio are affected, too.
- https://status.twilio.com/
- https://www.intercomstatus.com/us-hosting
I want the web ca. 2001 back, please.
Seems to be upsetting Slack a fair bit, messages taking an age to send and OIDC login doesn't want to play.
They haven't listed SES in the affected services on their status page yet.
Asana down. Postman workspaces don't load. Slack affected. And the worst: Heroku scheduler just refused to trigger our jobs.
It's fun to see SREs jumping left and right when they can do basically nothing at all.
"Do we enable DR? Yes/No". That's all you can do. If you do, it's a whole machinery starting, which might take longer than the outage itself.
They can't even use Slack to communicate - messages are being dropped/not sent.
And then we laugh at the South Koreans for not having backed up their hard drives (which got burnt by an actual fire, a statistically far rarer event than an AWS outage). OK, that's a huge screw-up, but hey, this is not insignificant either.
What will happen now? Nothing, like nothing happened after Crowdstrike's bug last year.
Signal seems to be dead too though, which is much more of a WTF?
A decentralized messenger is Tox.
npm and pnpm are badly affected as well. Many packages are returning 502 when fetched. Such a bad time...
Yup, was releasing something to prod and can't even build a react app. I wonder if there is some sort of archive that isn't affected?
AWS CodeArtifact can act as a proxy and fetch new packages from npm when needed. A bit late for that now, but sharing it in case you want to future-proof against the yearly us-east-1 outage.
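A rough sketch of that setup, assuming boto3 and an existing CodeArtifact domain/repository with the public npm registry as an upstream (the domain and repository names here are hypothetical):

```python
import boto3

codeartifact = boto3.client("codeartifact", region_name="eu-central-1")

# Hypothetical domain/repository; the repository must be configured with
# an external connection to the public npm registry beforehand.
endpoint = codeartifact.get_repository_endpoint(
    domain="my-domain", repository="npm-mirror", format="npm"
)["repositoryEndpoint"]
token = codeartifact.get_authorization_token(domain="my-domain")["authorizationToken"]

# Values to feed into .npmrc (e.g. via `npm config set`); tokens expire,
# so CI usually regenerates these on every run.
print(f"registry={endpoint}")
print(f"//{endpoint.removeprefix('https://')}:_authToken={token}")
```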
Oh damn that ruins all our builds for regions I thought would be unaffected
I did get a 500 error from their public ECR too.
Why would us-east-1 take down many UK banks and even UK gov websites too!? Shouldn't they operate in the UK region due to GDPR?
Integration with USA for your safety :)
2 things:
1) GDPR is never enforced other than with token fines based on technicalities. The vast majority of the cookie banners you see around are not compliant, so if the regulation was actually enforced they'd be the first to go... and it would be much easier to go after those (they are visible) than to audit every company's internal codebase to check if they're sending data to a US-based provider.
2) you could technically build a service that relies on a US-based provider while not sending them any personal data or data that can be correlated with personal data.
Terraform Cloud is having problems too.
There will be a lot of systems starting cold. I am really curious to see how many will manage it without hiccups.
Snow day!
10:30 on a Monday morning and already slacking off. Life is good. Time to touch grass, everybody!
Glad all my services are either on Hetzner servers or in an EU region of AWS!
Atlassian cloud is having problems as well.
Clearly this is all some sort of mass delusion event, the Amazon Ring status says everything is working.
https://status.ring.com/
(Useless service status pages are incredibly annoying)
Atlassian is down as well so they probably can't access their Atlassian Statuspage admin panel to update it.
When you know a service is down but the service says it's up: it's either your fault or the service is having a severe issue
BGP (again)?
O ffs. I can't even access the NYT puzzles in the meantime ... Seriously disrupted, man
seems like services are slowly recovering
Considering the history of east-1 it is fascinating that it still causes so many single point of failure incidents for large enterprises.
Happy Monday People
Slack now also down: https://slack-status.com/
They are amazing at LeetCode though.
Let's be nice. I'm sure devs and ops are on fire right now, trying to fix the problems. Given the audience of HN, most of us could have been (have already been?) in that position.
No we wouldn’t because there’s like a 50/50 chance of being a H1B/L1 at AWS. They should rethink their hiring and retention strategies.
They choose their hiring-retention practices and they choose to provide global infrastructure, when is the good time to criticise them?
Granted, they are not as drunk on LLMs as Google and Microsoft. So at least we can say this outage was not vibe-coded (yet).
hugops ftw
Affecting Coinbase[1] as well, which is ridiculous. Can't access the web UI at all. At their scale and importance they should be multi-region if not multi-cloud.
[1] https://status.coinbase.com
Seems the underlying issue is with DynamoDB, according to the status page, which will have a big blast radius in other services. AWS' services form a really complicated graph and there's likely some dependency, potentially hidden, on us-east-1 in there.
The issue appears to be cascading internationally due to internal dependencies on us-east-1
Good luck to all on-callers today.
It might be an interesting exercise to map how many of our services depend on us-east-1 in one way or another. One can only hope that somebody would do something with the intel, even though it's not a feature that brings money in (at least from business perspective).
Substack seems to be lying about their status: https://substack.statuspage.io/
It's weird that we're living in a time where this could be a taste of a prolonged future global internet blackout by adversarial nations. Get used to this feeling I guess :)
Can't log into tidal for my music
Navidrome seems fine
Ring is affected. Why doesn’t Ring have failover to another region?
That's understandably bad for anyone who depends on Ring for security but arguably a net positive for the rest of us.
Amazon’s Ring to partner with Flock: https://news.ycombinator.com/item?id=45614713
It's a reminder to never rely on something as flaky as the internet for your important things.
This is such an HN response. Oh, no problem, I'll just avoid the internet for all of my important things!
Door locks, heating and household appliances should probably not depend on Internet services being available.
Do you not have a self-hosted instance of every single service you use? :/
They are probably being sarcastic.
Not very helpful. I wanted to make a very profitable trade but can’t login to my brokerage. I’m losing about ~100k right now.
what's the trade?
Probably AWS stock...
This reminds me of the twitter-based detector we had at Facebook that looked for spikes in "Facebook down" messages.
When Facebook went public, the detector became useless because it fired anytime someone wrote about the Facebook stock being down and people retweeted or shared the article.
I invested just enough time in it to decide it was better to turn it off.
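A toy sketch of the failure mode described above, with invented keywords and thresholds, purely to illustrate why posts about the stock being down swamped the signal after the IPO:

```python
from collections import deque

WINDOW = 1000      # recent posts to keep in the rolling window
THRESHOLD = 50     # matches in the window that count as a "down" spike
recent = deque(maxlen=WINDOW)

def looks_like_outage_report(post: str) -> bool:
    text = post.lower()
    if "facebook" not in text or "down" not in text:
        return False
    # Post-IPO problem: "Facebook stock down 5%" matches the naive rule,
    # so stock-related terms have to be filtered out explicitly.
    return not any(term in text for term in ("stock", "shares", "$fb", "ipo"))

def observe(post: str) -> bool:
    """Return True when the rolling window crosses the spike threshold."""
    recent.append(looks_like_outage_report(post))
    return sum(recent) >= THRESHOLD
```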
Beyond Meat
I love this to be honest. Validates my anti cloud stance.
No service that does not run on cloud has ever had outages.
But at least a service that doesn't run on cloud doesn't pay the 1000% premium for its supposed "uptime".
At least its in my control :)
Not having control or not being responsible are perhaps major selling points of cloud solutions. To each their own, I also rather have control than having to deal with a cloud provider support as a tiny insignificant customer. But in this case, we can take a break and come back once it's fixed without stressing.
Businesses not taking responsibility for their own business should not exist in the first place...