I'm so surprised there is so much pushback against this. AWS is extremely expensive. The use cases for setting up your system or service entirely in AWS are rarer than people seem to realise. Maybe I'm just the old man screaming at cloud (no pun intended), but when did people forget how to run a bare metal server?
> We have 730+ days with 99.993% measured availability and we also escaped AWS region wide downtime that happened a week ago.
This is a very nice brag. Given they run their DDoS-protection ingress via Cloudflare there is that dependency, but in that case I can 100% agree that DNS and ingress can absolutely be a full time job. Running some microservices and a database absolutely is not. If your teams are constantly monitoring and adjusting them (scaling, for instance), then the problem is the design, not the hosting.
Unless you're a small company serving up billions of heavy requests an hour, I would put money on the bet that AWS is overcharging you.
The direct cost is the easy part. The more insidious part is that you're now cultivating a growing staff of technologists whose careers depend on doing things the AWS way, getting AWS certified to ensure they build your systems the AWS Well-Architected way instead of thinking for themselves, and who can upsell you on AWS lock-in solutions using AWS-provided soundbites and sales arguments.
("Shall we make the app very resilient to failure? Yes, running in multiple regions makes the AWS bill bigger, but you'll get far fewer outages, look at all this technobabble that proves it")
And of course AWS lock-in services are priced to look cheap compared to their overpriced standard offerings[1]: if you just spend the engineering and IaC coding effort to move onto them, the "savings" can be put toward more AWS cloud engineering effort, which again makes your cloud eng org bigger and more important.
[1] For example, moving your app off containers to Lambda, or the DB off PostgreSQL to DynamoDB, etc.
I don't think it is easy. I see most organizations struggle with the fact that everything is throttled in the cloud. CPU, storage, network. Tenants often discover large amounts of activity they were previously unaware of, that contributes to the usage and cost. And there may be individuals or teams creating new usages that are grossly impacting their allocation. Did you know there is a setting in MS SQL Server that impacts performance by an order of magnitude when sending/receiving data from the Cloud to your on-premises servers? It's the default in the ORM generated settings.
Then you can start adding in the Cloud value, such as incomprehensible networking diagrams that are probably non-compliant in some way (guess which ones!), and security? What is it?
> Did you know there is a setting in MS SQL Server that impacts performance by an order of magnitude when sending/receiving data from the Cloud to your on-premises servers? It's the default in the ORM generated settings.
Unfortunately it's not, and it gets more difficult the more cloud-y your app gets.
You can pay for EC2+EBS+network costs, or you can have a fancy cloud-native solution where you pay for Lambda, ALBs, CloudWatch metrics, Secrets Manager... things you'd assume they would just give you (like eating at a restaurant, where you probably wouldn't expect to pay for the parking, the toilet, or rent for the table and seats).
So cloud billing is its own science and art - and in most orgs devs don't even know how much the stuff they're building costs, until finance people start complaining about the monthly bills.
We run regular FinOps meetings within departments, so everyone’s aware. I think everyone should. But it’s a lot of overhead of course. So a dev is concerned not only with DevOps anymore but with DevSecFinOps. Not everyone can cope with so many aspects at once. There’s a lot of complexity creep in that.
Yeah, AWS has the billing panel. That's where I usually discover that, after making a rough estimate of how much the thing I'm building should cost by studying the relevant pricing tables, I end up with stuff costing twice as much, because on top of the expected items there's always a ton of miscellaneous stuff I never thought about.
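To make that concrete, here's a minimal sketch (assuming boto3 credentials and Cost Explorer enabled on the account) of pulling last month's spend broken down by service, so the miscellaneous items show up explicitly rather than as a surprise on the invoice:

```python
# Sketch: list last month's AWS spend broken down by service.
# Assumes boto3 credentials are configured and Cost Explorer is enabled.
import boto3
from datetime import date, timedelta

ce = boto3.client("ce", region_name="us-east-1")  # Cost Explorer lives in us-east-1

end = date.today().replace(day=1)                  # first day of this month
start = (end - timedelta(days=1)).replace(day=1)   # first day of last month

resp = ce.get_cost_and_usage(
    TimePeriod={"Start": start.isoformat(), "End": end.isoformat()},
    Granularity="MONTHLY",
    Metrics=["UnblendedCost"],
    GroupBy=[{"Type": "DIMENSION", "Key": "SERVICE"}],
)

rows = [
    (g["Keys"][0], float(g["Metrics"]["UnblendedCost"]["Amount"]))
    for g in resp["ResultsByTime"][0]["Groups"]
]
for service, cost in sorted(rows, key=lambda r: -r[1]):
    print(f"{cost:12.2f}  {service}")
```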
I was about to rage at you over the first sentence, because this is so often how people start trying to argue bare metal setups are expensive. But after reading the rest: 100% this. I see so many people push AWS setups not because it's the best thing - it can be if you're not cost sensitive - but because it is what they know and they push what they know instead of evaluating the actual requirements.
The weird thing is I'm old enough to have grown up in the pre-cloud world, and most of the stuff, like file servers, proxies, DBs, etc., isn't any more difficult to set up than AWS stuff; it's just that the skills are different.
Also there's a mindset difference - if I gave you a server with 32 cores you wouldn't design a microservice system on it, would you? After all there's nowhere to scale to.
But with AWS, you're sold the story of infinite compute you can just expect to be there, but you'll quickly find out just how stingy they can get with giving you more hardware automatically to scale to.
I don't dislike AWS, but I feel this promise of false abundance has driven the growth in complexity and resource use of the backend.
Reality tends to be that you hit a bottleneck you have a hard time optimizing away; the more complex your architecture, the harder that gets, and then you can stew.
Well, they aren't wrong about the bare metal either: Every organization ends up tied to their staff, and said staff was hired to work on the stack you are using. People end up in quite the fights because their supposed experts are more fond of uniformity and learning nothing new.
Many a company was stuck with a datacenter unit that was unresponsive to the company's needs, and people migrated to AWS to avoid dealing with them. This straight out happened in front of my eyes multiple times. At the same time, you also end up in AWS, or even within AWS, using tools that are extremely expensive, because the cost-benefit analysis for the individuals making the decision, who often don't know very much other than what they use right now, is just wrong for the company. The executive on top is often either not much of a technologist or 20 years out of date, so they have no way to discern the quality of their staff. Technical disagreements? They might only know who they like to hang out with, but that's where it ends.
So for path-dependent reasons, companies end up making a lot of decisions that in retrospect seem very poor. In startups it often just kills the company. Just don't assume the error is always in one direction.
Sure but I have seen the exact same thing happen with AWS.
In a large company I worked at, the Ops team that had the keys to AWS was taking literal months to push things to the cloud, causing problems with bonuses and promotions. Security measures were not in place, so there were cyberattacks. Passwords of critical services lapsed because they were not paying attention.
At some point it got so bad that the entire team was demoted, lost privileges, and contractors had to jump in. The CTO was almost fired.
It took months to recover and even to get to an acceptable state, because nothing was really documented.
It's simple enough to hire people with experience with both, or pay someone else to do it for you. These skills aren't that hard to find.
If you hire people that are not responsive to your needs, then, sure, that is a problem that will be a problem irrespective of what their pet stack is.
> Many a company was stuck with a datacenter unit that was unresponsive to the company's needs
I'd like to +1 here - it's an understated risk if you've got datacenter-scale workloads. But! You can host a lot of compute on a couple racks nowadays, so IMHO it's a problem only if you're too successful and get complacent. In the datacenter, creative destruction is a must and crucially finance must be made to understand this, or they'll give you budget targets which can only mean ossification.
> said staff was hired to work on the stack you are using
Looking back at hiring decisions I've been part of at various levels of organizations, this is probably the single biggest mistake I've made multiple times: hiring specific people for a specific technology because that's what we were specifically using.
You'll end up with a team unwilling to change, because "you hired me for this, even if it's best for the business with something else, this is what I do".
Once I and the organizations shifted our mindset to hiring people who are more flexible (people who may have expertise in one or two specific technologies, but who won't put their head in the sand whenever changes come up), everything became a lot easier.
Exactly. If someone has "Cloud Engineer" in the headline of their resume instead of "DevOps Engineer", it's already a warning sign worth probing. If someone has "AWS|VMWare Engineer" in their bio, it's a giant red flag to me. Sometimes it's just people being aware of where they'll find demand, but often it's indicative of someone who will push their pet stack - and it doesn't matter whether it's VMWare on-prem or AWS (both purely as examples), it's equally bad if they identify with a specific stack, irrespective of what that stack is.
I'll also tend to look closely at whether people have "gotten stuck" specialising in a single stack. It won't make me turn them down, but it will make me ask extra questions to determine how open they are to alternatives when suitable.
The entire value proposition of AWS vs running one's own server is basically this: is it easier to ask for permission, or forgiveness? You're asking for permission to get a million dollars worth of servers / hardware / power upgrades now, or you're asking for forgiveness for spending five million dollars in AWS after 10 months. Which will be easy: permission or forgiveness?
Your comment also jogged my memory of how terrible bare metal days used to be. I think now with containers it can be better but the other reason so many switched to cloud is we don’t need to think about buying the bare metal ahead of time. We don’t need to justify it to a DevOps gatekeeper.
That so many people remember bare metal as of 20+ years ago is a large part of the problem.
A modern server can be power cycled remotely, can be reinstalled remotely over networked media, can have its console streamed remotely, can have fans etc. checked remotely without access to the OS it's running, and so on. It's not very different from managing a cloud: any reasonable server hardware has management boards. Even if you rent space in a colo, most of the time you don't need to set foot there other than for the initial setup (and you can rent people to do that too).
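As an illustration of how close this is to a cloud API, here's a sketch of power-cycling and health-checking a box through a Redfish-capable management board (iDRAC, iLO, etc.). The BMC address, credentials and the Systems/1 path are placeholders you'd look up on the real hardware:

```python
# Sketch: remote power-cycle and sensor check via a Redfish-capable BMC,
# no access to the running OS required.
import requests

BMC = "https://bmc.example.internal"   # hypothetical management address
AUTH = ("admin", "secret")             # use real credentials and proper TLS in practice

# Read power state and overall health without touching the OS.
system = requests.get(f"{BMC}/redfish/v1/Systems/1", auth=AUTH, verify=False).json()
print(system.get("PowerState"), system.get("Status", {}).get("Health"))

# Force a reboot, exactly like pressing the reset button in the rack.
requests.post(
    f"{BMC}/redfish/v1/Systems/1/Actions/ComputerSystem.Reset",
    json={"ResetType": "ForceRestart"},
    auth=AUTH,
    verify=False,
)
```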
But for most people, bare metal will tend to mean renting bare metal servers already configured anyway.
When the first thing you then tend to do is to deploy a container runtime and an orchestrator, you're effectively usually left with something more or less (depending on your needs) like a private cloud.
As for "buying ahead of time", most managed server providers and some colo operators also offer cloud services, so that even if you don't want to deal with a multi-provider setup, you can still generally scale into cloud instances as needed if your provider can't bring new hardware up fast enough (but many managed server providers can do that in less than a day too).
I never think about buying ahead of time. It hasn't been a thing I've had to worry about for a decade or more.
> I see so many people push AWS setups not because it's the best thing - it can be if you're not cost sensitive - but because it is what they know and they push what they know instead of evaluating the actual requirements.
I kinda feel like this argument could be used against programming in essentially any language. Your company, or you yourself, likely chose to develop using (whatever language it is) because that's what you knew and what your developers knew. Maybe it would have been some percentage more efficient to use another language, but then you and everyone else has to learn it.
It's the same with cloud vs bare metal, though at least in the cloud, if you're using the right services and someone asked you tomorrow to scale 100x, you likely could during the workday.
And generally speaking, if your problem is at a scale where bare metal is trivial to implement, it's likely we're only talking about a few hundred dollars a month being 'wasted' in AWS. Which is nothing to most companies, especially when they'd have to consider developer/devops time.
> if someone asked you tomorrow to scale 100x you likely could during the workday.
I've never seen a cloud setup where that was true.
For starters: most cloud providers will impose limits on you, which often means going 100x would involve pleading with account managers to have limits lifted and/or scrounging up a new, previously untested combination of instance sizes.
But secondly, you'll tend to run into unknown bottlenecks long before that.
And so, in fact, if that is a thing you actually want to be able to do, you need to actually test it.
But it's also generally not a real problem. I more often come across the opposite: Customers who've gotten hit with a crazy bill because of a problem rather than real use.
But it's also easy enough to set up a hybrid setup that will spin up cloud instances if/when you have a genuine need to scale up faster than you can provision new bare metal instances. You'll typically run an orchestrator and run everything in containers on a bare metal setup too, so typically it only requires keeping an auto-scaling group scaled down to 0, warming it up if load nears a critical level on your bare metal environment, and then flipping a switch in your load balancer to start directing traffic there. It's not a complicated thing to do.
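For illustration, a rough sketch of the "flip a switch" part, assuming (purely as an example) the burst capacity sits behind an AWS auto-scaling group and a weighted ALB listener; the names and ARNs are placeholders, and with an haproxy/nginx front end the equivalent is just a backend weight change:

```python
# Sketch: warm up a normally-empty cloud auto-scaling group and shift a share
# of traffic to it when the bare metal environment runs hot.
import boto3

autoscaling = boto3.client("autoscaling")
elbv2 = boto3.client("elbv2")

ASG_NAME = "burst-workers"                                   # placeholder
LISTENER_ARN = "arn:aws:elasticloadbalancing:placeholder:listener"
BARE_METAL_TG = "arn:aws:elasticloadbalancing:placeholder:tg-baremetal"
CLOUD_TG = "arn:aws:elasticloadbalancing:placeholder:tg-cloud-burst"

def burst_into_cloud(instances: int, cloud_weight: int) -> None:
    """Scale the (normally 0-sized) ASG up and send cloud_weight% of traffic there."""
    autoscaling.set_desired_capacity(
        AutoScalingGroupName=ASG_NAME, DesiredCapacity=instances, HonorCooldown=False
    )
    elbv2.modify_listener(
        ListenerArn=LISTENER_ARN,
        DefaultActions=[{
            "Type": "forward",
            "ForwardConfig": {"TargetGroups": [
                {"TargetGroupArn": BARE_METAL_TG, "Weight": 100 - cloud_weight},
                {"TargetGroupArn": CLOUD_TG, "Weight": cloud_weight},
            ]},
        }],
    )

# e.g. when bare metal load nears a critical threshold:
# burst_into_cloud(instances=10, cloud_weight=20)
```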
Now, incidentally, your bare metal setup is even cheaper because you can get away with a higher load factor when you can scale into cloud to take spikes.
> And generally speaking, if your problem is at a scale where bare metal is trivial to implement, it's likely we're only talking about a few hundred dollars a month being 'wasted' in AWS. Which is nothing to most companies, especially when they'd have to consider developer/devops time.
Generally speaking, I only relatively rarely work on systems that cost less than in the tens of thousands per month and up, and what I consistently see with my customers is that the higher the cost, the bigger the bare-metal advantage tends to be as it allows you to readily amortise initial setup costs of more streamlined/advanced setups. The few places where cloud wins on cost is the very smallest systems, typically <$5k/month.
It’s a marketing trap. But also a job guarantee, since everyone’s in the same trap. You’ve got a couple of cloud engineers or "DevOps" that lobby for AWS or any other hyperscaler, naive managers that write down some decision report littered with logical fallacies, and a few years in the sunk cost is so high you can’t get off of it, and instead of doing productive work you’re sitting in myriad FinOps meetings, where even fewer understand what’s going on.
Engineering managers are promised cost savings on the HR level. Corporate finance managers are promised an OpEx for CapEx trade-off; the books look better immediately. Cloud engineers embark on their AWS certification journey, promised an uptick to their salaries. It’s a win/win for everyone in isolation, a local optimum for everyone, but the organization now has to pay way more than it (hypothetically) would have been paying for bare metal ops. And hypothetical arguments are futile.
And it lends itself well to overengineering and the microservices cargo cult. Your company ends up with a system distributed around the globe, across multiple AZs per region of business operations, striving to shave those 100ms off your clients’ RTT. But it’s outgrown your comprehension, and it’s slow anyway, and you can’t scale up because it’s expensive. And instead of having one problem, you now have 99 and your bill is one.
My last team decided to hand-manage a Memcached cluster because running it ourselves, unmanaged, cost half as much as AWS’s managed alternative. I don’t know how much we really saved once you count the opportunity cost of dev time, though. It’s probably close to negative.
One of the issues there is that picking a managed service deprives your people of gaining extra experience. There’s a synergy over time, the more you manage yourself. But it’s totally justified to pick a managed service if it checks out for your budget. The problem I often saw was bad decision making, bad opportunity cost estimation. In other words, there’s an opportunity cost to picking the managed service too, and the two more or less offset each other.
My manager wants me to get this silly AWS certification.
Let me go on a tangent about trains. In Spain, before you board a high-speed train you need to go through a full security check, like at an airport. In all other EU countries you just show up and board, but in Spain there's the security check. The problem is that even though the security check is expensive, inefficient theatre, nobody wants to be the politician who removed it, just in case something does blow up. There will be no reward for a politician who makes life marginally easier for lots of people, but there will be severe punishment for a politician who is involved in a potential terrorist attack, even if the chance of that happening is ridiculously small.
This is exactly why so many companies love to be balls deep in the AWS ecosystem, even if it's expensive.
> In all other EU countries you just show up and board, but in Spain there's the security check
Just for curiosity's sake, did any other EU countries have any recent terrorist attacks involving bombs on trains in the capital, or is Spain so far alone with this experience?
AFAIK, there is no security scanning on the metro/"tube" in Spain either, it's on the national train lines.
Edit: Also, after looking it up, it seems like London did add temporary security scanners at some locations in the wake of those bombings, although they weren't permanent.
Russia is the only other European country besides Spain that after train bombings added permanent security scanners. Belgium, France and a bunch of other countries have had train bombings, but none of them added permanent scanners like Spain or Russia did.
AWS may be overcharging but it's a balancing act. Going on-prem (well, shared DC) will be cheaper but comes with requirements for either jack of all trades sysadmins or a bunch of specialists. It can work well if your product is simple and scalable. A lot of places quietly achieve this.
That said, I've seen real world scenarios where complexity is up the wazoo and an opex cost focus means you're hiring under skilled staff to manage offerings built on components with low sticker prices. Throw in a bit of the old NIH mindset (DIY all the things!) and it's large blast radii with expensive service credits being dished out to customers regularly. On a human factors front your team will be seeing countless middle of the night conference calls.
While I'm not 100% happy with the AWS/Azure/GCP world, the reality is that on-prem skillsets are becoming rarer and more specialist. Hiring good people can be either really expensive or a bit of a unicorn hunt.
It's a chicken and egg problem. If the cloud hadn't become such a prominent thing, the last decade and a half would have seen the rise of much better tools to manage on-premise servers (i.e. requiring less in-depth sysadmin expertise). I think we're starting to see such tools appear in the last few years, after enough people got burned by cloud bills and lock-in.
And don't forget the real crux of the problem: do I even know whether a specialist is good or not? Hiring experts is really difficult if you don't have the skill in the topic, and if you do, you either don't need an expert, or you will be biased towards those who agree with you.
It's not even limited to sysadmins, or in tech. How do you know whether a mechanic is very good, or iffy? Is a financial advisor giving you good advice, or basically robbing you? It's not as if many companies are going to hire 4 business units worth of on prem admins, and then decide which one does better after running for 3 years, or something empirical like that. You might be the poor sob that hires the very expensive, yet incompetent and out of date specialist, whose only remaining good skill is selling confidence to employers.
> Do I even know whether a specialist is good or not?
Of course but unless I misunderstood what you meant to say, you don't escape that by buying from AWS. It's just that instead of "sysadmin specialists" you need "AWS specialists".
If you want to outsource the job then you need to go up at least 1 more layer of abstraction (and likely an order of magnitude in price) and buy fully managed services.
This only gets worse as you go higher in management. How does a technical founder know what good sales or marketing looks like? They are often swayed by people who can talk a good talk and deliver nothing.
The good news with marketing and sales is that you want the people who talk a good talk, so you're halfway there, you just gotta direct them towards the market and away from bilking you.
At the same time, the incredible complexity of the software infrastructure is making specialists more and more useless. To the point that almost every successful specialist out there is just some disguised generalist that decided to focus their presentation in a single area.
Maybe everyone is retaining generalists. I keep being given retention bonuses every year, without asking for a single one so far.
As mentioned below, never labeled "full stack", never plan on it. "Generalist" is what my actual title became back in the mid 2000s. My career has been all over the place... the key is being stubborn when confronted with challenges and being able to scale up (mentally and sometimes physically) to meet the needs, when needed. And chill out when it's not.
I throw up in my mouth every time I see "full stack" in a job listing.
We got rid of roles... DBAs, QA teams, sysadmins, then front and back end. Full Stack is the "webmaster" of the modern era. It might mean front and back end; it might mean sysadmin and DBA as well.
You can easily get your service up by asking Claude Code or whatever to just do it.
It produces AWS YAML that’s better than what many devops people I’ve worked with produce. In other words, it absolutely should not be trusted with trivial tasks, but you could easily blow $100Ks per year on worse.
I've been contemplating this a lot lately, as I just did a code review on a system that was moving all the AWS infrastructure into CDK, and it was very clear the person doing it was using an LLM, which created a really complicated, over-engineered solution to everything. I basically rewrote the entire thing (still pairing with Claude), and it's now much simpler and easier to follow.
So I think for developers that have deep experience with systems, LLMs are great -- I did a huge migration in a few weeks that probably would have taken many months or even half a year before. But I worry that people who don't really know what's going on will end up with a horrible mess of infra code.
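For what it's worth, the kind of small, readable stack I mean looks roughly like this (Python CDK v2; the resources and names are purely illustrative, not the actual system above):

```python
# A minimal sketch of a small, readable CDK stack: one stack, two resources,
# no indirection. Resource names and the lambda asset path are illustrative.
from aws_cdk import App, Stack, RemovalPolicy, aws_s3 as s3, aws_lambda as _lambda
from constructs import Construct

class IngestStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        # Versioned bucket for raw data, kept if the stack is deleted.
        bucket = s3.Bucket(self, "RawData",
                           versioned=True,
                           removal_policy=RemovalPolicy.RETAIN)

        # One function that writes into the bucket.
        fn = _lambda.Function(self, "Ingest",
                              runtime=_lambda.Runtime.PYTHON_3_11,
                              handler="app.handler",
                              code=_lambda.Code.from_asset("lambda/"))
        bucket.grant_write(fn)

app = App()
IngestStack(app, "ingest")
app.synth()
```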
To me it's clear that most Ops engineers are vibe coding their scripts/yamls today.
The time it takes to have a script ready has decreased dramatically in the last 3 years. The number of problems when deploying the first time has also increased over the same period.
The difference between the ones who actually know what they're doing and the ones who don't is whether they will refactor and test.
It depends upon how many resources your software needs. At 20 servers we spend almost zero time managing our servers, and with modern hardware 20 servers can get you a lot.
It's easier than ever to do this, but people are doing it less and less.
Managed servers reduce the on-prem skillset requirement and can also deliver a lot of value.
The most frustrating part of hyperscalers is that it's so easy to make mistakes. Active tracking of your bill is a must, but the data is 24-48h late in some cases. So a single engineer can cause 5-figure regrettable spend very quickly.
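One cheap mitigation is a budget alert that fires before the delayed cost data becomes a shock. A sketch with boto3, where the account ID, limit and email are placeholders:

```python
# Sketch: monthly cost budget that emails someone at 80% of the limit,
# so a runaway experiment gets noticed despite the reporting lag.
import boto3

budgets = boto3.client("budgets", region_name="us-east-1")

budgets.create_budget(
    AccountId="123456789012",                      # placeholder account ID
    Budget={
        "BudgetName": "monthly-guardrail",
        "BudgetLimit": {"Amount": "5000", "Unit": "USD"},
        "TimeUnit": "MONTHLY",
        "BudgetType": "COST",
    },
    NotificationsWithSubscribers=[{
        "Notification": {
            "NotificationType": "ACTUAL",
            "ComparisonOperator": "GREATER_THAN",
            "Threshold": 80.0,                     # alert at 80% of the limit
            "ThresholdType": "PERCENTAGE",
        },
        "Subscribers": [{"SubscriptionType": "EMAIL", "Address": "ops@example.com"}],
    }],
)
```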
> I'm so surprised there is so much pushback against this. AWS is extremely expensive.
Basic rationalization. People will go to extraordinary lengths to justify and defend the choices they made. It's a defense mechanism: if they spent millions on AWS they are not going to sit idly while HN discusses saving hundreds of thousands with everyone nodding and agreeing. It's important for their own sanity to defend the choice they made.
> I'm so surprised there is so much pushback against this
I'm not. It seems to be happening a lot. Any time a topic about not using AWS comes up here or on Reddit, there's a sudden surge of people appearing out of nowhere shouting down anyone who suggests other options. It's honestly starting to feel like paid shilling.
I don’t think it’s paid shilling, it’s dogma that reflects where people are working here. The individual engineers are hammers and AWS is the nail.
AWS/Azure/GCP is great, but like any tool or platform you need to do some financial/process engineering to make an optimal choice. For small companies, time to market is often key, hence AWS.
Once you’re a little bigger, you may develop frameworks to operate efficiently. I have apps that I run in a data center because they’d cost 10-20x at a cloud provider. Conversely, I have apps that get more favorable licensing terms in AWS, so I run them there even though the compute is slower and less efficient.
You also have people who treat AWS with the old “nobody gets fired for buying IBM” mentality.
I think a lot of engineers who remember the bare metal days have legitimate qualms about going back to the way that world used to work especially before containerization/Kubernetes.
I imagine a lot of people who use Linux/AWS now started out with bare metal Microsoft/VMWare/Oracle type of environments where AWS services seemed like a massive breath of fresh air.
I remember having to put in orders for pallets of servers which then ended up in storage somewhere, because there were not enough people to carry and wire them up and/or there wasn't enough rack space to install them.
Having an ability to spin up a server or a vm when you need it without having to ask a single question is very liberating. Sometimes such elasticity is exactly what's needed. OTOH other people's servers aren't always the wise choice, but you have to know both environments to make the right choice, and nowadays I feel most people don't really know anything about bare metal.
That only happens when you have your own data center. That's a whole different issue and most people with their own hardware don't have their own data centers as it's not particularly cost efficient except at incredibly large scale.
Luckily, Amazon is far from the only VM provider out there, so this discussion doesn't need to be polarized between "AWS everything" and "on-premise everything". You can rent VMs elsewhere for a fraction of the cost. There are many places that will rent you bare metal servers by the hour, just as if they were VMs. You can even mix VMs and bare metal servers in the same datacenter.
No doubt -- there are plenty of downsides to running your own stuff. I'm not anti-AWS. I'm pro-efficiency, and pro making deliberate choices. If the choice is to spend $10M extra on AWS because the engineers get a good vibe, there should be a compelling reason why that vibe is worth $10M. (And there may well be.)
Look at what Amazon/Google/Microsoft does. If you told me you advocate running your own power plants, I'd eyeroll. But... if you're as large a power consumer as a hyper-scaler, totally different story. Google and Microsoft are investing in lighting up old nuclear plants.
> It's honestly starting to feel like paid shilling.
the companies selling Cloud are also massive IT giants with unlimited compute resources and extensive online marketing operations.
like of fucking course they're using shillbots, they run the backend shillbot infrastructure.
they literally have LLM chatbot agents as an offering, and it's trivially easy to create fake users and repost / retweet last week's comments to create realistic-looking accounts, which then shill hard for whatever their goals are.
It’s the current version of CCIE or some of the other certs. People pay money to learn how to operate AWS; other things erode the value of their investment.
The careers of a lot of people here were made by moving into AWS. A lot of future careers will be made by moving out of AWS. That's just the tech treadmill in action.
I think some of that is a certain group of people will do anything to play with the new shiny stuff. In my org it's cloud and now GPU.
The cloud stuff is extremely expensive and doesn't work any better than our existing solutions. Like a commentator said below, it's insidious as your entire organization later becomes dependent on that. If you buy a cloud solution, you're also stuck with the vendor deciding to double the cost of the product once you're locked in.
The GPU stuff is annoying as all of our needs are fine with normal CPU workloads today. There are no performance issues, so again...what's the point? Well... somebody wants to play with GPUs I guess.
If your spend is less than a few thousand per month, using cloud services is a no-brainer. For most startups starting up, their spend is minimal, so launching on the cloud is the default (and correct!) option.
Migrating to lower cost options thereafter when scaling is prudent, but you "build one to throw away", as it were.
I'm not either. I used to do fully managed hosting solutions at a datacenter. I had to do everything from hardware through debugging customer applications. Now, people pay me to do the same but on cloud platforms and the occasional on-prem stuff. In general, the younger people I've come across have no idea how to set anything up. They've always just used awscli, the AWS Console, or terraform. I've even been ridiculed for suggesting people not use AWS. Thing is, public cloud really killed my passion for the industry in general.
Beyond public cloud being bad for the planet, I also hate that it drains companies of money, centralizes everyone's risk, and helps to entrench Amazon as yet another tech oligarchic fiefdom. For most people, these things just don't matter apparently.
I think people who lived through the time when their servers were down because the admin forgot to turn them back on after he drove 50 miles back from the colo might not want to live through that again.
> I'm so surprised there is so much pushback against this. AWS is extremely expensive.
I see more comments in favor than pushing back.
The problem I have with these stories is the confirmation bias that comes with them. Going self-hosted or on-premises does make sense in some carefully selected use cases, but I have dozens of stories of startup teams spinning their wheels with self-hosting strategies that turn into a big waste of time and headcount that they should have been using to grow their businesses instead.
The shared theme of all of the failure stories is missing the true cost of self-hosting: The hours spent getting the servers just right, managing the hosting, debating the best way to run things, and dealing with little issues add up but are easily lost in the noise if you’re not looking closely. Everyone goes through a honeymoon phase where the servers arrive and your software is up and running and you’re busy patting yourselves on the back about how you’re saving money. The real test comes 12 months later when the person who last set up the servers has left for a new job and the team is trying to do forensics to understand why the documentation they wrote doesn’t actually match what’s happening on the servers, or your project managers look back at the sprints and realize that the average time spent on self-hosting related tasks and ideas has added up to a lot more than anyone would have guessed.
Those stories aren’t shared as often. When they are, they’re not upvoted. A lot of people in my local startup scene have sheepish stories about how they finally threw in the towel on self-hosting and went to AWS and got back to focusing on their core product. Few people are writing blog posts about that because it’s not a story people want to hear. We like the heroic stories where someone sets up some servers and everything just works perfectly and there are no downsides.
You really need to weigh the tradeoffs, but many people are not equipped to do that. They just think their chosen solution will be perfect and the other side will be the bad one.
> I have dozens of stories of startup teams spinning their wheels with self-hosting strategies that turn into a big waste of time and headcount that they should have been using to grow their businesses instead.
Funnily enough, the article even affirms this, though most people seemed to have skimmed over it (or not read it at all).
> Cloud-first was the right call for our first five years. Bare metal became the right call once our compute footprint, data gravity, and independence requirements stabilised.
Unless you've got uncommon data egress requirements, if you're worried about optimizing cloud spend instead of growing your business in the first 5 years you're almost certainly focusing on the wrong problem.
> You really need to weigh the tradeoffs, but many people are not equipped to do that. They just think their chosen solution will be perfect and the other side will be the bad one.
This too. Most of the massive AWS savings articles in the past few days have been from companies that do a massive amount of data egress i.e. video transfer, or in this case log data. If your product is sending out multiple terabytes of data monthly, hosting everything on AWS is certainly not the right choice. If your product is a typical n-tier webapp with database, web servers, load balancer, and some static assets, you're going to be wasting tons of time reinventing the wheel when you can spin up everything with redundancy & backups on AWS (or GCP, or Azure) in 30 minutes.
> The shared theme of all of the failure stories is missing the true cost of self-hosting: The hours spent getting the servers just right, managing the hosting, debating the best way to run things, and dealing with little issues add up but are easily lost in the noise if you’re not looking closely.
What the modern software business seems to have lost is the understanding that ops and dev are two different universes. DevOps was a reaction to the fact that even outsourcing ops to AWS doesn’t entirely solve all of your ops problems and the role is absolutely no substitute for a systems administrator. Having someone that helps derive the requirements for your infrastructure, then designs it, builds it , backs it up, maintains it, troubleshoots it, monitors performance, determines appropriate redundancy, etc. etc. etc. and then tells the developers how to work with it is the missing link. Hit-by-a-bus documentation, support and update procedures, security incident response… these are all problems we solved a long time ago, but sort of forgot about moving everything to cloud architecture.
> DevOps was a reaction to the fact that even outsourcing ops to AWS doesn’t entirely solve all of your ops problems and the role is absolutely no substitute for a systems administrator.
This is revisionist history. DevOps was a reaction to the fact that many/most software development organizations had a clear separation between "developers" and "sysadmins". Developers' responsibility ended when they compiled an EXE/JAR file/whatever, then they tossed it over the fence to the sysadmins who were responsible for running it. DevOps was the realization that, huh, software works better when the people responsible for building it ("Dev") are also the people responsible for keeping it running ("Ops").
> DevOps was a reaction to the fact that even outsourcing ops to AWS doesn’t entirely solve all of your ops problems
DevOps, conceptually, goes back to the 90s. I was using the term in 2001. If memory serves, AWS didn't really start to take off until the mid/late aughts, or at least not until they launched S3.
DevOps was a reaction to the software lifecycle problem and didn't have anything to do with AWS. If anything it's the other way around: AWS and cloud hosting gained popularity in part due to DevOps culture.
> What the modern software business seems to have lost is the understanding that ops and dev are two different universes.
This is a fascinating take, if you ask me, treating them as separate is the whole problem!
The point of being an engineer is to solve real world problems, not to live inside your own little specialist world.
Obviously there's a lot to be said for being really good at a specialized set of skills, but thats only relevant to the part where you're actually solving problems.
A large part of the different views on this topic is due to the way people estimate the amount of effort and money saved by pushing some admin duties to the cloud provider instead of handling them yourself. And people come to vastly different conclusions on this aspect.
It's also that the requirements vary a lot, discussions here on HN often seem to assume that you need HA and lots of scaling options. That isn't universally true.
> A large part of the different views on this topic is due to the way people estimate the amount of effort and money saved by pushing some admin duties to the cloud provider instead of handling them yourself. And people come to vastly different conclusions on this aspect
This applies only if you have an extra customer that pays the difference. Basically, the argument only holds if you can't take on more customers because keeping up the infrastructure takes too much time, or you need to hire an extra person who costs more than the AWS bill difference.
> I'm so surprised there is so much pushback against this. AWS is extremely expensive. The use cases for setting up your system or service entirely in AWS are rarer than people seem to realise. Maybe I'm just the old man screaming at cloud (no pun intended), but when did people forget how to run a bare metal server?
Long term yes you can save money rolling your own.
But with cloud you can get something up and running within maybe a few days, sometimes even faster. Often with built in scalability.
This is a much easier sell to the non-tech (i.e., money) people.
If the project continues, the path of least resistance is often to just continue with the cloud solution. At a certain point, there will be so much tech debt that any savings in long-term costs from traditional on-premises, co-location or managed hosting are vastly outweighed by the cost of migration.
> Maybe I'm just the old man screaming at cloud (no pun intended), but when did people forget how to run a bare metal server?
It's a way to "commoditize" engineers. You can run on-premise or mixed infra better and cheaper, but only if you know what you are doing. This requires experienced people and doesn't work with new grads hired by big consultancies and sold as "cloud experts".
Also, when something breaks, you are responsible. If you put it in AWS like everyone else and it breaks, then it's their problem, not yours. We will still implement workarounds and fixes when it happens, but we are not responsible. The basic enterprise rule these days is to always pay someone else to be responsible.
Actually there's nothing new here; it was the same in the pre-cloud era, where enterprises preferred big names (IBM, Microsoft, Oracle, etc.) so they could pass the responsibility to them in case of failures... aka "nobody gets fired for buying IBM".
You can dodge responsibility equally well by outsourcing to people who'll run your bare metal setup for you. We exist at every scale, from small consultancies like mine to huge multinationals.
A lot of people here have built their whole professional careers around knowing AWS and deploying to it.
Moving away is an existential issue for them - this is why there's such pushback. A huge % of new developer and devops generation doesn't know anything about deploying software on bare metal or even other clouds and they're terrified about being unemployed.
Meanwhile, skills in operating systems, networking, and optimization are declining. Every system I've seen in the last 10 years or so has left huge amounts of cash on the table by not being aware of the basics.
I'm on a Platform team of <8 people and only 3 of us (most experienced too) come from sysadmin backgrounds. The rest have only ever known containers/cloud and never touched (both figuratively and literally :-) bare metal servers in their careers.
They've never used tools like Ansible (or Anaconda) or been in situations where they couldn't destroy the container and start afresh instantly.
As the author points out AWS can provide a few things that you wouldn’t want to try and replicate (like CloudFront) but for most other things you’re very much correct. AWS is ultimately very expensive for what it is. The complicated billing that’s full of surprises also makes cost management a head-banging experience.
Fair, though using AWS solely for CloudFront would mean you should compare to Cloudflare, Akamai, Fastly, etc. I'm not sure if the value prop for it looks so great if you don't include the "integrated with your other AWS stuff" benefit.
I work for a small company owned by a huge company. We are entirely independent except for purchasing, IT, and budget approval. We run our CI on AWS, and it’s slow and flaky for a variety of reasons (compiling large c++ projects combined with instance type pressure). It’s also expensive.
We planned a migration from 4 on-demand instances to one on-prem machine, and we guessed we'd save $1000/mo, our builds would be faster, and we'd have fewer failures due to capacity issues. We even had a spare workstation and a rack in the office, so the capex was 0.
I plugged the machine into the rack and no internet connectivity. Put in an IT ticket which took 2 days for a reply, only to be told that this was an unauthorised machine and needed to be imaged by IT. The back and forth took 4 weeks, multiple meetings and multiple approvals. My guess is that 4 people spent probably 10 hours arguing whether we should do this or not.
On AWS I can write a python script and have a running windows instance in 15 minutes.
The same story applies for software. If I want to buy a license of X for someone, I have to go through procurement, and it takes weeks even for <$50 purchases. Yet if it's on the AWS Marketplace, it's pre-approved as long as it doesn't breach the AWS budget.
Working around official IT was certainly a significant factor early on. I'm less convinced it is nearly as big a driver (or a downside depending on your perspective) today.
Especially considering that outside of startups (where approval would be fast with or without cloud), virtual infrastructure also got its own bureaucratic process.
A lot of people forget that, when server virtualization was still gaining momentum in a lot of circles, it wasn't uncommon at less technically savvy customers--say a regional bank at the time--to be told that it might take 2 months to provision a new server.
I don't think anyone is forgetting that in this thread, as there's dozens of answers mentioning this.
But as an example: It took about 3 months to provision an AWS server in a recent company I consulted for due to their own bureaucracy and ineptitude of the Ops team.
On the other hand, when I needed a few CI servers for a startup I worked at, I just collected them from the Apple Store during lunch hour.
Now this above is what people are "forgetting" and don't want to listen to.
There is this belief that it is not extremely expensive and/or that the ops cost of bare metal will outpace it. It is a belief, and it is very rarely supported by facts.
Having done consulting in this space for a decade, and worked with containerised systems since before AWS existed, my experience is that managing an AWS system is consistently more expensive and that in fact the devops cost is part of what makes AWS an expensive option.
The complexity of AWS versus bare metal depends on what you are doing. Setting up an apache app server: just as easy on bare metal. Setting up high availability MySQL with hot failover: much easier on AWS. And a lot of businesses need a highly available database.
If your database has a hardware failure then you could lose all sales and customer data since your last backup, plus the cost of the downtime while you restore. I struggle to think of a business where that is acceptable.
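To be fair to the "much easier on AWS" point, the managed route really is one API call. A sketch with boto3, where the identifiers, sizes and credentials are placeholders:

```python
# Sketch: a MySQL instance with a synchronous standby in another AZ and
# automated backups, in a single API call. All values are placeholders.
import boto3

rds = boto3.client("rds")

rds.create_db_instance(
    DBInstanceIdentifier="app-db",
    Engine="mysql",
    DBInstanceClass="db.m6g.large",
    AllocatedStorage=100,              # GiB
    MasterUsername="admin",
    MasterUserPassword="change-me",    # use Secrets Manager in practice
    MultiAZ=True,                      # synchronous standby + automatic failover
    BackupRetentionPeriod=7,           # daily automated backups, kept a week
    DeletionProtection=True,
)
```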
Why are you ignoring the huge middle ground between "HA with fully automated failover" and "no replication at all"?
Basic async logical replication in MySQL/MariaDB is extremely easy to set up, literally just a few commands to type.
Ditto for doing failover manually the rare times it is needed. Sure, you'll have a few minutes of downtime until a human can respond to the "db is down" alert and initiates failover, but that's tolerable for many small to medium sized businesses with relatively small databases.
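For a concrete idea of what "a few commands" means here, a minimal sketch driven from Python with pymysql; hosts, credentials and the replication user are placeholders, it assumes binary logging is already enabled on the primary and distinct server_ids on each box, and on MySQL 8.0.23+ the statements are spelled CHANGE REPLICATION SOURCE TO / START REPLICA instead:

```python
# Minimal sketch: set up primary -> replica async replication.
import pymysql

PRIMARY = dict(host="db-primary.internal", user="root", password="***")
REPLICA = dict(host="db-replica.internal", user="root", password="***")

# 1. On the primary: create a replication user and note the binlog position.
primary = pymysql.connect(**PRIMARY)
with primary.cursor() as cur:
    cur.execute("CREATE USER IF NOT EXISTS 'repl'@'%' IDENTIFIED BY 'repl-pass'")
    cur.execute("GRANT REPLICATION SLAVE ON *.* TO 'repl'@'%'")
    cur.execute("SHOW MASTER STATUS")      # SHOW BINARY LOG STATUS on MySQL 8.4+
    binlog_file, binlog_pos = cur.fetchone()[:2]
primary.close()

# 2. On the replica: point it at the primary and start replicating.
replica = pymysql.connect(**REPLICA)
with replica.cursor() as cur:
    cur.execute(
        "CHANGE MASTER TO MASTER_HOST=%s, MASTER_USER='repl', "
        "MASTER_PASSWORD='repl-pass', MASTER_LOG_FILE=%s, MASTER_LOG_POS=%s",
        ("db-primary.internal", binlog_file, int(binlog_pos)),
    )
    cur.execute("START SLAVE")
replica.close()
```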
That approach was extremely common ~10-15 years ago, and online businesses didn't have much worse availability than they do today.
I've done quite a few MySQL setups with replication. I would not call the setup "extremely easy", but then, I'm not a full-time DB admin. MySQL upgrades and general troubleshooting are so much more painful than AWS Aurora, where everything just takes a few clicks. And things like blue/green deployment, where you replicate your entire setup to try out a DB upgrade, are really hard to do on-prem.
Without specifics it's hard to respond. But speaking as a software engineer who has been using MySQL for 22 years and learned administrative tasks as-needed over the years, personally I can't relate to anything you are saying here! What part of async replication setup did you find painful? How does Aurora help with troubleshooting? Why use blue/green for upgrade testing when there are much simpler and less expensive approaches using open source tools?
My "Homeserver" with its database running on an old laptop has less downtime than AWS.
I expect most, if not 99%, of all businesses can cope with a hardware failure and the associated downtime while restoring to a different server, judging from the impact of the recent AWS outage and the collective shrug in response. With a proper RAID setup, data loss should be quite rare; if more is required, a primary + secondary setup with a manual failover isn't hard.
A high availability MySQL server on AWS is about the same difficulty as on your own Kubernetes cluster (I've got a toy one on one of those $100 N100 machines, with 16G of memory).
Then, with a database operator installed, you can just provision a MariaDB "kind", i.e. you kubectl apply something specifying the database name, maximum memory, type of high availability (single primary or multi-master) and a secret reference, and there you go: a new database, ready to be plugged into other pods.
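Purely as an illustration (and with the caveat that the API group and spec fields below follow mariadb-operator's CRD from memory, so treat them as assumptions and check the CRD your operator actually installs), the same "apply a MariaDB kind" step driven from the Kubernetes Python client:

```python
# Illustrative only: create a MariaDB custom resource and let the operator
# build the HA database. Group/version and spec field names are assumptions.
from kubernetes import client, config

config.load_kube_config()

mariadb = {
    "apiVersion": "k8s.mariadb.com/v1alpha1",    # assumed operator API group/version
    "kind": "MariaDB",
    "metadata": {"name": "app-db", "namespace": "default"},
    "spec": {                                    # field names approximate
        "rootPasswordSecretKeyRef": {"name": "mariadb-root", "key": "password"},
        "database": "app",
        "replicas": 3,
        "replication": {"enabled": True},        # single-primary HA
        "storage": {"size": "10Gi"},
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="k8s.mariadb.com", version="v1alpha1",
    namespace="default", plural="mariadbs", body=mariadb,
)
```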
Forget? You have to hire people for that. We are a software organization. We build software. If we rent in the cloud, there is less HR hassle - hiring, raises, bonuses, benefits, firing … none of that headache involved with the cloud.
Technically? Totally doable. But the owners prefer renting in the cloud over the people-related issues of hiring.
This is exactly the rhetoric Microsoft used in the 00's with its "Get the facts" marketing campaign against Linux and open source: "Never mind the costs, think about the people hours you are saving!".
It wasn't as simple as that then, and it's still not as simple as that now.
This is true, but also really funny considering that even today the average windows sysadmin can still barely use powershell and relies on console clicking and batch scripts. A good unix admin can easily admin 10-100x the machines as a windows admin, and this was more true back in the early 00s. So the marketing on getting the facts was absolutely false.
In fairness to Microsoft, this argument should have been correct. It ought to be possible for Microsoft to offer products with better polish and better support than open source alternatives, and that ought to more than compensate for any licensing costs. Whether Microsoft actually managed to do this is debatable, but the principle is sound enough.
It sort of was, especially with respect to desktop software. The licensing costs associated with Microsoft Office etc. were probably not really that much compared to the disruption of switching offices full of people, who just wanted to do their job, over to open source alternatives.
This is the fallacy that Amazon sold everyone on: that the cloud has no headache or management needed. This is manifestly untrue. It's also untrue that bare metal takes lots of management time. I have multiple Dell rack servers colocated in several different datacenters, and I don't spend any time at all managing them. They just run.
> This is the fallacy that Amazon sold everyone on
I’ve been working at a place for a long time and we have our own data centers. Recently there has been a push to move to the public cloud, and we were told to go through AWS training. It seems like the first thing AWS does in its training is spend a considerable amount of time selling their model. As an employee who works in infrastructure, hearing Amazon sell so hard that the company doesn’t need me anymore is not exactly inspiring.
After that section they seem to spend a considerable amount of time on how to control costs. These are things no one really thinks about currently, as we manage our own infra. If I want to spin up a VM and write a bunch of data to it, no one really cares. The capacity already exists and is paid for, adding a VM here or there is inconsequential. In AWS I assume we’ll eventually need to have a business justification for every instance we stand up. Some servers I run today have value, but it would be impossible to financially justify in any real terms when running in AWS where everything has a very real cost assigned to it. What I do is too detached from profit generation, and the money we could save is mostly theoretical, until something happens. I don’t know how this will play out, but I’m not excited for it.
The AWS mandatory training I did in the past was 100% marketing of their own solutions, and tests are even designed to make you memorize their entire product line.
The first two levels are not designed for engineers: they're designed for "internal salespeople". Even Product Managers were taking the certification, so they would be able to recommend AWS products to their teams.
Every company I’ve consulted for has hired a team dedicated to just setting up and monitoring AWS for the software devs. Hell, you’d probably reduce headcount running on bare metal.
In more than 15 years of experience, at various companies, the number of people I've met who can build and run an on-premise infrastructure sanely can be counted on the fingers of my right hand.
These people exist, but we have far more stupid "admins" around here.
When you are not in the infrastructure business (I work in retail at the moment), the public cloud is the sane way to go (which is sad, but anyway).
At my last 3 jobs I have spent about 1 day waiting for every 5 days doing stuff; all of them were growing companies that thought they needed the power of the cloud, but they sure as hell were not paying to make it fast or easy to use.
Pay some "devops" folks, then underfund them and give them a mandate of doing all the ops with fewer people, while they also manage the constant churn of AWS services, deal with normal outages, and handle dumb dev things.
Clients that use cloud consistently end up spending more on devops resources, because their setups tend to be vastly more complex and involve more people.
I've worked on both kinds of companies in almost 25 years and I can confirm this is true.
The biggest ops teams I worked alongside were always dedicated to running AWS setups. The slowest too were dedicated to AWS. Proportionally, I mean, of course.
People here are comparing the worst possible version of bare metal with "hosting my startup on AWS".
> The biggest ops teams I worked alongside were always dedicated to running AWS setups. The slowest too were dedicated to AWS.
I wish I could come up with some kind of formalization of this issue. I think it has something to do with communication explosions across multiple people.
Just because AWS abstracted something doesn't mean you don't need people who understand all the quirks of the black box you supposedly don't have to worry about. Guess what: those people are expensive. You also have to deal with a ton of crap like hard per-account resource limits, which on any meaningfully sized project will push complexity up by forcing you to use multiple accounts.
Ultimately these owners hire me to cut their 6-figure AWS bill by 50%. It's mostly rearchitecting mistakes. Amongst them is taking AWS blog propaganda at face value. Those savings could be 80% if they chose managed bare metal (no racking and stacking).
> Forget? You have to hire people for that. We are a software organization. We build software.
You don't need to hire dedicated people full time. It could even be outsourced and then a small contract for maintenance.
It's the same argument you could make about "accounting persons" or "HR persons" - "We are a software organisation!" - Personally I don't buy the argument.
Right, doesn't that include figuring out the right and best way of running it, regardless if it runs on client machines or deployed on servers?
At least I take "software engineering" to mean the full end-to-end process, from "Figure out the right thing to build" to "runs great wherever it's meant to run". I'm not a monkey that builds software on my machine and then hands it off to some deployment engineer who doesn't understand what they're deploying. If I'm building server software, part of my job is ensuring it's deployed in the right environment and runs perfectly there too.
I really dislike the fallacy that just because you're buying something it means that you're not building anything. In practice this is never true: there's always some people-in-your-org time cost of buying something just as much as there's some giving-money-to-other-orgs cost to building something. So often organisations wind up buying something and spending way more time in the process than it would cost for them to build it themselves.
With AWS I think this tradeoff is very weak in most cases: the tasks that you are paying AWS for are relatively cheap in time-of-people-in-your-org, and AWS also takes up a significant amount of that time with new tasks as well. Of the organisations I'm personally aware of, the ones who hosted on-prem spent less money on their compute and had smaller teams managing it, with more effective results than those who were cloud-based (to various degrees of egregiousness, from 'well, I can kinda see how it's worth it because they're growing quickly' to 'holy shit, they're setting money on fire and compromising their product because they can't just buy some used tower PCs and plug them in in a closet in the office').
The cloud is incredibly profitable because of the efficiencies and improvements it's introduced and held onto.
It's easy to push back against what is now the unknown (bare metal), when the layers extending bare metal into cloud services have become better and better, as well as more accessible.
I'm not going to dispute that AWS can be expensive, but in my experience its biggest advantage is SPEED. In every company I worked for that ran their own data centers, every damn thing took FOREVER. New servers took months to buy and rack. Any network change, like a new VLAN, took days to weeks. It was so annoying. But in AWS almost anything is just an API call and a few minutes at most from being enabled. It is so much more productive.
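That "just an API call" really is about this much code; a sketch with boto3 where the AMI, subnet and key name are placeholders:

```python
# Sketch: boot a server in minutes instead of waiting months for racking.
import boto3

ec2 = boto3.client("ec2")

resp = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",   # placeholder AMI
    InstanceType="m6i.large",
    MinCount=1,
    MaxCount=1,
    KeyName="ops-key",                 # placeholder key pair
    SubnetId="subnet-0abc1234",        # placeholder subnet
    TagSpecifications=[{
        "ResourceType": "instance",
        "Tags": [{"Key": "Name", "Value": "adhoc-test"}],
    }],
)
print(resp["Instances"][0]["InstanceId"])
```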
For my org, I don't have the budget for a dedicated in-house opsec team, so if I go on-prem it triggers an additional salary burden for security. How would I overcome this?
You can't. That's the use case FOR AWS/GCP. Once the differential between having an in-house team and the AWS premium becomes positive is when you make the switch.
A lot of the discussion here is that the cost of the in-house team is less than people think.
For instance: at a former gig, we used a service in the EU that handled weekends, holidays and night time issues and escalated to our team as needed. It was pretty cheap, approximately $10K monthly fee for availability and hourly rate when there were any issues to be resolved. There were a few mornings I had an email with a post-mortem report and an invoice for a hundred euros or so. We came pretty close to 5 9's uptime but we didn't have to worry about SLA's or anything.
There is also the fact that the idea that you don't need administrators for AWS is bullshit. Cool idea, bro. Go to your favorite jobs portal. Search for "devops" ... 1000s of jobs. I click on the first link.
Well, well, they have a whole team doing "devops administration" on AWS and require extra people. So not having the money for an in-house team ... no AWS for you.
I've worked for 2 large-ish firms in the past 3 years. One huge telco, one "medium" telco (still 100s of people). BOTH had a team just for AWS IAM administration. Only for that one thing, because that was company-wide (and was regularly demonstrated to be a single point of failure). And they had AWS administrator teams, yes teams, for every department (even HR had one, though in the medium telco all management had a shared team, but the networking and development departments still had their own AWS teams, who, btw, also did IAM. The company-wide IAM team maintained an AWS IAM and some solution they'd bought that also worked for their Windows domain and ticketing system (I hate you IBM Remedy), and equipment ordering portal and ...)
AND there were "devops" positions on every development team, and on the network engineering team, and even a small one for the building "technics" team.
Oh and they both had an internal cluster on top of AWS, part on-premise, part rented DC space, which did at least half the compute work (but presumably a lot less of the weird edge-cases), that one ran the company services that are just insane on AWS like any kind of video.
They sell "you don't need a team"... which is true in your prototype and MVP phase. And you know when you grow you will have an ops team and maybe move out.
But in the very long middle period... you will be supporting clients and SLAs etc, and will end up paying both AWS AND an ops team without even realizing it.
If you don't have budget for someone to handle this for you, you can't afford AWS either, as you still need to handle the same things and they're generally more complex when you use AWS.
Familiarize yourself with your company’s decision process on strategic decisions like this. Ensure you have a way to submit a proposal for a decision on making the change (or find someone who has that access to sponsor your proposal), build a business case that shows cost of opsec team, hardware and everything else is lower than AWS (or if cost is higher then some other business value is gained from making the change — currently digital sovereignty could be a strong argument if you are EU based).
If you cant build a positive business case then its not the correct move. Cash is king. Sadly.
The consequence of running ingress and DNS poorly is downtime.
The consequence of running a database poorly is lost data.
At the end of the day they're all just processes on a machine somewhere, none of it is particularly difficult, but storing, protecting, and traversing state is pretty much _the_ job and I can't really see how you'd think ingress and DNS would be more work than the datastores done right.
Now with AWS, I have a SaaS that makes 6 figures and the AWS bill is <$1000 a month. I'm entirely capable of doing this on-prem, but the vast majority of the bill is S3 state, so what we're actually talking about is me being on-call for an object store and a database, and the potential consequences of doing so.
With all that said, there's definitely a price point and staffing point where I will consider doing that, and I'm pretty down for the whole on-prem movement generally.
I'm generally strongly in favour of bare metal (not so much actually on prem) but your case is one of the rare cases where AWS makes sense. Even for cheap setups like that, bare metal could likely be cheaper even factoring in someone on call to handle issues for you, but the amounts are so small it's a perfectly reasonable choice to just pick whatever you're comfortable with.
That's the sweet spot for AWS customers. Not so much for AWS.
The key thing for AWS is trying to get you locked in by "helping you" depend on services that are hard to replicate elsewhere, so that if your costs grow to a point where moving elsewhere is worth it, it's hard for you to do so.
It’s expensive and the “design” of the services, if you could call it that, is such that you are forced to pay a lot, or play a lot of games to get around it. If you are going to spend your engineering time working around their ridiculous pricing schemes, you might as well spend the money on building things out yourself.
Perfect example - MSK. The brokers are config locked at certain partition counts, even if your CPU is 5%. But their MSK replicator is capped on topic count. So now I have to work around topic counts at the cluster level, and partition counts at the broker level. Neither of which are inherent limits in the underlying technologies (kafka and mirrormaker)
AWS (along with the vast majority of B2B services in the software development industry) is good because it allows you to focus on building your product or business without needing to worry about managing servers nearly as much.
The problems here are no different than using SaaS anywhere else in a business. You could also run all your sales tracking through Excel; it's just that once you have more than a few people doing sales that becomes a major bottleneck, the same way not having easier-to-manage infrastructure does.
In the early days of cloud service providers, they offered a handful of high-value services, all at great prices, making them cost-competitive with bare metal but much easier. That was then.
Things today are different. As cloud service providers have grown to become dominant, they now offer a vast, complicated tangle of services, microservices, control panels, etc., at prices that can spiral out of control if you are not constantly on top of them, making bare metal cheaper for many use cases.
> they offered a handful of high-value services, all at great prices, making them cost-competitive with bare metal but much easier
That was never the case for AWS, the point was never "We're cheap" but "We let you scale faster for a premium".
I first came across cloud services around 2010-2011 I think, when the company I worked at at the time started growing and we needed something better than shared hosting. AWS was brought up as a "fresh but expensive" alternative, and the CTO managed to convince the management that we needed AWS even if it was expensive, because it'd be a lot easier to spin servers up and tear them down as we needed to. Bandwidth costs I think were the most expensive part of the package, at least back then.
When I look at what performance per $ you get with AWS et al today, it looks the same, incredibly expensive for the performance you (don't) get. Better off with dedicated instances unless your team is lacking the basic skills of server management, or until the company has really grown and dealing with the infrastructure keeps being difficult; then hire a dedicated person and let them make the calls for what's next.
I'd agree that AWS never sold on being cheaper, but there is one particular way AWS could be cheaper and that is their approach to billing-by-the-unit with no fixed costs or minimum charges.
Being able to start small from a $1/mth bill without any fixed cost overheads is incredibly powerful for small startups.
If I wanted to store bytes in a DC it would cost $10k/mth by the time I was paying colo/servers/disks before I stored my first byte. Sure there wouldn't be any incremental costs for the second byte but that's a steep jump. S3 would have cost me $0.02. Being able to try technology and prove concepts at the product development stage is very powerful and why AWS became not just a vendor but a _technology partner_ for many companies.
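To make that crossover concrete, here's a rough back-of-the-envelope sketch; the $10k/month colo figure comes from the comment above, while the S3 per-GB rate and the traffic-free assumption are my own approximations:

```python
# Fixed-cost colo vs pay-per-use S3, storage only (ignores egress/requests).
# Prices are illustrative assumptions, not quotes.
COLO_FIXED_MONTHLY = 10_000.0   # colo/servers/disks, roughly flat from byte one
S3_PER_GB_MONTH = 0.023         # assumed S3 standard list price

def monthly_cost(stored_gb: float) -> dict:
    return {
        "colo": COLO_FIXED_MONTHLY,          # same bill at 1 GB or 100 TB
        "s3": stored_gb * S3_PER_GB_MONTH,   # scales linearly from ~zero
    }

for gb in (1, 1_000, 100_000, 1_000_000):
    print(gb, monthly_cost(gb))
# Under these assumptions the crossover lands around ~435 TB stored; below
# that, pay-per-use wins on storage cost alone.
```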
> Being able to start small from a $1/mth bill without any fixed cost overheads is incredibly powerful for small startups.
Yes, no doubt about it. Initially AWS was mostly sold as "You never know when you might want to scale fast, imagine being featured in a newspaper and your servers can't handle the load, you need cloud for that!" to growing startups, and in that context it kind of makes sense, pay extra but at least be online.
But initially when you're small, or later when you're big and established, other things make more sense. But yes, I agree that if you need to aggressively be able to scale up or down, cloud resources make sense to use for that, in addition to your base infrastructure.
But if AWS didn't have that anti-competitive data transfer fee that gets waived if your traffic goes to an internal server, why would you choose S3 vs a white-label storage vendor's similar offering?
> the point was never "We're cheap" but "We let you scale faster for a premium"
Actually, it was more like "Scale faster, easier, more reliably, with proven hardware and software infrastructure, operated by a proven organization, at a price point that is competitive with the investment you'd have to make to get comparable hardware, software, and organizational infrastructure." But that was then. Today, things are different. Cloud services have become giant hairballs of complexity, with plenty of shoot-yourself-in-the-foot-by-default traps, at prices that can quickly spiral out of control if you're not on top of them.
This. When AWS was 10 solid core services it made sense and was exciting. It’s now a bloated mess of 200+ services (many of which almost nobody uses) with all that complexity starting to create headaches and cracks.
AWS needs to stop trying to have a half-arsed solution to every possible use case and instead focus on doing a few basic things really well.
Imo the fact that an "AWS Certified Solutions Architect" is yet another AWS service/thing that is attainable, via an actual exam[0] for $300, is indicative of just how intentionally bloated the entire system has become.
(Real question, not meant to be sarcastic or challenging!) -- What are the challenges in trying to use just the ~10 core services you want/need and ignoring the others? What problems do the others you don't use cause with this use case?
A lot of newer stuff that actually scales (so Lightsail doesn't count) is entangled with "security", "observability" and "network" services. So if you just want to run EC2 + RDS today, you also have to deal with VPC, Subnets, IAM, KMS, CloudWatch, CloudTrail, etc.
Since security and logs are not optional, you have very limited choice.
Having that many required additional services means lots of hidden charges, complexity and problems. And you need a team if you're not doing small-scale stuff.
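As an illustration of how much plumbing even a "just EC2 + RDS" setup drags in, here is a hedged, far-from-exhaustive boto3 sketch; every identifier, CIDR and region below is made up for the example:

```python
import boto3

ec2 = boto3.client("ec2", region_name="eu-west-1")

vpc = ec2.create_vpc(CidrBlock="10.0.0.0/16")["Vpc"]
subnet_a = ec2.create_subnet(VpcId=vpc["VpcId"], CidrBlock="10.0.1.0/24",
                             AvailabilityZone="eu-west-1a")["Subnet"]
subnet_b = ec2.create_subnet(VpcId=vpc["VpcId"], CidrBlock="10.0.2.0/24",
                             AvailabilityZone="eu-west-1b")["Subnet"]  # RDS subnet groups need 2 AZs

sg = ec2.create_security_group(GroupName="app-sg", Description="app tier",
                               VpcId=vpc["VpcId"])
ec2.authorize_security_group_ingress(GroupId=sg["GroupId"], IpProtocol="tcp",
                                     FromPort=443, ToPort=443, CidrIp="0.0.0.0/0")

# ...and that's before IAM roles/instance profiles, KMS keys for encrypted
# volumes, CloudTrail, CloudWatch log groups, route tables and an internet
# or NAT gateway, each of which is its own API call and its own line item.
```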
They used to release new ec2 sizes at the same price as the previous gen which made upgrading a no brainer. That stopped with m7 and doesn’t seem to be coming back.
Not sure what Amazon plans to do when the m6 hardware starts wearing out.
"Embrace, extend, extinguish". It was a Microsoft saying, but it explains Amazon's approach to Linux. Once your customers are skilled in how to do things on your platform, using your specialized products, they won't price-comparison (or compare in any other way) to competing options. Whether those countless other "half-arsed solutions" actually make money is beside the point; as long as the customer has baked at least one into their tech stack, they can't easily leave.
I don’t think I’ve seen a menu as hilariously bad as the AWS dashboard menu. No popup menu should consume the entire screen edge to edge. Just a wall of cryptic service names with ambiguous icons.
Word on the street is that Amazon leadership basically agrees with this and recognizes things have gotten off course. AWS is a small number of things that make money and then a whole bunch of slop and bloat.
AWS was mostly spared from yesterday’s big cuts but have been told to “watch this space” in the new year after re:Invent.
Anytime I have to go into the AWS control panel (which is often) I am immediately overwhelmed with a sense of dread. It's just the most bloated overcomplicated thing I could possibly imagine.
...while on the other side, the "traditional" hosting/colocation providers feel the squeeze and have to offer more competitive prices to stay in business?
(1) Massive expansion of budget (100 - 1000x) to support empire building. Instead of one minimum-wage sysadmin with 2 high-availability, maxed-out servers for $20K - $40K (and 4-hour response time from Dell/HPE), you can have $100M of multi-cloud Kubernetes + Lambda + a mix-and-match of various locked-in cloud services (DB, etc.). And you can have a large army of SRE/DevOps. You get power and influence as a VP of Cloud this and that, with 300 - 1000 people reporting to you.
(2) OpEx instead of CapEx
(3) All leaders are completely clueless about hiring the right people in tech. They hire their incompetent buddies who hire their cronies. Data centers can run at scale with 5-10 good people. However, they hire 3000 horrible, incompetent, and toxic people, and they build lots of paperwork, bureaucracy, and approvals around it. Before AWS, it was VMware's internal cloud that ran most companies. Getting bare metal or a VM would take months to years, and many, many meetings and escalations. With AWS, "here is my credit card, pls gimme 2 VMs" is the biggest feature.
Not really a hardcore infra guy, but on the coding side, I know companies with products that have codebases in the multi million LoC range written over decades, one of my friends interned there and told me they didn't even let him work on the core product for months, they put him on some custom testing framework they had for it, just so he could get familiar enough with the core code to be able to contribute meaningfully.
He told me that before they started doing that, there were incidents like teams writing entire modules they didn't know already existed - now there were 2 pieces of code doing basically the same thing, that were just incompatible enough to not be possible to merge them.
> Our workload is 24/7 steady. We were already at >90% reservation coverage; there was no idle burst capacity to “right size” away. If we had the kind of bursty compute profile many commenters referenced, the choice would be different.
Which TBH applies to many, many places, even if they are not aware of it.
I'd say the core of their success is running everything in a single rack in a single datacenter at first (for months? a year?) and getting lucky. Life is simple when you don't need the costs and effort of reliability upfront.
They mention having a second half-rack in a different DC.
In any case, not everyone needs five nines, and usually it's just much easier to bring down a platform due to some bug in your own software rather than the core infrastructure going down at a rack level.
The point is valid, they mention adding that, so at one point they didn't have that. They're also only storing monitoring & observability data, that's never going to be mission critical for their customers.
It's probably the main reason why they were able to get away with this and why their application does not need scalability. I see they themselves are only offering two 9s of uptime.
I had a problem figuring out why the place I was working wanted to move from in-house to AWS; their workload was easily handled by a few servers, they had no big bursts of traffic, and they didn't need any of the specialized features of AWS.
Eventually, I realized that it was because the devs wanted to put "AWS" on their resumes. I wondered how long it would take management to catch on that they were being used as a place to spruce up your resume before moving on to catch bigger fish.
But not long after, I realized that the management was doing the same thing. "Led a team migration to AWS" looked good on their resume, also, and they also intended to move on/up. Shortly after I left, the place got bought and the building it was in is empty now.
I wonder, now that Amazon is having layoffs and Big Tech generally is not as many people's target employer, will "migrated off of AWS to in-house servers" be what devs (and management) want on their resume?
Devs wanting to put AWS on their resume push for it, then the next wave you hire only knows AWS.
And then discussions on how to move forward are held between people that only know AWS and people who want to use other stuff, but only one side is transparent about it.
Many other points. When the cloud started, they offered great value in adjacent products and services. Scaling was painful, getting bare metal hardware had long lead times, provisioning took time. DCs were not of as high quality, networks weren't as redundant. A lot of these are much less of an issue today.
In 2010 you could only get a 64-core Xeon setup by going to 8 sockets, i.e. a maximum of 8 cores per socket. And that is ignoring NUMA issues. Today you can get 256 cores per socket that are at least twice as fast per core. What used to be 64 servers could now be fitted into 1. And by 2030, it would be closer to a 100 to 1 ratio. Not to mention software on servers has gotten a lot faster compared to 2010: PHP, Python, Ruby, Java, ASP or even Perl. If we added up everything I wouldn't be surprised if we are at a 200 or 300 to 1 ratio compared to 2010.
I am pretty sure there is some version of Oxide in the pipeline that will catch up to the latest Zen CPU cores. If a server isn't enough, a few Oxide racks should fit 99% of Internet companies' usage.
> Cloud makes sense when elasticity matters; bare metal wins when baseload dominates.
This really is the crux of the matter in my opinion, at least for applications (databases and so on is in my opinion more nuanced). I've only worked at one place where using cloud functions made sense (keeping it somewhat vague here): data ingestion from stations that could be EXTREMELY bursty. Usually we got data from the stations at roughly midnight every day, nothing a regular server couldn't handle, but occasionally a station would come back online after weeks or new stations got connected etc which produced incredible load for a very short amount of time when we fetched, parsed and handled each packet. Instead of queuing things for ages we could instead just horizontally scale it out to handle the pressure.
FD: I work at Amazon, I also started my career in a time where I had to submit paper requests for servers that had turn around times measured in months.
I just don't see it. Given the nature of the services they offer it's just too risky not to use as much managed stuff with SLAs as possible. k8s alone is a very complicated control plane + a freaking database that is hard to keep happy if it's not completely static. In a prior life I went very deep on k8s, including self managing clusters and it's just too fragile, I literally had to contribute patches to etcd and I'm not a db engineer. I kept reading the post and seeing future failure point after future failure point.
The other aspect is there doesn't seem to be an honest assessment of the tradeoffs. It's all peaches and cream, no downsides, no tradeoffs, no risk assessment etc.
The article’s deployment has a spare rack in a second DC and they do a monthly cutover to AWS in case the colo provider has a two site issue.
Spending time on that would make me sleep much better than hardening a deployment of etcd running inside a single point of failure.
What other problems do you see with the article? (Their monthly time estimates seem too low to me - they’re all 10x better than I’ve seen for well-run public cloud infrastructure that is comparable to their setup).
Managing a complex environment is hard, no matter whether that’s deployed on AWS or on prem. You always need skilled workers. On one platform you need k8s experts. On the other platform you need AWS experts. Let’s not pretend like AWS is a simple one-click fire and forget solution.
And let’s be very real here: if your cloud service goes down for a few hours because you screwed something up, or because AWS deployed some bad DNS rules again, the world moves on. At the end of the day, nobody gives a shit.
I agree that a business should use Kubernetes only if there is a clear need for that level of infrastructure automation. It's a time and money mistake to use K8s by default.
Many startups and companies couldn't exist if there was only AWS (or GCP / Azure) due to how much they overcharge.
For example, we couldn't offer free GeoIP downloads[0] if we were charged the outrageous $0.09 / GB, and the same is true for companies serving AI models or game assets.
But what makes me almost sick is how slow the cloud is. From network-attached disks to overcrowded CPUs, everything is so slooooow.
My experience is that the cloud is a good thing between $0-10,000/month. But you should seriously consider renting bare-metal servers or owning your own after that. You can "over-provision" as much as you want when you get 10-20x (real numbers) the performance for 25% of the price.
I’ve seen cloud slowness create weird Stockholm syndrome effects, especially around disk latency.
It always makes sense to compare to back of the envelope bare metal numbers before rearchitecting your stack to work around some dumb cloud performance issue.
> Equinix Metal got the closest, but bare metal on-demand still carried a 25-30% premium over our CapEx plan. Their global footprint is tempting; we may still use them for short-lived expansion.
> The Equinix Metal service will be sunset on June 30, 2026.
I put our company onto a hybrid AWS-colocation setup to attempt to get the best of both worlds. We have cheap fiddly/bursty things and expensive stable things and nothing in between. Obviously, put the fiddly/bursty things in AWS and put the stable things in colocation. Direct Connect keeps latency and egress costs down; we are 1 millisecond away from us-east-1 and for egress we pay 2¢/GB instead of the regular 9¢/GB. The database is on the colo side so database-to-AWS reads are all free ingress instead of egress, and database-to-server traffic on the colo side doesn't transit to AWS at all. The savings on the HA pair of SQL Server instances is shocking and pays for the entire colo setup, and then some. I'm surprised hybrids are not more common. We are able to manage it with our existing (small) staff, and in absolute terms we don't spend much time on it--that was the point of putting the fiddly stuff in AWS.
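For anyone curious what that 2¢ vs 9¢ difference looks like in practice, here is a small illustrative sketch; the monthly volume and the fixed Direct Connect port cost are assumptions of mine, not figures from the comment:

```python
# Rough egress math for the hybrid setup described above.
EGRESS_INTERNET_PER_GB = 0.09   # standard AWS internet egress (approx.)
EGRESS_DX_PER_GB = 0.02         # Direct Connect data transfer out (approx.)
DX_PORT_MONTHLY = 300.0         # assumed fixed port / cross-connect cost

def monthly_egress(gb_out: float) -> dict:
    return {
        "all-AWS, internet egress": gb_out * EGRESS_INTERNET_PER_GB,
        "hybrid via Direct Connect": gb_out * EGRESS_DX_PER_GB + DX_PORT_MONTHLY,
    }

print(monthly_egress(50_000))   # e.g. 50 TB/month leaving AWS
# {'all-AWS, internet egress': 4500.0, 'hybrid via Direct Connect': 1300.0}
# Database reads pulled *into* AWS from the colo side are free ingress,
# which is where the comment says the real savings show up.
```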
The biggest downside I see? We had to sign a 3 year contract with the colocation facility up front, and any time we want to change something they want a new commitment. On AWS you don't commit to spending until after you've got it working, and even then it's your choice.
Small team in a large company who has an enterprise agreement (discount) with a cloud provider? The cloud can be very empowering, in that teams who own their infra in the cloud can make changes that benefit the product in a fraction of the time it would take to work those changes through the org on prem. This depends on having a team that has enough of an understanding of database, network and systems administration to own their infrastructure. If you have more than one team like this, it also pays to have a central cloud enablement team who provides common config and controls to make sure teams have room to work without accidentally overrunning a budget or creating a potential security vulnerability.
Startup who wants to be able to scale? You can start in the cloud without tying yourself to the cloud or a provider if you are really careful. Or, at least design your system architecture in such a way that you can migrate in the future if/when it makes sense.
Edit: For clarity, wikipedia does also have pages with other meanings of "bare metal", including "bare metal server". The above link is what you get if you just look up "bare metal".
I do aim to be some combination of clear, accurate and succinct, but I very often seem to end up in these HN pissing matches so I suppose I'm doing something wrong. Possibly the mistake is just commenting on HN in itself.
Seems there is a difference between "Bare Metal" and "Bare Machine".
I'm not sure what you did, but when you go to that Wikipedia article, it redirects to "Bare Machine", and the article's content is about "Bare Machine". Clicking the link you have sends you to https://en.wikipedia.org/wiki/Bare_machine
So it seems like you almost intentionally shared the article that redirects, instead of linking to the proper page?
There is nothing that needs fixing? Both my link and yours give the same "primary" definition for "bare metal". Which is not unequivocally the correct definition, but it's the one I and the person I was replying to favour.
I thought my link made the point a bit better. I think maybe you've misunderstood something about how Wikipedia works, or about what I'm saying, or something. Which is OK, but maybe you could try to be a bit more polite about it? Or charitable, to use your own word?
Edit: In case this part isn't obvious, Wikipedia redirects are managed by Wikipedia editors, just like the rest of Wikipedia. Where the redirect goes is as much an indication of the collective will of Wikipedia editors as eg. a disambiguation page. I don't decide where a request for the "bare metal" page goes, that's Wikipedia.
In similar way I once worked on a financial system, where a COBOL-powered mainframe was referred to as "Backend", and all other systems around it written in C++, Java, .NET, etc. since early 80s - as "Frontend".
Had somewhat similar experience, the first "frontend" I worked on was a sort of proxy server that sat in front of a database basically, meant as a barrier for other applications to communicate via. At one point we called the client side web application "frontend-frontend" as it was the frontend for the frontend.
I don't work in firmware at all, but I'm working next to a team now migrating an application from VMs to K8S, and they refer to the VMs as "bare metal" which I find slightly cringeworthy - but hey, whatever language works to communicate an idea.
I'm not sure I've ever heard bare metal used to refer to virtualized instances. (There were debates around Type 1 and Type 2 (hosted) hypervisors at one point, but I haven't heard that come up in years.)
Several years off AWS, the only thing I still prefer AWS for is SES, otherwise Cloudflare has the more cost effective managed services. For everything else we use Hetzner US Cloud VMs for hosting all App Servers and Server Software.
Our .NET Apps are still deployed as Docker Compose Apps which we use GitHub Actions and Kamal [1] to deploy. Most Apps use SQLite + Litestream with real-time replication to R2, but have switched to a local PostgreSQL for our Latest App with regular backups to R2.
Thanks to AI that can walk you through any hurdle and create whatever deployment, backup and automation scripts you need, it's never been easier to self-host.
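As a purely hypothetical example of the "regular backups to R2" part (not the poster's actual tooling), a minimal Python sketch that dumps PostgreSQL and pushes the file to R2 over its S3-compatible API; the account ID, bucket name and credentials are placeholders:

```python
import subprocess, datetime, boto3

STAMP = datetime.datetime.utcnow().strftime("%Y%m%dT%H%M%SZ")
DUMP = f"/tmp/app-{STAMP}.dump"

# Custom-format dump so pg_restore can do selective/parallel restores.
subprocess.run(["pg_dump", "-Fc", "-f", DUMP, "appdb"], check=True)

# R2 speaks the S3 API, so boto3 works with a custom endpoint.
r2 = boto3.client(
    "s3",
    endpoint_url="https://<account-id>.r2.cloudflarestorage.com",  # placeholder
    aws_access_key_id="<r2-access-key>",
    aws_secret_access_key="<r2-secret-key>",
)
r2.upload_file(DUMP, "db-backups", f"appdb/{DUMP.rsplit('/', 1)[-1]}")
```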
>We're now moving to Talos. We PXE boot with Tinkerbell, image with Talos, manage configs through Flux and Terraform, and run conformance suites before each Kubernetes upgrade.
Gee, how hard is it to find SE experts in that particular combination of available ops tools? While in AWS every AWS certified engineer would speak the same language, the DIY approach surely suffers from the lack of "one way" to do things. Change Flux for Argo, for example (assuming the post is talking about that Flux and not another tool with the same name), and you have an almost completely different gitops workflow. How do they manage to settle on a specific set of tools?
What people forget about the OVH or Hetzner comparison is that the entry servers they are known for - think the Advance line with OVH or the AX line with Hetzner - come with some drawbacks.
The OVH Advance line for example comes without ECC memory, in a server, that might host databases. It's a disaster waiting to happen. There is no option to add ECC memory with the Advance line, so you have to use Scale or High Grade servers, which are far from "affordable".
Hetzner by default comes with a single PSU and a single uplink. Yes, if nothing happens this is probably fine, but if you need a reliable private network or 10G this will cost extra.
I can't believe how affordable Hetzner is. I just rented a bare metal 48-core AMD EPYC 9454P with 256 GB of RAM and two 2 TB NVMe SSDs for $200/month (or $0.37 per hour). It's hard to directly compare with AWS, but I think it's about 10x cheaper.
Their current Advance offerings use AMD EPYC 4004 with on-die ECC. I can't figure out if it's "real" single-error-correction, double-error-detection, or whether the data lines between the processor and the DIMMs are protected or not, though.
Yes, but there are options for dedicated server providers who offer dual PSUs and ECC RAM etc. It's more expensive though; e.g. a 24-core EPYC with 384GB RAM and dual 10G network is like $500/month (though there are smaller servers on serversearcher.com for other examples)
These concerns are exaggerated. I've been running on Hetzner, OVH and friends for 20 years. During that time I've had only two issues, one about 15 years ago when a PSU failed on one of the servers, and another a few years ago when an OVH data center caught fire and one of the servers went down. There have been no other hardware issues. YMMV.
They matter at scale, where 1% issues end up happening on a daily or weekly basis.
For a startup with one rack in each of two data centers, it’s probably fine. You’ll end up testing failover a bit more, but you’ll need that if you scale anyway.
If it’s for some back office thing that will never have any load, and must not permanently fail (eg payroll), maybe just slap it on an EC2 VM and enable off-site backup / ransomware protection.
I'm pretty sure they keep internal checksums at various points to make sure the data on disk is intact - so does the filesystem. I think they can catch when memory corruption occurs, and can roll back to a consistent state (you still get some data loss).
But imo, systems like these (like the ones handling bank transaction), should have a degree of resiliency to this kind of failure, as any hw or sw problem can cause something similar.
Doesn't make me want to be a Equinix customer when they just randomly shut down critical hosting services.
I'm pretty sure that it's just the post-merger name for Packet which was an incredible provider that even had BYO IP with an anycast community. Really a shame that it went away, it was a solid alternative to both AWS and bare metal and prices were pretty good.
There's a missing middle between ultra expensive/weird cloud and cheap junk servers that I would really love to see get filled.
I have seen multiple startups paying thousands of dollars a month in AWS bills to run a tiny service which could trivially run on an $800 desktop on a residential internet connection. It's absolutely tragic.
That’s like $24K a year. Assuming they have working failover and business continuity plans, it’s actually a really good deal (vs having a 10-20% time employee deal with it).
Curious to know how's the development experience been post-migration?
Was there additional friction due to lack of tooling in on-prem that would otherwise available in the cloud env for example?
They were running for a long time (months? over a year?) on a single rack in a single datacenter. Eventually they scaled out but the word is eventually. I think that summarizes both sides of this debate in a nutshell. You can move off of AWS but unless you invest a lot you will take on increased risk. Maybe you'll get lucky and your one rack won't burn down. Maybe you won't. They did get lucky.
Recently I learned that orgs these days want to show software and infrastructure spend as capex, since they can show it as a depreciating asset for tax purposes.
I understand that with AWS you cannot do that, as it is often seen as opex.
I guess that's a good enough motivation to move out of AWS at scale.
Talos is great until it's not. We ran into Ceph IO speed bottlenecks and found it was impossible to debug ("talosctl cgroups --preset=io" is a mess) because the devs didn't want to add an SSH escape hatch into their black box OS. Our Talos nodes would also randomly become unhealthy and you have no way of knowing why. Switched to PXE-booted Alpine Linux with vanilla k8s, and we had a much more stable experience with no surprises, and the ability to SSH whenever we want has been hugely helpful.
The thing I find counter intuitive about AWS and hyper-scalers in general is, they make so much sense when you are starting out a new project. A few VMs, some gigs of data storage, you are off to the races in a day or two.
As soon as you start talking about any kind of serious data storage and data transfer the costs start piling up like crazy.
Like in my mind, the cost curve should flatten out over time. But that just doesn't seem to be the reality.
Ok so this may be a dumb question... but how do you handle ISP outages due to storms and stuff with on-prem solutions? I'd imagine large datacenters have much more sophisticated and reliable internet connections than say an Xfinity business customer, but maybe that's wrong.
Much more sophisticated and reliable than Xfinity.
Good datacenters have redundant and physically separated power and communication from different providers.
Also, in case something catastrophic happens at one datacenter, the author mentions they are peered to another datacenter in a different country, as another layer of redundancy. Cloudflare handles their ingress, so such a catastrophic event wouldn't likely be noticed by their customers.
Never heard of Talos before now. That looks pretty cool and I might start playing with that on my home lab. Can't use it at work for reasons, but good to keep on top of tech (even if I am a little behind)
This dude did a complete walkthrough setting up a Talos cluster on bare metal: https://datavirke.dk/posts/bare-metal-kubernetes-part-1-talo... It's a nice read. I have my own Talos cluster running in my homelab now for over a year with similar stuff (but no Ceph).
Did you read the article ? The main point of this and the prior article is that YES colocation/baremetal IS a better option for this company (and I would argue the majority of AWS users)
I love the argument that Managed DBs cost a lot, but they're supposedly safer. Meanwhile people can't figure out the IAM permission models so they give the entire world access with root:root.
Worth checking out the different server hosts. You can get a cheap OVH server with 64GB of RAM, 4-6 cores and 2TB of disk space for $30, or better servers for $70 with 1-2 Gbps bandwidth.
Setting up a DB isn't hard, using an LLM to ask questions will guide you to the right places. I'm always talking with Gemini because I switched from Ubuntu to Fedora 42 server and things are slightly different here and there.
But, different server hosts offer DB-ready OS's so all you have to do is load the OS on the server and you'll be ready to go.
The joy of Linux is getting everything _just right_ and so much _just right_ that you can launch a second server and set it up that way _just right_ within minutes.
This is a tech company and it’s adjacent to their core competency. Most companies wouldn’t know MicroK8s from a brand of cereal, they’d only create a mess if they tried this themselves.
One thing I can say definitively, as someone who is definitely not an AI zealot (more of an AI pragmatist): GPT language models have reduced the barrier of running your own bare metal server. AWS salesfolk have long often used the boogeyman of the costs (opportunity, actual, maintenance) of running your own server as the reason you should pick AWS (not realizing you are trading one set of boogeymen for another), but AI has reduced a lot of that burden.
I really like how people throw around these baseless accusations.
S3 is one of the cheapest storage solutions ever created. The last 10 years I have migrated roughly 10-20PB worth of data to AWS S3 and it resulted in significant cost saving every single time.
If you do not know how to use cloud computing then yes, AWS can be really expensive.
Assuming those 20PB are hot/warm storage, S3 costs roughly $0.015/GB/month (50:50 average of S3 standard/infrequent access). That comes out to roughly $3.6M/year, before taking into account egress/retrieval costs. Does it really cost that much to maintain your own 20PB storage cluster?
If those 20PB are deep archive, the S3 Glacier bill comes out to around $235k/year, which also seems ludicrous: it does not cost six figures a year to maintain your own tape archive. That's the equivalent of a full-time sysadmin (~$150k/year) plus $100k in hardware amortization/overhead.
The real advantage of S3 here is flexibility and ease-of-use. It's trivial to migrate objects between storage classes, and trivial to get efficient access to any S3 object anywhere in the world. Avoiding the headache of rolling this functionality yourself could well be worth $3.6M/year, but if this flexibility is not necessary, I doubt S3 is cheaper in any sense of the word.
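The arithmetic above, spelled out (approximate list prices; 1 PB treated as 1e6 GB for simplicity):

```python
PB = 1_000_000  # GB
data_gb = 20 * PB

s3_hot_per_gb_month = 0.015          # blended standard / infrequent access
glacier_deep_per_gb_month = 0.00099  # deep archive tier

print("hot/warm      $/yr:", data_gb * s3_hot_per_gb_month * 12)       # ~3.6M
print("deep archive  $/yr:", data_gb * glacier_deep_per_gb_month * 12) # ~238k
```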
Like most of AWS, it depends if you need what it provides. A 20PB tape system will have an initial cost in the low to mid 6 figures for the hardware and initial set of tapes. Do the copies need to be replicated geographically? What about completely offline copies? Reminds me of conversations with archivists where there's preservation and then there's real preservation.
How the heck does anyone have that much data? I once built myself a compressed plaintext library from one of those data-hoarder sources that had almost every fiction book in existence, and that was like 4TB compressed (but would've been much less if I bothered hunting for duplicates and dropped non-English).
I suspect the only way you could have 20PB is if you have metrics you don't aggregate or keep ancient logs (why do you need to know your auth service had a transient timeout a year ago?)
Lots of things can get to that much data, especially in aggregate. Off the top of my head: video/image hosting, scientific applications (genomics, high energy physics, the latter of which can generate PBs of data in a single experiment), finance (granular historic market/order data), etc.
In addition to what others have mentioned, before the "AI bubble", there was a "data science bubble" where every little signal about your users/everything had to be saved so that it could be analyzed later.
The implicit claims are more misleading, in my opinion: The claim that self-hosting is free or nearly free in terms of time and engineering brain drain.
The real cost of self-hosting, in my direct experience with multiple startup teams trying it, is the endless small tasks, decisions, debates, and little changes that add up over time to more overhead than anyone would have expected. Everyone thinks it's going to be as simple as having the colo put the boxes in the rack and then doing some SSH stuff, and then you're free of those AWS bills. In my experience it's a Pandora's box of "one more thing" changes and overhauls that keep draining the team after the honeymoon period is over.
If you’re a stable business with engineers sitting idle that could be the right choice. For most startups who just need to get a product out there and get customers, pulling limited headcount away from the core product to save pennies (relatively speaking) on a potential AWS bill can be a trap.
Running EKS on AWS was their problem. If they didn't run EKS on AWS, they would've had a considerably simpler setup running Amazon Linux, not having to upgrade Kubernetes every 3 quarters, managing network security using security groups instead of having open internal networking, and running in a single AZ would've eliminated cross-AZ data transfer costs. In large regions like us-east-1, an individual AZ is actually internally striped for extra redundancy, and you are much more likely to experience regional downtime than single-AZ downtime, especially if you have a stable workload and do not rely on tech beyond rock-solid basics (EC2, VPC, ELB, S3, EBS). If you're willing to operate a single bare metal rack in a DC, you should be willing to run in a single AWS AZ.
I don't know how much time they spend configuring/dealing with Kubernetes, but I bet it's a large chunk of the 24 engineer-hours per quarter. But this is not a required expense: "EKS had an extra $1,260/month control-plane fee". Running EKS adds a massive IAM policy maintenance overhead, whereas a non-EKS (EC2 w/ golden AMIs) setup results in drastically simpler IAM policies.
NAT gateways are ~$50 a month, plus data transfer. Setting up a gateway VPC endpoint to S3 will avoid having to pay transfer charges to S3.
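A hedged sketch of what that gateway endpoint looks like via boto3; the VPC and route table IDs are placeholders. Gateway endpoints to S3 carry no hourly or per-GB charge, so S3 traffic stops flowing through the NAT gateway's meter:

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
ec2.create_vpc_endpoint(
    VpcEndpointType="Gateway",
    VpcId="vpc-0123456789abcdef0",             # placeholder
    ServiceName="com.amazonaws.us-east-1.s3",
    RouteTableIds=["rtb-0123456789abcdef0"],   # placeholder
)
```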
They were already at >90% reservation coverage, and running stable workloads on reservations is something that AWS excels at. A reservation also means that you will be able to terminate and re-launch instances even when there's a spike in demand from other users: your instance capacity is guaranteed.
Running the basics on VMs also effectively avoids vendor lock-in. Every cloud provider supports VMs with a RedHat clone, VPCs, load balancing, networked storage, access controls, object storage and a fixed size fleet with auto-relaunch on instance failure.
With a consistent workload, they would have very likely escaped the downtime from AWS a week ago as well, because, as per AWS, "existing EC2 instances that had been launched prior to the start of the event remained healthy and did not experience any impact for the duration of the event".
With Terraform and automation for building launchable images, you can stand up a cluster quickly in any region with secure networking, including in a separate AWS account, in the same region, for the sake of testing.
With AWS, you can set up automatic EBS backups of all your data to snapshots trivially, and even send them to a 3rd locked-down account, so they can't be accidentally wiped.
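As a rough illustration of the snapshot-sharing idea (the managed route would be Data Lifecycle Manager or AWS Backup with cross-account copy; the volume and account IDs below are placeholders):

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

snap = ec2.create_snapshot(VolumeId="vol-0123456789abcdef0",
                           Description="nightly app data")
ec2.modify_snapshot_attribute(
    SnapshotId=snap["SnapshotId"],
    Attribute="createVolumePermission",
    OperationType="add",
    UserIds=["111122223333"],        # the locked-down backup account
)
# The backup account then copies the shared snapshot into its own account
# (ec2.copy_snapshot), so a compromise of the source account can't wipe it.
# Encrypted snapshots additionally need the KMS key shared.
```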
AWS is extremely expensive, and I think I have to agree with DHH's assessment that many developers are afraid of computers. AWS is taking advantage of that fear of actually just setting up linux and configuring a computer.
However, to steelman AWS use: many businesses are STILL running mainframes. Many run terrible setups like Access as a production database. In 2025 there are large companies with no CICD platforms or IaC, and some companies where even VC is still a new concept or a dark art. So not every company is in the position to actually hire competent system administrators and system engineers to set up some bare metal machines and configure Ceph, much less Hadoop or Kubernetes. So AWS lets these companies just buy these capabilities while forcing the software stack to modernize.
I worked at a company like this, I was an intern with wide eyes seeing the migration to git via bitbucket in the year ... 2018? What a sight to see.
That company had its own data center, tape archives, etc. It had been running largely the same way continuously since the 90s. When I left for a better job, the company had split into two camps. The old curmudgeonly on-prem activists and the over-optimistic cloud native AWS/GCP certified evangelist with no real experience in the cloud (because they worked at a company with no cloud presence). I'm humble enough to admit that I was part of the second camp and I didn't know shit, I was cargo culting.
This migration is still not complete as far as I'm aware. Hopefully the teams that resisted this long and never left for the cloud get to settle in for another decade of on-prem superiority lol.
I was a at a company that was doing their SVN/Jenkins migration to Git/Bitbucket/Bamboo around 2016/2018. But they were using source control and a build system already, so you have to hand it to them. But I have an associate that was at one of the large health insurance companies in 2024, complaining that he couldn't get them to use git and stop deploying via FTP to a server. There is danger with being too much on the cargo cult side, but also danger with being too resistant to change. I don't know how you can look at source control, a CICD pipeline, artifacts, IaC, and say "This looks like a bad idea".
Microk8s has common, catastrophic performance bugs. There are also catastrophic problems with microk8s Ceph addons. So is this post true? Microk8s, for people who know stuff, is a canary for clusters / applications that don’t really work.
There is so much hidden cost in maintaining your own bare metal infrastructure. I am always astounded by how people overlook the massive opportunity cost involved in not only setting up, securing, and maintaining your bare metal infrastructure, but also making it state of the art, including best practices, making sure you have the required uptime, and monitoring and intervening if necessary. I work in a highly regulated market with 700 coworkers, and our IT maintains an endless number of VMs. You cannot imagine how much more work they have to do compared to a setup where you spin up services in AWS or Azure and destroy them when you don't need them. No updates, no patches. No misconfiguration. Not every company uses automation either (Chef, Ansible and whatnot).
I agree, I have a restaurant POS system and I think self-hosting would easily kill the product velocity, and if we screw up bad, even the company.
However, I do get the point about cost-premium and more importantly vendor-risk that's paid when using managed services.
We are hosted on Cloudflare Workers, which is very cheap, but to mitigate the vendor risk we have also set up replicas of our API servers on bunny.net and render.com.
This is a completely meaningless article if they don't provide information about their technical stack, which AWS services they used to use, what TPS they are hitting, what storage size they're using, etc.
The story will be different for every business because every business has different needs.
Given the answer to "How much did migration and ongoing ops really cost?" it seems like they had an incredibly simple infrastructure on AWS, and it was really easy to move out. If you use a wider range of services, the cost savings are much more likely to cancel out.
Sounds like they did the right thing for their business model.
I think as AWS grows and changes the curve of the target audience is changing too. The value proposition is "You can get Cloud service without having a dedicated Cloud team," but there are caveats:
- AWS is complicated enough that you will still need a team to integrate against it. The abstractions are not free and the ones that are leaky will bite you without dedicated systems engineers to specialize in making it work with your company's goals.
- For small companies with little compute need, AWS is a good option. Beyond a certain scale... It is worth noting that big companies build their own datacenters, they don't rely on someone else's Cloud. Amazon, Google, and Microsoft don't run on each other.
- Recently, the cost model has likely changed if a company pokes its head up and runs the numbers: there are, uh, quite a few engineers with deep knowledge of how to build scalable cloud infrastructure available to hire now, for some reason. In fact, a savvy company keeping its ear to the ground can probably snap up some high-tier talent very soon (https://www.reuters.com/business/world-at-work/amazon-target...).
It really depends on where your company's risk and cost models are. Running on someone else's cloud just isn't the only option.
I really dislike how this industry oscillates between various states of epiphany that things that are overcomplicated and expensive are overcomplicated and expensive. As an industry, we must look like utter clowns to the world. It's really sad that saying "own or control your own servers" seems to be a sword in the stone moment for far more people than it should. Things that used to be a "duh" are now a "wow" and it's deeply unsettling to watch.
For smaller operations I'd still go with a rent-a-server model with AWS. There is a critical mass though where rolling your own makes sense.
The long-term model in the market is shifting much more towards buying services vs renting infrastructure. It's here where the AWS case falls apart, with folks now buying PlanetScale vs RDS, buying Databricks over the mess that AWS offers for data lakes, and working with model providers directly vs the headaches of Bedrock. The real long-term threat is that AWS continues to whiff on all the other stuff and gets reduced to a boring rent-a-server shop that market forces will drive to be very low margin.
Yes a lot of those 3rd party services will run on AWS but the future looks like folks renting servers from AWS at 7% gross margin and selling their value-add service on top at 60% gross margin.
This doesn't really explain why you wouldn't just go with Hetzner. I don't have much experience with either, but if you know how to set up your infra then Hetzner seems like a no-brainer? I do not want to be tied to AWS where I have no idea what my bill will be.
Depending on the use case you very much could just use Hetzner. A simpler and more transparent customer experience than trying to navigate the mass complexity of AWS for basic stuff.
With AI making it possible to use natural language to modify code, bare metal can make things easier to use with your own code and customization. Abstractions tend to be harder to reason about and have more limited functionality in exchange for being easier to get started on some standard setup.
> Did you know there is a setting in MS SQL Server that impacts performance by an order of magnitude when sending/receiving data from the Cloud to your on-premises servers? It's the default in the ORM generated settings.
Sounds interesting, which setting is that?
Would love to know as well.
We run regular FinOps meetings within departments, so everyone’s aware. I think everyone should. But it’s a lot of overhead of course. So a dev is concerned not only with DevOps anymore but with DevSecFinOps. Not everyone can cope with so many aspects at once. There’s a lot of complexity creep in that.
Yeah, AWS has the billing panel, that's where I usually discover that after I make a rough estimate on how much the thing I'm building should cost by studying the relevant tables, I end up with stuff costing twice as much, because on top of the expected items there's always a ton of miscellaneous stuff I never thought about.
I have Claude, ChatGPT, and Gemini analyze our AWS bills and usage metrics once a month and they are surprisingly good at finding savings.
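If you want to try something similar, a minimal sketch of pulling last month's per-service spend from Cost Explorer, so the model gets numbers rather than screenshots; the dates are examples:

```python
import boto3, json

ce = boto3.client("ce", region_name="us-east-1")  # Cost Explorer API lives in us-east-1
resp = ce.get_cost_and_usage(
    TimePeriod={"Start": "2025-10-01", "End": "2025-11-01"},
    Granularity="MONTHLY",
    Metrics=["UnblendedCost"],
    GroupBy=[{"Type": "DIMENSION", "Key": "SERVICE"}],
)
per_service = {
    g["Keys"][0]: float(g["Metrics"]["UnblendedCost"]["Amount"])
    for g in resp["ResultsByTime"][0]["Groups"]
}
print(json.dumps(per_service, indent=2))  # feed this to the model, not screenshots
```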
I was about to rage at you over the first sentence, because this is so often how people start trying to argue bare metal setups are expensive. But after reading the rest: 100% this. I see so many people push AWS setups not because it's the best thing - it can be if you're not cost sensitive - but because it is what they know and they push what they know instead of evaluating the actual requirements.
The weird thing is I'm old enough to have grown up in the pre-cloud world, and most of the stuff, like file servers, proxies, dbs, etc. isn't any more difficult to set up than AWS stuff, it's just that the skills are different
Also there's a mindset difference - if I gave you a server with 32 cores you wouldn't design a microservice system on it, would you? After all there's nowhere to scale to.
But with AWS, you're sold the story of infinite compute you can just expect to be there, but you'll quickly find out just how stingy they can get with giving you more hardware automatically to scale to.
I don't dislike AWS, but I feel this promise of false abundance has driven the growth in complexity and resource use of the backend.
Reality tends to be you hit a bottleneck you have a hard time optimizing away - the more complex your architecture, the harder it is, then you can stew.
Well, they aren't wrong about the bare metal either: Every organization ends up tied to their staff, and said staff was hired to work on the stack you are using. People end up in quite the fights because their supposed experts are more fond of uniformity and learning nothing new.
Many a company was stuck with a datacenter unit that was unresponsive to the company's needs, and people migrated to AWS to avoid dealing with them. This straight out happened in front of my eyes multiple times. At the same time, you also end up in AWS, or even within AWS, using tools that are extremely expensive, because the cost-benefit analysis for the individuals making the decision, who often don't know very much other than what they use right now, are just wrong for the company. The executive on top is often either not much of a technologist or 20 years out of date, so they have no way to discern the quality of their staff. Technical disagreements? They might only know who they like to hang out with, but that's where it ends.
So for path-dependent reasons, companies end up making a lot of decisions that in retrospect seem very poor. In startups it often just kills the company. Just don't assume the error is always in one direction.
Sure but I have seen the exact same thing happen with AWS.
In a large company I worked at, the Ops team that had the keys to AWS was taking literal months to push things to the cloud, causing problems with bonuses and promotions. Security measures were not in place, so there were cyberattacks. Passwords of critical services lapsed because they were not paying attention.
At some point it got so bad that the entire team was demoted, lost privileges, and contractors had to jump in. The CTO was almost fired.
It took months to recover and even to get to an acceptable state, because nothing was really documented.
It's simple enough to hire people with experience with both, or pay someone else to do it for you. These skills aren't that hard to find.
If you hire people that are not responsive to your needs, then, sure, that is a problem that will be a problem irrespective of what their pet stack is.
> Many a company was stuck with a datacenter unit that was unresponsive to the company's needs
I'd like to +1 here - it's an understated risk if you've got datacenter-scale workloads. But! You can host a lot of compute on a couple racks nowadays, so IMHO it's a problem only if you're too successful and get complacent. In the datacenter, creative destruction is a must and crucially finance must be made to understand this, or they'll give you budget targets which can only mean ossification.
> said staff was hired to work on the stack you are using
Looking back at making various hiring decisions at various levels of organizations, this is probably the single biggest mistake I've made multiple times: hiring people for a specific technology because that was specifically what we were using.
You'll end up with a team unwilling to change, because "you hired me for this, even if it's best for the business with something else, this is what I do".
Once I and the organizations I worked with shifted our mindset to hiring more flexible people, who may have expertise in one or two specific technologies but won't put their heads in the sand whenever change comes up, everything became a lot easier.
Exactly. If someone has "Cloud Engineer" in the headline of their resume instead of "DevOps Engineer" it's already a warning sign and worth probing. If someone has "AWS|VMWare Engineer" in their bio, it's a giant red flag to me. Sometimes it's just people being aware of where they'll find demand, but often it's indicative of someone who will push their pet stack - and it doesn't matter if it's VMWare on-prem or AWS (both purely as examples; it doesn't matter which specific tech it is), it's equally bad if they identify with a specific stack irrespective of what the stack is.
I'll also tend to look closely at whether people have "gotten stuck" specialising in a single stack. It won't make me turn them down, but it will make me ask extra questions to determine how open they are to alternatives when suitable.
The entire value proposition of AWS vs running one's own server is basically this: is it easier to ask for permission, or forgiveness? You're either asking for permission to get a million dollars' worth of servers / hardware / power upgrades now, or you're asking for forgiveness for spending five million dollars on AWS after 10 months. Which will be easier: permission or forgiveness?
Your comment also jogged my memory of how terrible bare metal days used to be. I think now with containers it can be better but the other reason so many switched to cloud is we don’t need to think about buying the bare metal ahead of time. We don’t need to justify it to a DevOps gatekeeper.
That so many people remember bare metal as of 20+ years ago is a large part of the problem.
A modern server can be power cycled remotely, can be reinstalled remotely over networked media, can have its console streamed remotely, can have fans and temperatures checked remotely without access to the OS it's running, and so on. It's not very different from managing a cloud - any reasonable server hardware has management boards. Even if you rent space in a colo, most of the time you don't need to set foot there other than for an initial setup (and you can rent people to do that too).
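To make "remotely" concrete, here's a rough sketch of the kind of out-of-band tasks a BMC/IPMI interface gives you, driven from Python via the ipmitool CLI. The host, user, and password are placeholders; in a real setup you'd pull the password from a secret store rather than hardcode it:

```python
# Sketch: out-of-band server management over IPMI, no OS access required.
import subprocess

BMC_HOST = "10.0.0.42"      # management NIC of the server, not the OS
BMC_USER = "admin"          # placeholder
BMC_PASS = "change-me"      # placeholder

def ipmi(*args: str) -> str:
    cmd = [
        "ipmitool", "-I", "lanplus",
        "-H", BMC_HOST, "-U", BMC_USER, "-P", BMC_PASS,
        *args,
    ]
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout

# Power state and remote power cycle.
print(ipmi("chassis", "power", "status"))
# print(ipmi("chassis", "power", "cycle"))

# Fan speeds, temperatures, PSU health from the sensor data repository.
print(ipmi("sdr", "list"))

# System event log -- useful when a box never came back after a reboot.
print(ipmi("sel", "list"))
```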
But for most people, bare metal will tend to mean renting bare metal servers already configured anyway.
When the first thing you then tend to do is to deploy a container runtime and an orchestrator, you're effectively usually left with something more or less (depending on your needs) like a private cloud.
As for "buying ahead of time", most managed server providers and some colo operators also offer cloud services, so that even if you don't want to deal with a multi-provider setup, you can still generally scale into cloud instances as needed if your provider can't bring new hardware up fast enough (but many managed server providers can do that in less than a day too).
I never think about buying ahead of time. It hasn't been a thing I've had to worry about for a decade or more.
>I see so many people push AWS setups not because it's the best thing - it can be if you're not cost sensitive - but because it is what they know and they push what they know instead of evaluating the actual requirements.
I kinda feel like this argument could be used against programming in essentially any language. Your company, or you yourself, likely chose to develop using (whatever language it is) because that's what you knew and what your developers knew. Maybe it would have been some percentage more efficient to use another language, but then you and everyone else has to learn it.
It's the same with the cloud vs bare metal, though at least in the cloud, if you're using the right services, if someone asked you tomorrow to scale 100x you likely could during the workday.
And generally speaking, if your problem is at a scale where bare metal is trivial to implement, it's likely we're only talking about a few hundred dollars a month being 'wasted' in AWS. Which is nothing to most companies, especially when they'd have to consider developer/devops time.
> if someone asked you tomorrow to scale 100x you likely could during the workday.
I've never seen a cloud setup where that was true.
For starters: most cloud providers will impose limits on you that often mean going 100x would involve pleading with account managers to have limits lifted and/or scrounging up a new, previously untested combination of instance sizes.
But secondly, you'll tend to run into unknown bottlenecks long before that.
And so, in fact, if that is a thing you actually want to be able to do, you need to actually test it.
But it's also generally not a real problem. I more often come across the opposite: Customers who've gotten hit with a crazy bill because of a problem rather than real use.
But it's also easy enough to set up a hybrid setup that will spin up cloud instances if/when you have a genuine need to be able to scale up faster than you can provision new bare metal instances. You'll typically run an orchestrator and run everything in containers on a bare metal setup too, so typically it only requires having an auto-scaling group scaled down to 0, and warm it up if load nears critical level on your bare metal environment, and then flip a switch in your load balancer to start directing traffic there. It's not a complicated thing to do.
Now, incidentally, your bare metal setup is even cheaper because you can get away with a higher load factor when you can scale into cloud to take spikes.
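As a rough illustration of the warm-up half of that hybrid setup: the ASG name, threshold, and monitoring hook below are placeholders, and flipping the load balancer to send traffic to the cloud pool is a separate step not shown here.

```python
# Sketch: keep a burst-to-cloud auto scaling group at 0 instances and warm
# it up when the bare-metal tier nears capacity.
import boto3

ASG_NAME = "burst-pool"          # placeholder ASG name
WARM_CAPACITY = 4                # instances to pre-warm
LOAD_THRESHOLD = 0.8             # fraction of bare-metal capacity in use

def get_bare_metal_load() -> float:
    """Hypothetical hook into your own monitoring (Prometheus, etc.)."""
    raise NotImplementedError

def maybe_warm_up_cloud_pool() -> None:
    autoscaling = boto3.client("autoscaling")
    load = get_bare_metal_load()
    if load >= LOAD_THRESHOLD:
        autoscaling.set_desired_capacity(
            AutoScalingGroupName=ASG_NAME,
            DesiredCapacity=WARM_CAPACITY,
            HonorCooldown=False,
        )
    elif load < LOAD_THRESHOLD / 2:
        # Scale the burst pool back to zero once the spike has passed.
        autoscaling.set_desired_capacity(
            AutoScalingGroupName=ASG_NAME,
            DesiredCapacity=0,
            HonorCooldown=False,
        )
```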
> And generally speaking if your problem is at a scale where baremetal is trivial to implement, its likely we're only taking about a few hundred dollars a month being 'wasted' in AWS. Which is nothing to most companies, especially when they'd have to consider developer/devops time.
Generally speaking, I only relatively rarely work on systems that cost less than the tens of thousands per month, and what I consistently see with my customers is that the higher the cost, the bigger the bare-metal advantage tends to be, as it allows you to readily amortise the initial setup costs of more streamlined/advanced setups. The few places where cloud wins on cost are the very smallest systems, typically <$5k/month.
It's a marketing trap. But also a job guarantee, since everyone's in the same trap. You get a couple of cloud engineers or "DevOps" that lobby for AWS or any other hyperscaler, naive managers that write down some decision report littered with logical fallacies, and a few years in the sunk cost is so high you can't get off of it, and instead of doing productive work you're sitting in myriad FinOps meetings, where even fewer understand what's going on.
Engineering managers are promised cost savings on the HR level. Corporate finance managers are promised an OpEx-for-CapEx trade-off; the books look better immediately. Cloud engineers embark on their AWS certification journey with the promise of an uptick to their salaries. It's a win/win for everyone in isolation, a local optimum for everyone, but the organization now has to pay way more than it would, hypothetically, have been paying for bare metal ops. And hypothetical arguments are futile.
And it lends itself well to overengineering and the microservices cargo cult. Your company ends up with a system distributed around the globe across multiple AZs per region of business operations, striving to shave off those 100ms latency off your clients’ RTT. But it’s outgrown your comprehension, and it’s slow anyway, and you can’t scale up because it’s expensive. And instead of having one problem, you now have 99 and your bill is one.
My last team decided to hand-manage a Memcached cluster because, unmanaged, it cost half as much as AWS's managed alternative. Don't know how much we really saved versus the opportunity cost on dev time, though. But it's probably close to negative.
One of the issues there is that picking a managed service deprives your people of gaining extra experience. There's a synergy over time, the more you manage yourself. But it's totally justified to pick a managed service if it checks out for your budget. The problem I often saw was bad decision making and bad opportunity-cost estimation. In other words, there's an opportunity cost to picking the managed service too, and they offset each other more or less.
My manager wants me to make this silly AWS certification.
Let me go on a tangent about trains. In Spain, before you board a high-speed train you need to go through a full security check, like at an airport. In all other EU countries you just show up and board, but in Spain there's the security check. The problem is that even though the security check is an expensive, inefficient piece of theatre, just in case something does blow up, nobody wants to be the politician that removed the security check. There will be no reward for a politician who makes life marginally easier for lots of people, but there will be severe punishment for a politician who is involved in a potential terrorist attack, even if the chance of that happening is ridiculously small.
This is exactly why so many companies love to be balls deep into AWS ecosystem, even if it's expensive.
Nobody gets fired for buying IB^H^H AWS
https://en.wikipedia.org/wiki/2015_Thalys_train_attack
> In all other EU countries you just show up and board, but in Spain there's the security check
Just for curiosity's sake, did any other EU countries have any recent terrorist attacks involving bombs on trains in the capital, or is Spain so far alone with this experience?
Check out the 2004 Madrid terror attacks... So deadly that Spain left Afghanistan and Iraq, afaik.
That's exactly the event I was alluding to, good detective work :)
London had the tube bombings, but there is no security scanning there.
AFAIK, there is no security scanning on the metro/"tube" in Spain either, it's on the national train lines.
Edit: Also, after looking it up, it seems like London did add temporary security scanners at some locations in the wake of those bombings, although they weren't permanent.
Russia is the only other European country besides Spain that after train bombings added permanent security scanners. Belgium, France and a bunch of other countries have had train bombings, but none of them added permanent scanners like Spain or Russia did.
How does Spain deal with trains that come in from a neighboring country?
The security check has nothing to do with protecting trains or passengers, so your question is irrelevant.
Thanks for letting me know that my question is irrelevant. Sorry for taking up your time.
French trains come in without any security checks.
AWS doesn’t have to be expensive.
Sure, but you outgrow the free ("trial") resources in a blink, and then it starts being expensive compared to the alternatives.
AWS may be overcharging but it's a balancing act. Going on-prem (well, shared DC) will be cheaper but comes with requirements for either jack of all trades sysadmins or a bunch of specialists. It can work well if your product is simple and scalable. A lot of places quietly achieve this.
That said, I've seen real world scenarios where complexity is up the wazoo and an opex cost focus means you're hiring under skilled staff to manage offerings built on components with low sticker prices. Throw in a bit of the old NIH mindset (DIY all the things!) and it's large blast radii with expensive service credits being dished out to customers regularly. On a human factors front your team will be seeing countless middle of the night conference calls.
While I'm not 100% happy with the AWS/Azure/GCP world, the reality is that on-prem skillsets are becoming rarer and more specialist. Hiring good people can be either really expensive or a bit of a unicorn hunt.
It's a chicken-and-egg problem. If the cloud hadn't become such a prominent thing, the last decade and a half would have seen the rise of much better tools to manage on-premise servers (= requiring less in-depth sysadmin expertise). I think we're starting to see such tools appear in the last few years, after enough people got burned by cloud bills and lock-in.
And don't forget the real crux of the problem: do I even know whether a specialist is good or not? Hiring experts is really difficult if you don't have the skill in the topic, and if you do, you either don't need an expert, or you will be biased towards those who agree with you.
It's not even limited to sysadmins, or in tech. How do you know whether a mechanic is very good, or iffy? Is a financial advisor giving you good advice, or basically robbing you? It's not as if many companies are going to hire 4 business units worth of on prem admins, and then decide which one does better after running for 3 years, or something empirical like that. You might be the poor sob that hires the very expensive, yet incompetent and out of date specialist, whose only remaining good skill is selling confidence to employers.
> Do I even know whether a specialist is good or not?
Of course but unless I misunderstood what you meant to say, you don't escape that by buying from AWS. It's just that instead of "sysadmin specialists" you need "AWS specialists".
If you want to outsource the job then you need to go up at least 1 more layer of abstraction (and likely an order of magnitude in price) and buy fully managed services.
This only gets worse as you go higher in management. How does a technical founder know what good sales or marketing looks like? They are often swayed by people who can talk a good talk and deliver nothing.
The good news with marketing and sales is that you want the people who talk a good talk, so you're halfway there, you just gotta direct them towards the market and away from bilking you.
I'm proudly a 100% on-prem Linux sysadmin. There are no openings for my skills and they do not pay as well as whatever cloud hotness is "needed".
Nobody is hiring generalists nowadays.
At the same time, the incredible complexity of the software infrastructure is making specialists more and more useless. To the point that almost every successful specialist out there is just some disguised generalist that decided to focus their presentation in a single area.
Maybe everyone is retaining generalists. I keep being given retention bonuses every year, without asking for a single one so far.
As mentioned below, never labeled "full stack", never plan on it. "Generalist" is what my actual title became back in the mid 2000s. My career has been all over the place... the key is being stubborn when confronted with challenges and being able to scale up (mentally and sometimes physically) to meet the needs, when needed. And chill out when it's not.
> Nobody is hiring generalists nowadays.
What?
I throw up in my mouth every time I see "full stack" in a job listing.
We got rid of roles... DBA's, QA teams, Sysadmins, then front and back end. Full Stack is the "webmaster" of the modern era. It might mean front and back end, it might mean sysadmin and DBA as well.
That's the crazy thing.
Most AWS-only Ops engineers I know are making bank and in high demand, and Ops teams are always HUGE in terms of headcount outside of startups.
The "AWS is cheaper" thing is the biggest grift in our industry.
I think this is driven by the market itself and the way cloud promotes their product.
After being fully in the cloud for some time, we're moving to hybrid solutions. Upper management is happy with the costs and the cloud engineers have new toys.
I wonder how vibe coding will impact this.
You can easily get your service up by asking claude code or whatever to just do it
It produces aws yaml that’s better than many devops people I’ve worked with. In other words, it absolutely should not be trusted with trivial tasks, but you could easily blow $100K’s per year for worse.
I've been contemplating this a lot lately, as I just did code review on a system that was moving all the AWS infrastructure into CDK, and it was very clear the person doing it was using an LLM which created a really complicated, over engineered solution to everything. I basically rewrote the entire thing (still pairing with Claude), and it's now much simpler and easier to follow.
So I think for developers that have deep experience with systems LLMs are great -- I did a huge migration in a few weeks that probably would have taken many months or even half a year before. But I worry that people that don't really know what's going on will end up with a horrible mess of infra code.
To me it's clear that most Ops engineers are vibe coding their scripts/yamls today.
The time it takes to have a script ready has decreased dramatically in the last 3 years. The number of problems when deploying it the first time has also increased in the same period.
The difference between the ones who actually know what they're doing and the ones who don't is whether they will refactor and test.
It depends upon how many resources your software needs. At 20 servers we spend almost zero time managing our servers, and with modern hardware 20 servers can get you a lot.
It's easier than ever to do this, but people are doing it less and less.
Managed servers reduce the on-prem skillset requirement and can also deliver a lot of value.
The most frustrating part of hyperscalers is that it's so easy to make mistakes. Active tracking of your bill is a must, but the data is 24-48h late in some cases. So a single engineer can cause 5-figure regrettable spend very quickly.
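The delay is real and nothing fully removes it, but a coarse guard rail at least catches the worst overruns. A sketch of a billing alarm with boto3; it requires billing alerts to be enabled on the account, the metric only exists in us-east-1, and the SNS topic ARN and threshold are placeholders:

```python
# Sketch: a CloudWatch alarm on the account's estimated charges, so a
# runaway experiment pages someone before the invoice does.
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

cloudwatch.put_metric_alarm(
    AlarmName="monthly-bill-over-10k",
    Namespace="AWS/Billing",
    MetricName="EstimatedCharges",
    Dimensions=[{"Name": "Currency", "Value": "USD"}],
    Statistic="Maximum",
    Period=6 * 60 * 60,          # the metric only updates a few times a day
    EvaluationPeriods=1,
    Threshold=10_000.0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:billing-alerts"],  # placeholder
)
```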
What size companies are we talking about?
> I'm so surprised there is so much pushback against this.. AWS is extremely expensive.
Basic rationalization. People will go to extraordinary lengths to justify and defend the choices they made. It's a defense mechanism: if they spent millions on AWS they are not going to sit idly while HN discusses saving hundreds of thousands with everyone nodding and agreeing. It's important for their own sanity to defend the choice they made.
> I'm so surprised there is so much pushback against this
I'm not. It seems to be happening a lot. Any time a topic about not using AWS comes up here, or on Reddit, there's a sudden surge of people appearing out of nowhere shouting down anyone who suggests other options. It's honestly starting to feel like paid shilling.
I don’t think it’s paid shilling, it’s dogma that reflects where people are working here. The individual engineers are hammers and AWS is the nail.
AWS/Azure/GCP is great, but like any tool or platform you need to do some financial/process engineering to make an optimal choice. For small companies, time to market is often key, hence AWS.
Once you're a little bigger, you may develop frameworks to operate efficiently. I have apps that I run in a data center because they'd cost 10-20x at a cloud provider. Conversely, I have apps that get more favorable licensing terms in AWS that I run there, even though the compute is slower and less efficient.
You also have people who treat AWS with the old “nobody gets fired for buying IBM” mentality.
I think a lot of engineers who remember the bare metal days have legitimate qualms about going back to the way that world used to work especially before containerization/Kubernetes.
I imagine a lot of people who use Linux/AWS now started out with bare metal Microsoft/VMWare/Oracle type of environments where AWS services seemed like a massive breath of fresh air.
I remember having to put in orders for pallets of servers which then ended up in storage somewhere because there were not enough people to carry and wire them up and/or there wasn't enough rack space to install them.
Having an ability to spin up a server or a vm when you need it without having to ask a single question is very liberating. Sometimes such elasticity is exactly what's needed. OTOH other people's servers aren't always the wise choice, but you have to know both environments to make the right choice, and nowadays I feel most people don't really know anything about bare metal.
the best is having rackspace & power but not enough cooling, hahaha murder me
That only happens when you have your own data center. That's a whole different issue and most people with their own hardware don't have their own data centers as it's not particularly cost efficient except at incredibly large scale.
I spin up a VM on my Xen VM estate whenever I want it, with just some clickops or Terraform (depending on the environment).
That's the beauty of VMs.
Luckily, Amazon is far from the only VM provider out there, so this discussion doesn't need to be polarized between "AWS everything" and "on-premise everything". You can rent VMs elsewhere for a fraction of the cost. There are many places that will rent you bare metal servers by the hour, just as if they were VMs. You can even mix VMs and bare metal servers in the same datacenter.
No doubt -- there are plenty of downsides to running your own stuff. I'm not anti-AWS. I'm pro-efficiency, and pro making deliberate choices. If the choice is to spend $10M extra on AWS because the engineers get a good vibe, there should be a compelling reason why that vibe is worth $10M. (And there may well be.)
Look at what Amazon/Google/Microsoft does. If you told me you advocate running your own power plants, I'd eyeroll. But... if you're as large a power consumer as a hyper-scaler, totally different story. Google and Microsoft are investing in lighting up old nuclear plants.
Containers with k8s and bare metal aren't mutually exclusive.
If anything it enables a hybrid environment
> It's honestly starting to feel like paid shilling.
the companies selling Cloud are also massive IT giants with unlimited compute resources and extensive online marketing operations.
like of fucking course they're using shillbots, they run the backend shillbot infrastructure.
they literally have LLM chatbot agents as an offering, and it's trivially easy to create fake users and repost / retweet last week's comments to create realistic-looking accounts, which then shill hard for whatever their goals are.
It's the current version of CCIE or some of the other certs. People pay money to learn how to operate AWS; other things erode the value of their investment.
A lot of people here's careers have been made by moving into AWS. A lot of people's future careers will be made by moving out of AWS. That's just the tech treadmill in action.
Do what works best for your situation.
I think some of that is a certain group of people will do anything to play with the new shiny stuff. In my org it's cloud and now GPU.
The cloud stuff is extremely expensive and doesn't work any better than our existing solutions. Like a commentator said below, it's insidious as your entire organization later becomes dependent on that. If you buy a cloud solution, you're also stuck with the vendor deciding to double the cost of the product once you're locked in.
The GPU stuff is annoying as all of our needs are fine with normal CPU workloads today. There are no performance issues, so again...what's the point? Well... somebody wants to play with GPUs I guess.
Resume-driven development. It's probably pretty much always been a thing.
If your spend is less than a few thousand per month, using cloud services is a no-brainer. For most startups starting up, their spend is minimal, so launching on the cloud is the default (and correct!) option.
Migrating to lower cost options thereafter when scaling is prudent, but you "build one to throw away", as it were.
I'm not either. I used to do fully managed hosting solutions at a datacenter. I had to do everything from hardware through debugging customer applications. Now, people pay me to do the same but on cloud platforms and the occasional on-prem stuff. In general, the younger people I've come across have no idea how to set anything up. They've always just used awscli, the AWS Console, or terraform. I've even been ridiculed for suggesting people not use AWS. Thing is, public cloud really killed my passion for the industry in general.
Beyond public cloud being bad for the planet, I also hate that it drains companies of money, centralizes everyone's risk, and helps to entrench Amazon as yet another tech oligarchic fiefdom. For most people, these things just don't matter apparently.
> Thing is, public cloud really killed my passion for the industry in general.
Similar here, I think. I got into Computer Science because I liked software... the way it was. Now I truly think that most software completely sucks.
The thing is that it has grown so much since then, that most developers come from a different angle.
I think in 5-10 years there is going to be very profitable consulting on setting up data center infrastructure, and de-clouding for companies.
Why do you think public cloud is worse for the environment than a private dc? I'd expect the larger dcs to be more energy efficient.
I think people that lived through the time where their servers were down because the admin forgot to turn them back on after he drove 50 miles back from the colo might not want to live through that again.
> I'm so surprised there is so much pushback against this.. AWS is extremely expensive.
I see more comments in favor than pushing back.
The problem I have with these stories is the confirmation bias that comes with them. Going self-hosted or on-premises does make sense in some carefully selected use cases, but I have dozens of stories of startup teams spinning their wheels with self-hosting strategies that turn into a big waste of time and headcount that they should have been using to grow their businesses instead.
The shared theme of all of the failure stories is missing the true cost of self-hosting: The hours spent getting the servers just right, managing the hosting, debating the best way to run things, and dealing with little issues add up but are easily lost in the noise if you’re not looking closely. Everyone goes through a honeymoon phase where the servers arrive and your software is up and running and you’re busy patting yourselves on the back about how you’re saving money. The real test comes 12 months later when the person who last set up the servers has left for a new job and the team is trying to do forensics to understand why the documentation they wrote doesn’t actually match what’s happening on the servers, or your project managers look back at the sprints and realize that the average time spent on self-hosting related tasks and ideas has added up to a lot more than anyone would have guessed.
Those stories aren’t shared as often. When they are, they’re not upvoted. A lot of people in my local startup scene have sheepish stories about how they finally threw in the towel on self-hosting and went to AWS and got back to focusing on their core product. Few people are writing blog posts about that because it’s not a story people want to hear. We like the heroic stories where someone sets up some servers and everything just works perfectly and there are no downsides.
You really need to weigh the tradeoffs, but many people are not equipped to do that. They just think their chosen solution will be perfect and the other side will be the bad one.
> I have dozens of stories of startup teams spinning their wheels with self-hosting strategies that turn into a big waste of time and headcount that they should have been using to grow their businesses instead.
Funnily enough, the article even affirms this, though most people seem to have skimmed over it (or not read it at all).
> Cloud-first was the right call for our first five years. Bare metal became the right call once our compute footprint, data gravity, and independence requirements stabilised.
Unless you've got uncommon data egress requirements, if you're worried about optimizing cloud spend instead of growing your business in the first 5 years you're almost certainly focusing on the wrong problem.
> You really need to weigh the tradeoffs, but many people are not equipped to do that. They just think their chosen solution will be perfect and the other side will be the bad one.
This too. Most of the massive AWS savings articles in the past few days have been from companies that do a massive amount of data egress i.e. video transfer, or in this case log data. If your product is sending out multiple terabytes of data monthly, hosting everything on AWS is certainly not the right choice. If your product is a typical n-tier webapp with database, web servers, load balancer, and some static assets, you're going to be wasting tons of time reinventing the wheel when you can spin up everything with redundancy & backups on AWS (or GCP, or Azure) in 30 minutes.
> The shared theme of all of the failure stories is missing the true cost of self-hosting: The hours spent getting the servers just right, managing the hosting, debating the best way to run things, and dealing with little issues add up but are easily lost in the noise if you’re not looking closely.
What the modern software business seems to have lost is the understanding that ops and dev are two different universes. DevOps was a reaction to the fact that even outsourcing ops to AWS doesn’t entirely solve all of your ops problems and the role is absolutely no substitute for a systems administrator. Having someone that helps derive the requirements for your infrastructure, then designs it, builds it , backs it up, maintains it, troubleshoots it, monitors performance, determines appropriate redundancy, etc. etc. etc. and then tells the developers how to work with it is the missing link. Hit-by-a-bus documentation, support and update procedures, security incident response… these are all problems we solved a long time ago, but sort of forgot about moving everything to cloud architecture.
> DevOps was a reaction to the fact that even outsourcing ops to AWS doesn’t entirely solve all of your ops problems and the role is absolutely no substitute for a systems administrator.
This is revisionist history. DevOps was a reaction to the fact that many/most software development organizations had a clear separation between "developers" and "sysadmins". Developers' responsibility ended when they compiled an EXE/JAR file/whatever, then they tossed it over the fence to the sysadmins who were responsible for running it. DevOps was the realization that, huh, software works better when the people responsible for building it ("Dev") are also the people responsible for keeping it running ("Ops").
> DevOps was a reaction to the fact that even outsourcing ops to AWS doesn’t entirely solve all of your ops problems
DevOps, conceptually, goes back to the 90s. I was using the term in 2001. If memory serves, AWS didn't really start to take off until the mid/late aughts, or at least not until they launched S3.
DevOps was a reaction to the software lifecycle problem and didn't have anything to do with AWS. If anything it's the other way around: AWS and cloud hosting gained popularity in part due to DevOps culture.
> What the modern software business seems to have lost is the understanding that ops and dev are two different universes.
This is a fascinating take, if you ask me, treating them as separate is the whole problem!
The point of being an engineer is to solve real world problems, not to live inside your own little specialist world.
Obviously there's a lot to be said for being really good at a specialized set of skills, but that's only relevant to the part where you're actually solving problems.
A large part of the different views on this topic are due to the way people estimate the amount of saved effort and money because you're pushing some admin duties to the cloud provider instead of doing this yourself. And people come to vastly different conclusions on this aspect.
It's also that the requirements vary a lot, discussions here on HN often seem to assume that you need HA and lots of scaling options. That isn't universally true.
> A large part of the different views on this topic are due to the way people estimate the amount of saved effort and money because you're pushing some admin duties to the cloud provider instead of doing this yourself. And people come to vastly different conclusions on this aspect
This applies only if you have an extra customer that pays the difference. Basically, the argument only holds if you can't take on more customers because maintaining the infrastructure takes too much time, or if you need to hire an extra person who costs more than the AWS bill difference.
> I'm so surprised there is so much pushback against this.. AWS is extremely expensive. The use cases for setting up your system or service entirely in AWS are more rare than people seem to realise. Maybe I'm just the old man screaming at cloud (no pun intended) but when did people forget how to run a baremetal server ?
Long term yes you can save money rolling your own.
But with cloud you can get something up and running within maybe a few days, sometimes even faster. Often with built in scalability.
This is a much easier sell to the non-tech (i.e., money) people.
If the project continues, the path of least resistance is often to just continue with the cloud solution. At a certain point, there will be so much tech debt that any long-term savings from traditional on-premises, co-location or managed hosting are vastly outweighed by the cost of migration.
> Maybe I'm just the old man screaming at cloud (no pun intended) but when did people forget how to run a baremetal server ?
It's a way to "commoditize" engineers. You can run on-premise or mixed infra better and cheaper, but only if you know what you are doing. This requires experienced people and doesn't work with new grads hired by big consultancies and sold as "cloud experts".
Also, when something breaks, you are responsible. If you put it in AWS like everyone else and it breaks, then it's their problem, not yours. We will still implement workarounds and fixes when it happens, but we are not responsible. The basic enterprise rule these days is to always pay someone else to be responsible.
Actually nothing new here; this was the same in the pre-cloud era, where everyone in enterprises preferred big names (IBM, Microsoft, Oracle, etc.) to pass the responsibility to them in case of failure ... aka "nobody gets fired for buying IBM".
And the big name companies always refuse to take responsibility, and have worse reliability metrics than the lean alternatives...
but somehow that is never a problem.
Reality matters less than perception.
It's always your problem. The difference is, if you control things, you can fix it, work around it, resolve it.
If not, you're at the mercy of others.
Unless you put someone on retainer to be responsible, which you can do cheaper than to keep your AWS setup from breaking...
(I do that for people; my AWS using customers consistently end up needing more help)
The point isn't cost, it's dodging responsibility.
You can dodge responsibility equally well by outsourcing to people who'll run your bare metal setup for you. We exist from small consultancies like mine to huge multinationals.
> then its their problem not yours
this is the main advantage of cloud, no one cares if the site/service/app is down as long as it's someone else's fault and responsibility.
A lot of people here have built their whole professional careers around knowing AWS and deploying to it.
Moving away is an existential issue for them - this is why there's such pushback. A huge % of new developer and devops generation doesn't know anything about deploying software on bare metal or even other clouds and they're terrified about being unemployed.
Meanwhile, skills in operating systems, networking, and optimization are declining. Every system I've seen in the last 10 years or so has left huge amounts of cash on the table by not being aware of the basics.
> Maybe I'm just the old man screaming at cloud (no pun intended) but when did people forget how to run a baremetal server ?
We should coin the term "Cloud Learned Helplessness"
I'm on a Platform team of <8 people and only 3 of us (most experienced too) come from sysadmin backgrounds. The rest have only ever known containers/cloud and never touched (both figuratively and literally :-) bare metal servers in their careers.
They've never used tools like Ansible (or Anaconda) or been in situations where they couldn't destroy the container and start afresh instantly.
I once moved a small site from AWS to Digital Ocean + Cloudflare.
$100-$300 on AWS -> $35/mo for DO + CF. Coincidentally, AWS had an outage soon after, which was avoided thanks to the move.
I have used DO for both clients and myself, and have not had any huge problems with them.
As the author points out AWS can provide a few things that you wouldn’t want to try and replicate (like CloudFront) but for most other things you’re very much correct. AWS is ultimately very expensive for what it is. The complicated billing that’s full of surprises also makes cost management a head-banging experience.
Fair, though using AWS solely for CloudFront would mean you should compare to Cloudflare, Akamai, Fastly, etc. I'm not sure if the value prop for it looks so great if you don't include the "integrated with your other AWS stuff" benefit.
Agree, CloudFront isn’t super competitive with CDN focused vendors. It’s basically the “well you’re already on AWS so may as well just use this” play.
I mean, AWS egress is so expensive that I'd put something else in front of it for anyone who has any decent amount of traffic.
I work for a small company owned by a huge company. We are entirely independent except for purchasing, IT, and budget approval. We run our CI on AWS, and it’s slow and flaky for a variety of reasons (compiling large c++ projects combined with instance type pressure). It’s also expensive.
We planned a migration to move from 4 on-demand instances to one on-prem machine, and we guessed we'd save $1000/mo, our builds would be faster and we'd have fewer failures due to capacity issues. We even had a spare workstation and a rack in the office, so the capex was 0.
I plugged the machine into the rack and no internet connectivity. Put in an IT ticket which took 2 days for a reply, only to be told that this was an unauthorised machine and needed to be imaged by IT. The back and forth took 4 weeks, multiple meetings and multiple approvals. My guess is that 4 people spent probably 10 hours arguing whether we should do this or not.
On AWS I can write a python script and have a running windows instance in 15 minutes.
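For the curious, that script really is about this small. A sketch with boto3; the region, AMI ID, key pair, and security group are placeholders (in practice you'd look up the current Windows Server AMI rather than hardcode it):

```python
# Sketch: launch a Windows build box on demand and wait for it to be ready.
import boto3

ec2 = boto3.client("ec2", region_name="eu-west-1")  # placeholder region

resp = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",   # placeholder Windows Server AMI
    InstanceType="c5.2xlarge",
    KeyName="ci-keypair",              # placeholder key pair
    SecurityGroupIds=["sg-0123456789abcdef0"],  # placeholder security group
    MinCount=1,
    MaxCount=1,
    TagSpecifications=[{
        "ResourceType": "instance",
        "Tags": [{"Key": "purpose", "Value": "ci-build"}],
    }],
)

instance_id = resp["Instances"][0]["InstanceId"]

# Block until the instance passes its status checks, then report it.
ec2.get_waiter("instance_status_ok").wait(InstanceIds=[instance_id])
print("ready:", instance_id)
```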
This is the root success of aws, it lets internal teams bypass sysadmin departments.
The same story applies for software. If I want to buy a license of X for someone, I have to go through procurement, and it takes weeks even for <$50 purchases. Yet if it's on the AWS Marketplace it's pre-approved, as long as it doesn't breach the AWS budget.
Working around official IT was certainly a significant factor early on. I'm less convinced it is nearly as big a driver (or a downside depending on your perspective) today.
Especially considering that outside of startups (where approval would be fast with or without cloud), virtual infrastructure also got its own bureaucratic process.
A lot of people forget that, when server virtualization was still gaining momentum in a lot of circles, it wasn't uncommon at less technically savvy customers (say, a regional bank at the time) to be told that it might take 2 months to provision a new server.
I don't think anyone is forgetting that in this thread, as there's dozens of answers mentioning this.
But as an example: It took about 3 months to provision an AWS server in a recent company I consulted for due to their own bureaucracy and ineptitude of the Ops team.
On the other hand, when I needed a few CI servers for a startup I worked at, I just collected them from AppleStore during lunch hour.
Now this above is what people are "forgetting" and don't want to listen to.
There is this belief that it is not extremely expensive and/or that the ops cost of bare metal will outpace it. It is a belief, and it is very rarely supported by facts.
Having done consulting in this space for a decade, and worked with containerised systems since before AWS existed, my experience is that managing an AWS system is consistently more expensive and that in fact the devops cost is part of what makes AWS an expensive option.
The complexity of AWS versus bare metal depends on what you are doing. Setting up an apache app server: just as easy on bare metal. Setting up high availability MySQL with hot failover: much easier on AWS. And a lot of businesses need a highly available database.
Most businesses really don't need that complexity. They think they do. Premature optimization.
If your database has a hardware failure then you could lose all sales and customer data since your last backup, plus the cost of the downtime while you restore. I struggle to think of a business where that is acceptable.
Why are you ignoring the huge middle ground between "HA with fully automated failover" and "no replication at all"?
Basic async logical replication in MySQL/MariaDB is extremely easy to set up, literally just a few commands to type.
Ditto for doing failover manually the rare times it is needed. Sure, you'll have a few minutes of downtime until a human can respond to the "db is down" alert and initiates failover, but that's tolerable for many small to medium sized businesses with relatively small databases.
That approach was extremely common ~10-15 years ago, and online businesses didn't have much worse availability than they do today.
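For reference, here's a sketch of those few commands, wrapped in a Python client purely for readability. Hosts, credentials, and binlog handling are placeholders, the replica is assumed to start from a fresh dump of the source, and on MySQL 8.0.22+ the statements are spelled CHANGE REPLICATION SOURCE TO / START REPLICA instead:

```python
# Sketch: set up basic async replication between two MySQL/MariaDB servers.
# Assumes the mysql-connector-python package, that the source has binary
# logging enabled and a unique server_id, and that the replica was restored
# from a recent dump of the source.
import mysql.connector

SOURCE = {"host": "db1.internal", "user": "root", "password": "..."}   # placeholders
REPLICA = {"host": "db2.internal", "user": "root", "password": "..."}  # placeholders

# 1. On the source: create a user the replica can pull the binlog with,
#    and note the current binlog coordinates.
src = mysql.connector.connect(**SOURCE)
scur = src.cursor()
scur.execute("CREATE USER IF NOT EXISTS 'repl'@'%' IDENTIFIED BY 'repl-secret'")
scur.execute("GRANT REPLICATION SLAVE ON *.* TO 'repl'@'%'")
scur.execute("SHOW MASTER STATUS")
log_file, log_pos = scur.fetchone()[:2]

# 2. On the replica: point it at the source and start replicating.
rep = mysql.connector.connect(**REPLICA)
rcur = rep.cursor()
rcur.execute(
    "CHANGE MASTER TO MASTER_HOST='db1.internal', MASTER_USER='repl', "
    "MASTER_PASSWORD='repl-secret', "
    f"MASTER_LOG_FILE='{log_file}', MASTER_LOG_POS={int(log_pos)}"
)
rcur.execute("START SLAVE")
```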
I've done quite a few MySQL setups with replication. I would not call setup "extremely easy", but then, I'm not a full time DB admin. MySQL upgrades and general trouble shooting is so much more painful than AWS aurora where everything just takes a few clicks. And things like blue/green deployment, where you replicate your entire setup to try out a DB upgrade, are really hard to do onprem.
Without specifics it's hard to respond. But speaking as a software engineer who has been using MySQL for 22 years and learned administrative tasks as-needed over the years, personally I can't relate to anything you are saying here! What part of async replication setup did you find painful? How does Aurora help with troubleshooting? Why use blue/green for upgrade testing when there are much simpler and less expensive approaches using open source tools?
My "Homeserver" with its database running on an old laptop has less downtime than AWS.
I expect most, if not 99%, of all businesses can cope with a hardware failure and the associated downtime while restoring to a different server, judging from the impact of the recent AWS outage and the collective shrug in response. With a proper raid setup, data loss should be quite rare, if more is required a primary + secondary setup with a manual failover isn't hard.
That's not the same as a "high availibility hot swap redundant multi region database".
Running mysqldump to a usb disk in the office once a day is pretty cheap.
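Something along these lines, as a sketch you might run from cron. The mount point is a placeholder, credentials are assumed to come from ~/.my.cnf, and for large databases a physical backup tool (mariabackup/xtrabackup) is usually a better fit:

```python
# Sketch: nightly compressed mysqldump to a cheap local disk.
import datetime
import gzip
import subprocess

BACKUP_DIR = "/mnt/usb-backup"   # placeholder mount point

def nightly_dump() -> str:
    stamp = datetime.date.today().isoformat()
    out_path = f"{BACKUP_DIR}/all-databases-{stamp}.sql.gz"
    dump = subprocess.Popen(
        ["mysqldump", "--all-databases", "--single-transaction",
         "--routines", "--events"],
        stdout=subprocess.PIPE,
    )
    with gzip.open(out_path, "wb") as f:
        # Stream the dump through gzip in 1 MiB chunks.
        for chunk in iter(lambda: dump.stdout.read(1 << 20), b""):
            f.write(chunk)
    if dump.wait() != 0:
        raise RuntimeError("mysqldump failed")
    return out_path

if __name__ == "__main__":
    print("wrote", nightly_dump())
```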
A high-availability MySQL server on AWS is about the same difficulty as on your own Kubernetes cluster (I've got a play one on one of those $100 N100 machines, got one with 16G mem). Then you can just provision a MariaDB "kind", i.e. you kubectl apply something specifying the database name, maximum memory, type of high availability (single primary or multi-master) and a secret reference, and there you go: a new database, ready to be plugged into other pods.
Don't you need ECC in your db nodes?
N100 supports DDR5 memory (although 1 channel) but I believe DDR5 has some error correction... May not be full ECC
Amazing how nobody even knows about ECC these days.
I see so many Series B+ companies running DBs and storage without a care in the world.
Forget? You have to hire people for that. We are a software organization. We build software. If we rent in the cloud, there is less HR hassle - hiring, raises, bonuses, benefits, firing … none of that headache involved with the cloud.
Technically? Totally doable. But the owners prefer renting in the cloud over the people-related issues of hiring.
This is exactly the rhetoric Microsoft used in the 00's with its "Get the facts" marketing campaign against Linux and open source: "Never mind the costs, think about the people hours you are saving!".
It wasn't as simple as that then, and it's still not as simple as that now.
This is true, but also really funny considering that even today the average windows sysadmin can still barely use powershell and relies on console clicking and batch scripts. A good unix admin can easily admin 10-100x the machines as a windows admin, and this was more true back in the early 00s. So the marketing on getting the facts was absolutely false.
Nope and never has been but to (some of) both sides “it depends” means you are on the other side.
It’s become polarised (as everything seems to).
I’ve specced bare metal, I’ve specced AWS, which is used entirely a matter of the problem/costs and relative trade-offs.
That is all it is.
In fairness to Microsoft, this argument should have been correct. It ought to be possible for Microsoft to offer products with better polish and better support than open source alternatives, and that ought to more than compensate for any licensing costs. Whether Microsoft actually managed to do this is debatable, but the principle is sound enough.
It sort of was especially with respect to desktop software. The licensing costs associated with Microsoft Office etc. were probably not really that much compared to the disruption with switching offices of people who just wanted to do their job to open source alternatives.
This is the fallacy that Amazon sold everyone on: that the cloud has no headaches or management needed. This is manifestly untrue. It's also untrue that bare metal takes lots of management time. I have multiple Dell rack servers colocated in several different datacenters, and I don't spend any time at all managing them. They just run.
> This is the fallacy that Amazon sold everyone on
I've been working at a place for a long time and we have our own data centers. Recently there has been a push to move to the public cloud and we were told to go through AWS training. It seems like the first thing AWS does in its training is spend a considerable amount of time on selling their model. As an employee who works in infrastructure, hearing Amazon sell so hard that the company doesn't need me anymore is not exactly inspiring.
After that section they seem to spend a considerable amount of time on how to control costs. These are things no one really thinks about currently, as we manage our own infra. If I want to spin up a VM and write a bunch of data to it, no one really cares. The capacity already exists and is paid for, adding a VM here or there is inconsequential. In AWS I assume we’ll eventually need to have a business justification for every instance we stand up. Some servers I run today have value, but it would be impossible to financially justify in any real terms when running in AWS where everything has a very real cost assigned to it. What I do is too detached from profit generation, and the money we could save is mostly theoretical, until something happens. I don’t know how this will play out, but I’m not excited for it.
I can confirm this.
The AWS mandatory training I did in the past was 100% marketing of their own solutions, and tests are even designed to make you memorize their entire product line.
The first two levels are not designed for engineers: they're designed for "internal salespeople". Even Product Managers were taking the certification, so they would be able to recommend AWS products to their teams.
You miss the good times spent debugging a firmware issue that leads to packet drops on the NIC (or data corruption on the NVMe).
I do not miss that crap
Every company I’ve consulted for has hired a team dedicated to just setting up and monitoring AWS for the software devs. Hell, you’d probably reduce headcount running on bare metal.
In more than 15 years of experience, across various companies, the number of people who can build and run an on-premise infrastructure sanely can be counted on the fingers of my right hand.
These people exist, but we have far more stupid "admins" around here
When you are not in the infrastructure business (I work in retail at the moment), the public cloud is the sane way to go (which is sad, but anyway)
Pretty much this. Most companies have the "devops" folks fully dedicated to maintaining the cloud stuff.
I have spent about 1 day waiting for every 5 days doing stuff at my last 3 jobs all of which were growing companies thinking that they needed the power of the cloud, but they sure as hell were not paying to make it fast or easy to use.
Pay some "devops" folks and then underfund them and give them a mandate of all ops but with less people and also you need to manage the constant churn of aws services and then deal with normal outages and dumb dev things.
I help people run their systems.
Clients that use cloud consistently end up spending more on devops resources, because their setups tend to be vastly more complex and involve more people.
I've worked on both kinds of companies in almost 25 years and I can confirm this is true.
The biggest ops teams I worked alongside were always dedicated to running AWS setups. The slowest too were dedicated to AWS. Proportionally, I mean, of course.
People here are comparing the worst possible version of bare metal with "hosting my startup on AWS".
> The biggest ops teams I worked alongside were always dedicated to running AWS setups. The slowest too were dedicated to AWS.
I wish I could come up with some kind of formalization of this issue. I think it has something to do with communication explosions across multiple people.
Increases in complexity exponentially increase mistakes + MS Teams meetings are just a glorified game of telephone.
Don't make perfect the enemy of the good.
Just because AWS abstracted something doesn't mean you don't need people who understand all the quirks of the black box you supposedly don't have to worry about. Guess what: those people are expensive. You also have to deal with a ton of crap like hard resource account limits, which on any meaningful-size project will push complexity up by forcing you to use multiple accounts.
Ultimately these owners hire me to cut their 6-figure AWS bill by 50%. It's mostly rearchitecting mistakes. Amongst them is taking AWS blog propaganda at face value. Those savings could be 80% if they chose managed bare metal (no racking and stacking).
> Forget? You have to hire people for that. We are a software organization. We build software.
You don't need to hire dedicated people full time. It could even be outsourced and then a small contract for maintenance.
It's the same argument you could say for "accounting persons", or "HR persons" - "We are a software organisation!" - Personally I don't buy the argument.
Outsourcing and cloud cost are always underestimated.
> It could even be outsourced and then a small contract for maintenance.
Yeah, those people we outsourced to happen to work at AWS.
They don't though. You still need devops when you use AWS, and most organisations end up needing more time spent on devops when they use AWS.
> We build software
Right, doesn't that include figuring out the right and best way of running it, regardless if it runs on client machines or deployed on servers?
At least I take "software engineering" to mean the full end-to-end process, from "Figure out the right thing to build" to "runs great wherever it's meant to run". I'm not a monkey that builds software on my machine and then hands it off to some deployment engineer who doesn't understand what they're deploying. If I'm building server software, part of my job is ensuring it's deployed in the right environment and runs perfectly there too.
Forgot? Running something on AWS also needs a lot of people. In my experience, even more. The term SRE did not exist before.
Until you factor in the legions of devops writing terraform, iam, and cicd scripts.
I really dislike the fallacy that just because you're buying something it means that you're not building anything. In practice this is never true: there's always some people-in-your-org time cost of buying something just as much as there's some giving-money-to-other-orgs cost to building something. So often organisations wind up buying something and spending way more time in the process than it would cost for them to build it themselves.
With AWS I think this tradeoff is very weak in most cases: the tasks that you are paying AWS for are relatively cheap in time-of-people-in-your-org, and AWS also takes up a significant amount of that time with new tasks as well. Of the organisations I'm personally aware of, the ones who hosted on-prem spent less money on their compute and had smaller teams managing it, with more effective results than those who were cloud-based (to various degrees of egregiousness, from 'well, I can kinda see how it's worth it because they're growing quickly' to 'holy shit they're setting money on fire and compromising their product because they can't just buy some used tower PCs and plug them in in a closet in the office').
Don't you have cloud architects and similar figures already?
The cloud is incredibly profitable for the efficiencies and improvements it's introduced and held onto.
Easy to push back against what is now the unknown (bare metal), when the layers extending bare metal to cloud service have become better and better, as well as more accessible.
It's always nice to remember that AWS is responsible for 70% of Amazon profits.
As Jeff Bezos has been quoted as saying "your margin is my opportunity"...
The biggest difficulty in eating into AWS market share is that believing it is cheap has become religion.
I'm not going to argue that AWS can be expensive, but in my experience its biggest advantage is SPEED. In every company I worked for that ran their own data centers, every damn thing took FOREVER. New servers took months to buy and rack. Any network change, like a new VLAN, took days to weeks. It was so annoying. But in AWS almost anything is just an API call and a few minutes at most from being enabled. It is so much more productive.
For my org, I don't have the budget for a dedicated in-house opsec team, so if I go on-prem it triggers an additional salary burden for security. How would I overcome this?
You can't. That's the use case FOR AWS/GCP. Once the differential between having an in-house team and the AWS premium becomes positive is when you make the switch.
A lot of the discussion here is that the cost of the in-house team is less than people think.
For instance: at a former gig, we used a service in the EU that handled weekends, holidays and night time issues and escalated to our team as needed. It was pretty cheap, approximately $10K monthly fee for availability and hourly rate when there were any issues to be resolved. There were a few mornings I had an email with a post-mortem report and an invoice for a hundred euros or so. We came pretty close to 5 9's uptime but we didn't have to worry about SLA's or anything.
There is also the fact that the idea that you don't need administrators for AWS is bullshit. Cool idea, bro. Go to your favorite jobs portal. Search for "devops" ... 1000s of jobs. I click on the first link.
Well, well, they have a whole team doing "devops administration" on AWS and require extra people. So not having the money for an in-house team ... no AWS for you.
I've worked for 2 large-ish firms in the past 3 years. One huge telco, one "medium" telco (still 100s of people). BOTH had a team just for AWS IAM administration. Only for that one thing, because that was company-wide (and was regularly demonstrated to be a single point of failure). And they had AWS administrator teams, yes teams, for every department (even HR had one, though in the medium telco all management had a shared team, but the networking and development departments still had their own AWS teams, who, btw, also did IAM). The company-wide IAM team maintained an AWS IAM plus some solution they'd bought that also covered their Windows domain, ticketing system (I hate you, IBM Remedy), equipment ordering portal and ...
AND there were "devops" positions on every development team, and on the network engineering team, and even a small one for the building "technics" team.
Oh, and they both had an internal cluster in addition to AWS, part on-premise, part rented DC space, which did at least half the compute work (but presumably a lot fewer of the weird edge cases); that one ran the company services that are just insanely expensive on AWS, like anything involving video.
Yeah, you need less admin work, depending, but not none. And AWS pushes you towards devops-heavy solutions.
Exactly. This is the margin AWS thrives on.
They sell "you don't need a team"... which is true in your prototype and MVP phase, and you know that when you grow you will have an ops team and maybe move out.
But in the very long middle period... you will be supporting clients, SLAs, etc., and will end up paying both AWS AND an ops team without even realizing it.
Use the same people who are now maintaining your complex AWS setup. It's not like that doesn't need maintenance or oncall.
If you don't have budget for someone to handle this for you, you can't afford AWS either, as you still need to handle the same things and they're generally more complex when you use AWS.
Familiarize yourself with your company’s decision process on strategic decisions like this. Ensure you have a way to submit a proposal for a decision on making the change (or find someone who has that access to sponsor your proposal), build a business case that shows cost of opsec team, hardware and everything else is lower than AWS (or if cost is higher then some other business value is gained from making the change — currently digital sovereignty could be a strong argument if you are EU based).
If you can't build a positive business case then it's not the correct move. Cash is king. Sadly.
The consequence of running ingress and DNS poorly is downtime.
The consequence of running a database poorly is lost data.
At the end of the day they're all just processes on a machine somewhere, none of it is particularly difficult, but storing, protecting, and traversing state is pretty much _the_ job and I can't really see how you'd think ingress and DNS would be more work than the datastores done right.
Now with AWS, I have a SaaS that makes 6 figures and the AWS bill is <$1000 a month. I'm entirely capable of doing this on-prem, but the vast majority of the bill is s3 state, so what we're actually talking about is me being on-call for an object store and a database, and the potential consequences of doing so.
With all that said, there's definitely a price point and staffing point where I will consider doing that, and I'm pretty down for the whole on-prem movement generally.
I'm generally strongly in favour of bare metal (not so much actually on prem), but your case is one of the rare ones where AWS makes sense. Even for cheap setups like that, bare metal could likely be cheaper even factoring in someone on call to handle issues for you, but the amounts are so small it's a perfectly reasonable choice to just pick whatever you're comfortable with.
That's the sweet spot for AWS customers. Not so much for AWS.
The key thing for AWS is trying to get you locked in by "helping you" depend on services that are hard to replicate elsewhere, so that if your costs grow to a point where moving elsewhere is worth it, it's hard for you to do so.
It’s expensive and the “design” of the services, if you could call it that, is such that you are forced to pay a lot, or play a lot of games to get around it. If you are going to spend your engineering time working around their ridiculous pricing schemes, you might as well spend the money on building things out yourself.
Perfect example - MSK. The brokers are config locked at certain partition counts, even if your CPU is 5%. But their MSK replicator is capped on topic count. So now I have to work around topic counts at the cluster level, and partition counts at the broker level. Neither of which are inherent limits in the underlying technologies (kafka and mirrormaker)
AWS (along with the vast majority of B2B services in the software development industry) is good because it allows you to focus on building your product or business without needing to worry about managing servers nearly as much.
The problems here are no different than using SaaS anywhere else in a business. You can also run all your sales tracking through Excel; it's just that once you have more than a few people doing sales, that becomes a major bottleneck, the same way not having easy-to-manage infrastructure does.
In the early days of cloud service providers, they offered a handful of high-value services, all at great prices, making them cost-competitive with bare metal but much easier. That was then.
Things today are different. As cloud service providers have grown to become dominant, they now offer a vast, complicated tangle of services, microservices, control panels, etc., at prices that can spiral out of control if you are not constantly on top of them, making bare metal cheaper for many use cases.
> they offered a handful of high-value services, all at great prices, making them cost-competitive with bare metal but much easier
That was never the case for AWS, the point was never "We're cheap" but "We let you scale faster for a premium".
I first came across cloud services around 2010-2011 I think, when the company I worked at at the time started growing and we needed something better than shared hosting. AWS was brought up as a "fresh but expensive" alternative, and the CTO managed to convince the management that we needed AWS even if it was expensive, because it'd be a lot easier to spin servers up and down as we needed them. Bandwidth costs, I think, were the most expensive part of the package, at least back then.
When I look at the performance per $ you get with AWS et al today, it looks the same: incredibly expensive for the performance you (don't) get. Better off with dedicated instances, unless your team is lacking the basic skills of server management, or until the company has really grown and dealing with the infrastructure keeps getting harder; then hire a dedicated person and let them make the calls on what's next.
I'd agree that AWS never sold on being cheaper, but there is one particular way AWS could be cheaper and that is their approach to billing-by-the-unit with no fixed costs or minimum charges.
Being able to start small from a $1/mth bill without any fixed cost overheads is incredibly powerful for small startups.
If I wanted to store bytes in a DC it would cost $10k/mth by the time I was paying for colo, servers, and disks, before I stored my first byte. Sure, there wouldn't be any incremental cost for the second byte, but that's a steep jump. S3 would have cost me $0.02. Being able to try technology and prove concepts at the product development stage is very powerful, and is why AWS became not just a vendor but a _technology partner_ for many companies.
> Being able to start small from a $1/mth bill without any fixed cost overheads is incredibly powerful for small startups.
Yes, no doubt about it. Initially AWS was mostly sold as "You never know when you might want to scale fast, imagine being featured in a newspaper and your servers can't handle the load, you need cloud for that!" to growing startups, and in that context it kind of makes sense, pay extra but at least be online.
But initially when you're small, or later when you're big and established, other things make more sense. But yes, I agree that if you need to be able to scale up or down aggressively, cloud resources make sense for that, in addition to your base infrastructure.
But if AWS didn't have that anti-competitive data transfer fee that gets waived if your traffic goes to an internal server, why would you choose S3 vs a white-label storage vendor's similar offering?
> the point was never "We're cheap" but "We let you scale faster for a premium"
Actually, it was more like "Scale faster, easier, more reliably, with proven hardware and software infrastructure, operated by a proven organization, at a price point that is competitive with the investment you'd have to make to get comparable hardware, software, and organizational infrastructure." But that was then. Today, things are different. Cloud services have become giant hairballs of complexity, with plenty of shoot-yourself-in-the-foot-by-default traps, at prices that can quickly spiral out of control if you're not on top of them.
This. When AWS was 10 solid core services it made sense and was exciting. It’s now a bloated mess of 200+ services (many of which almost nobody uses) with all that complexity starting to create headaches and cracks.
AWS needs to stop trying to have a half-arsed solution to every possible use case and instead focus on doing a few basic things really well.
Imo the fact that "AWS Certified Solutions Architect" is yet another AWS service/thing that is attainable via an actual exam[0] for $300 is indicative of just how intentionally bloated the entire system has become.
[0] https://aws.amazon.com/certification/certified-solutions-arc...
(Real question, not meant to be sarcastic or challenging!) -- What are the challenges in trying to use just the ~10 core services you want/need and ignoring the others? What problems do the others you don't use cause with this use case?
The early services were mostly self-contained.
A lot of newer stuff that actually scales (so Lightsail doesn't count) is entangled with "security", "observability" and "network" services. So if you just want to run EC2 + RDS today, you also have to deal with VPC, Subnets, IAM, KMS, CloudWatch, CloudTrail, etc.
Since security and logs are not optional, you have very limited choice.
Having that many required additional services means lots of hidden charges, complexity and problems. And you need a team if you're not doing small-scale stuff.
Costs have not dropped. Computing becomes cheaper over time, but AWS largely does not.
They used to release new ec2 sizes at the same price as the previous gen which made upgrading a no brainer. That stopped with m7 and doesn’t seem to be coming back.
Not sure what Amazon plans to do when the m6 hardware starts wearing out.
"Embrace, extend, extinguish". It was a Microsoft saying, but it explains Amazon's approach to Linux. Once your customers are skilled in how to do things on your platform, using your specialized products, they won't price-comparison (or compare in any other way) to competing options. Whether those countless other "half-arsed solutions" actually make money is beside the point; as long as the customer has baked at least one into their tech stack, they can't easily leave.
I don’t think I’ve seen a menu as hilariously bad as the AWS dashboard menu. No popup menu should consume the entire screen edge to edge. Just a wall of cryptic service names with ambiguous icons.
Word on the street is that Amazon leadership basically agrees with this and recognizes things have gotten off course. AWS is a small number of things that make money and then a whole bunch of slop and bloat.
AWS was mostly spared from yesterday’s big cuts but have been told to “watch this space” in the new year after re:Invent.
Anytime I have to go into the AWS control panel (which is often) I am immediately overwhelmed with a sense of dread. It's just the most bloated overcomplicated thing I could possibly imagine.
You're lucky not to have dealt with Azure and GCP control panels, in that case :-)
GCP is pretty good though, considering the complexity.
Azure is ... a different story...
AFAICT no AWS service has ever had a price increase. This is nonsense.
Considering you get exponentially more compute/hardware for the same money every 2 years or so, they haven't been getting that much cheaper.
Every generation of CPU has cost more than the last one for years now.
Cloud has been generally getting cheaper if you take inflation into account. But hating AWS is the fad so...
...while on the other side, the "traditional" hosting/colocation providers feel the squeeze and have to offer more competitive prices to stay in business?
These are the features that AWS provides
(1) Massive expansion of budget (100 - 1000x) to support empire building. Instead of one minimum-wage sysadmin with 2 high-availability, maxed-out servers for 20K - 40K (and 4-hour response time from Dell/HPE), you can have 100M multi-cloud Kubernetes + Lambda + a mix-and-match of various locked-in cloud services (DB, etc.). And you can have a large army of SRE/DevOps. You get power and influence as a VP of Cloud this and that and 300 - 1000 people reporting to you.
(2) OpEx instead of CapEx
(3) All leaders are completely clueless about hiring the right people in tech. They hire their incompetent buddies who hire their cronies. Data centers can run at scale with 5-10 good people. However, they hire 3000 horrible, incompetent, and toxic people, and they build lots of paperwork, bureaucracy, and approvals around it. Before AWS, it was VMware's internal cloud that ran most companies. Getting bare metal or a VM would take months to years, and many, many meetings and escalations. With AWS, "here is my credit card, pls gimme 2 VMs" is the biggest feature.
The problem with those 5 people is you can't hire a 6th: your stack is custom, and even if you find the guy, he'll probably need months of ramp-up.
In contrast, you could throw a stone into a bush and hit an AWS guy.
If your 6th needs months to understand how the basic blocks in your system are arranged then he might not be one of the "good" guys
Not really a hardcore infra guy, but on the coding side, I know companies with products whose codebases are in the multi-million-LoC range, written over decades. One of my friends interned at one and told me they didn't even let him work on the core product for months; they put him on some custom testing framework they had for it, just so he could get familiar enough with the core code to be able to contribute meaningfully.
He told me that before they started doing that, there were incidents like teams writing entire modules they didn't know already existed - now there were 2 pieces of code doing basically the same thing that were just incompatible enough that they couldn't be merged.
The core of this success is this, IMO:
Which TBH applies to many, many places, even if they are not aware of it.
I'd say the core of their success is running everything in a single rack in a single datacenter at first (for months? a year?) and getting lucky. Life is simple when you don't need the costs and effort of reliability upfront.
They mentioned having a backup AWS cluster that would spin up when something happens.
They mention having a second half-rack in a different DC.
In any case, not everyone needs five nines, and usually it's much easier to bring down a platform with a bug in your own software than by the core infrastructure going down at the rack level.
The point is valid, they mention adding that, so at one point they didn't have that. They're also only storing monitoring & observability data, that's never going to be mission critical for their customers.
It's probably the main reason why they were able to get away with this and why their application does not need scalability. I see they themselves are only offering two 9s of uptime.
I had a problem figuring out why the place I was working wanted to move from in-house to AWS; their workload was easily handled by a few servers, they had no big bursts of traffic, and they didn't need any of the specialized features of AWS.
Eventually, I realized that it was because the devs wanted to put "AWS" on their resumes. I wondered how long it would take management to catch on that they were being used as a place to spruce up your resume before moving on to catch bigger fish.
But not long after, I realized that the management was doing the same thing. "Led a team migration to AWS" looked good on their resume, also, and they also intended to move on/up. Shortly after I left, the place got bought and the building it was in is empty now.
I wonder, now that Amazon is having layoffs and Big Tech generally is not as many people's target employer, will "migrated off of AWS to in-house servers" be what devs (and management) want on their resume?
with "dev wanting X" nothing happens. "leadership deciding X" then it needs to get done.
Devs wanting to put AWS on their resume push for it, then the next wave you hire only knows AWS.
And then discussions on how to move forward are held between people that only know AWS and people who want to use other stuff, but only one side is transparent about it.
Many other points. When the cloud started, providers offered great value in adjacent products and services. Scaling was painful, getting bare metal hardware had long lead times, and provisioning took time. DCs were not of as high quality, and networks weren't as redundant. A lot of these are much less of an issue today.
In 2010 you could only get 64 Xeon cores by going to 8 sockets, i.e. a maximum of 8 cores per socket. And that is ignoring NUMA issues. Today you can get 256 cores per socket, each at least twice as fast. What used to be 64 servers can now be fitted into 1. And by 2030, it will be closer to a 100-to-1 ratio. Not to mention software on servers has gotten a lot faster compared to 2010: PHP, Python, Ruby, Java, ASP or even Perl. If we added everything up I wouldn't be surprised if we are at a 200- or 300-to-1 ratio compared to 2010.
I am pretty sure there is some version of Oxide in the pipeline that will catch up to the latest Zen CPU cores. If a single server isn't enough, a few Oxide racks should fit 99% of Internet companies' usage.
> Cloud makes sense when elasticity matters; bare metal wins when baseload dominates.
This really is the crux of the matter in my opinion, at least for applications (databases and so on are, in my opinion, more nuanced). I've only worked at one place where using cloud functions made sense (keeping it somewhat vague here): data ingestion from stations that could be EXTREMELY bursty. Usually we got data from the stations at roughly midnight every day, nothing a regular server couldn't handle, but occasionally a station would come back online after weeks, or new stations got connected, etc., which produced incredible load for a very short amount of time while we fetched, parsed and handled each packet. Instead of queuing things for ages we could just scale it out horizontally to handle the pressure.
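(A minimal sketch of that fan-out pattern, with hypothetical helper names since the actual pipeline wasn't described: each queued station packet becomes one function invocation, so a backlog of weeks of data spreads out horizontally instead of piling up behind one server.)

```python
import json

def handler(event, context):
    # SQS-triggered function: every queued station packet is one record,
    # and the platform scales invocations out with queue depth.
    for record in event.get("Records", []):
        packet = json.loads(record["body"])
        store_measurement(parse_packet(packet))

def parse_packet(packet):
    # Hypothetical: validate and normalise one station reading.
    return {"station": packet["station_id"], "values": packet["readings"]}

def store_measurement(measurement):
    # Hypothetical: write to whatever datastore backs the ingestion pipeline.
    pass
```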
FD: I work at Amazon, I also started my career in a time where I had to submit paper requests for servers that had turn around times measured in months.
I just don't see it. Given the nature of the services they offer it's just too risky not to use as much managed stuff with SLAs as possible. k8s alone is a very complicated control plane + a freaking database that is hard to keep happy if it's not completely static. In a prior life I went very deep on k8s, including self managing clusters and it's just too fragile, I literally had to contribute patches to etcd and I'm not a db engineer. I kept reading the post and seeing future failure point after future failure point.
The other aspect is there doesn't seem to be an honest assessment of the tradeoffs. It's all peaches and cream, no downsides, no tradeoffs, no risk assessment etc.
At another big-4 hyperscaler, we ended up with substantial downtime and a lossy migration because they didn’t know how to manage kubernetes.
Microk8s doesn’t use etcd (they have their own, simpler thing), which seems like a good tradeoff at single rack scale: https://benbrougher.tech/posts/microk8s-6-months-later/
The article’s deployment has a spare rack in a second DC and they do a monthly cutover to AWS in case the colo provider has a two site issue.
Spending time on that would make me sleep much better than hardening a deployment of etcd running inside a single point of failure.
What other problems do you see with the article? (Their monthly time estimates seem too low to me - they’re all 10x better than I’ve seen for well-run public cloud infrastructure that is comparable to their setup).
Managing a complex environment is hard, no matter whether that’s deployed on AWS or on prem. You always need skilled workers. On one platform you need k8s experts. On the other platform you need AWS experts. Let’s not pretend like AWS is a simple one-click fire and forget solution.
And let’s be very real here: if your cloud service goes down for a few hours because you screwed something up, or because AWS deployed some bad DNS rules again, the world moves on. At the end of the day, nobody gives a shit.
I agree that a business should use Kubernetes only if there is a clear need for that level of infrastructure automation. It's a time and money mistake to use K8s by default.
Many startups and companies couldn't exist if there was only AWS (or GCP / Azure) due to how much they overcharge.
For example, we couldn't offer free GeoIP downloads[0] if we were charged the outrageous $0.09 / GB, and the same is true for companies serving AI models or game assets.
But what makes me almost sick is how slow the cloud is. From network-attached disks to overcrowded CPUs, everything is so slooooow.
My experience is that the cloud is a good thing between $0 and $10,000 / month. But you should seriously consider renting bare-metal servers or owning your own after that. You can "over-provision" as much as you want when you get 10-20x (real numbers) the performance for 25% of the price.
[0] https://downloads.pingoo.io
I’ve seen cloud slowness create weird Stockholm syndrome effects, especially around disk latency.
It always makes sense to compare to back of the envelope bare metal numbers before rearchitecting your stack to work around some dumb cloud performance issue.
> Equinix Metal got the closest, but bare metal on-demand still carried a 25-30% premium over our CapEx plan. Their global footprint is tempting; we may still use them for short-lived expansion.
> The Equinix Metal service will be sunset on June 30, 2026.
https://docs.equinix.com/metal/
I put our company onto a hybrid AWS-colocation setup to attempt to get the best of both worlds. We have cheap fiddly/bursty things and expensive stable things and nothing in between. Obviously, put the fiddly/bursty things in AWS and put the stable things in colocation. Direct Connect keeps latency and egress costs down; we are 1 millisecond away from us-east-1 and for egress we pay 2¢/GB instead of the regular 9¢/GB. The database is on the colo side so database-to-AWS reads are all free ingress instead of egress, and database-to-server traffic on the colo side doesn't transit to AWS at all. The savings on the HA pair of SQL Server instances is shocking and pays for the entire colo setup, and then some. I'm surprised hybrids are not more common. We are able to manage it with our existing (small) staff, and in absolute terms we don't spend much time on it--that was the point of putting the fiddly stuff in AWS.
The biggest downside I see? We had to sign a 3 year contract with the colocation facility up front, and any time we want to change something they want a new commitment. On AWS you don't commit to spending until after you've got it working, and even then it's your choice.
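(For a rough sense of the egress saving described above, using the per-GB rates from the comment and an assumed monthly volume:)

```python
# Back-of-envelope only; 9c/GB vs 2c/GB are the rates quoted above,
# the monthly egress volume is an illustrative assumption.
egress_gb_per_month = 50_000          # assumed: 50 TB/month leaving AWS
standard_egress = 0.09                # $/GB over the public internet
direct_connect_egress = 0.02          # $/GB over Direct Connect

saving = egress_gb_per_month * (standard_egress - direct_connect_egress)
print(f"Monthly egress saving: ${saving:,.0f}")   # $3,500 at 50 TB/month
```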
> It depends on your workload.
Very much this.
Small team in a large company who has an enterprise agreement (discount) with a cloud provider? The cloud can be very empowering, in that teams who own their infra in the cloud can make changes that benefit the product in a fraction of the time it would take to work those changes through the org on prem. This depends on having a team that has enough of an understanding of database, network and systems administration to own their infrastructure. If you have more than one team like this, it also pays to have a central cloud enablement team who provides common config and controls to make sure teams have room to work without accidentally overrunning a budget or creating a potential security vulnerability.
Startup who wants to be able to scale? You can start in the cloud without tying yourself to the cloud or a provider if you are really careful. Or, at least design your system architecture in such a way that you can migrate in the future if/when it makes sense.
As someone who works with firmware, it is funny how different our definitions of "bare metal" are.
As someone who does material science, it's funny how our definition of "bare metal" is so different.
As someone who listens to loud rock and roll music …
Ask an astronomer what a “metal” is.
Wikipedia still thinks it means the thing I (and presumably you) do.
https://en.wikipedia.org/wiki/Bare_metal
Edit: For clarity, wikipedia does also have pages with other meanings of "bare metal", including "bare metal server". The above link is what you get if you just look up "bare metal".
I do aim to be some combination of clear, accurate and succinct, but I very often seem to end up in these HN pissing matches so I suppose I'm doing something wrong. Possibly the mistake is just commenting on HN in itself.
Seems there is a difference between "Bare Metal" and "Bare Machine".
I'm not sure what you did, but when you go to that Wikipedia article, it redirects to "Bare Machine", and the article content is about "Bare Machine". Clicking the link you have sends you to https://en.wikipedia.org/wiki/Bare_machine
So it seems like you almost intentionally shared the article that redirects, instead of linking to the proper page?
I indeed deliberately pasted a link that shows what happens when you try to go to the Wikipedia page for "bare metal".
Right, slightly misleading though, as https://en.wikipedia.org/wiki/Bare-metal_server is a separate page.
Yes, but if you look up "bare metal" it goes to the page about actual bare metal (aka "bare machines" or whatever).
Can we stop this now? Please?
> Yes, but if you look up "bare metal" it goes to the page about actual bare metal (or bare machines or whatever).
Fix it then, if you think it's incorrect. Otherwise, link to https://en.wikipedia.org/wiki/Bare_metal_(disambiguation) like any normal and charitable commentator would do.
> Can we stop this now? Please?
Sure, feel free to stop at any point you want to.
There is nothing that needs fixing? Both my link and yours give the same "primary" definition for "bare metal". Which is not unequivocally the correct definition, but it's the one I and the person I was replying to favour.
I thought my link made the point a bit better. I think maybe you've misunderstood something about how Wikipedia works, or about what I'm saying, or something. Which is OK, but maybe you could try to be a bit more polite about it? Or charitable, to use your own word?
Edit: In case this part isn't obvious, Wikipedia redirects are managed by Wikipedia editors, just like the rest of Wikipedia. Where the redirect goes is as much an indication of the collective will of Wikipedia editors as eg. a disambiguation page. I don't decide where a request for the "bare metal" page goes, that's Wikipedia.
In similar way I once worked on a financial system, where a COBOL-powered mainframe was referred to as "Backend", and all other systems around it written in C++, Java, .NET, etc. since early 80s - as "Frontend".
Had somewhat similar experience, the first "frontend" I worked on was a sort of proxy server that sat in front of a database basically, meant as a barrier for other applications to communicate via. At one point we called the client side web application "frontend-frontend" as it was the frontend for the frontend.
I don't work in firmware at all, but I'm working next to a team now migrating an application from VMs to K8S, and they refer to the VMs as "bare metal" which I find slightly cringeworthy - but hey, whatever language works to communicate an idea.
I'm not sure I've ever heard bare metal used to refer to virtualized instances. (There were debates around Type 1 and Type 2 (hosted) hypervisors at one point, but I haven't heard that come up in years.)
Several years off AWS, the only thing I still prefer AWS for is SES, otherwise Cloudflare has the more cost effective managed services. For everything else we use Hetzner US Cloud VMs for hosting all App Servers and Server Software.
Our .NET Apps are still deployed as Docker Compose Apps which we use GitHub Actions and Kamal [1] to deploy. Most Apps use SQLite + Litestream with real-time replication to R2, but have switched to a local PostgreSQL for our Latest App with regular backups to R2.
Thanks to AI that can walk you through any hurdle and create whatever deployment, backup and automation scripts you need, it's never been easier to self-host.
[1] https://docs.servicestack.net/kamal-deploy
>We're now moving to Talos. We PXE boot with Tinkerbell, image with Talos, manage configs through Flux and Terraform, and run conformance suites before each Kubernetes upgrade.
Gee, how hard is it to find SE experts in that particular combination of ops tools? While in AWS every AWS-certified engineer would speak the same language, the DIY approach surely suffers from the lack of "one way" to do things. Change Flux for Argo, for example (assuming the post is talking about that Flux and not another tool with the same name), and you have an almost completely different gitops workflow. How do they manage to settle on a specific set of tools?
If you're that much of a slave to your tool chain you don't get to call yourself an engineer.
I would not want to hire an engineer who claimed to be proficient with any cloud Kubernetes stack but couldn’t learn Talos in a week.
Argocd and flux are "almost completely different"? The last time I looked was about a year ago, and there seemed to be only minor differences.
What are the major differences?
It's an interesting article, thanks for that.
What people forget about the OVH or Hetzner comparison is that the entry servers they are known for (think the Advance line at OVH or the AX line at Hetzner) come with some drawbacks.
The OVH Advance line for example comes without ECC memory, in a server, that might host databases. It's a disaster waiting to happen. There is no option to add ECC memory with the Advance line, so you have to use Scale or High Grade servers, which are far from "affordable".
Hetzner by default comes with a single PSU and a single uplink. Yes, if nothing happens this is probably fine, but if you need a reliable private network or 10G this will cost extra.
I can't believe how affordable Hetzner is. I just rented a bare metal 48-core AMD EPYC 9454P with 256 GB of RAM and two 2 TB NVMe SSDs for $200/month (or $0.37 per hour). It's hard to compare directly with AWS, but I think it's about 10x cheaper.
I never understood the draw of 'server-grade hardware'. Consumer hardware fails rarely enough that you could 2x your infra and still be paying less.
Their current advance offerings use AMD EPYC 4004 with on-die ECC. I can’t figure out if it’s “real” single correction double detection, or if the data lines between the processor and dimms are protected or not though.
It's only on-die ECC, not real ECC.
Yes, but there are options for dedicated server providers who offer dual PSUs, ECC RAM, etc. It's more expensive though; e.g. a 24-core Epyc with 384GB RAM and dual 10G network is like $500/month (there are smaller servers on serversearcher.com for other examples).
These concerns are exaggerated. I've been running on Hetzner, OVH and friends for 20 years. During that time I've had only two issues, one about 15 years ago when a PSU failed on one of the servers, and another a few years ago when an OVH data center caught fire and one of the servers went down. There have been no other hardware issues. YMMV.
They matter at scale, where 1% issues end up happening on a daily or weekly basis.
For a startup with one rack in each of two data centers, it’s probably fine. You’ll end up testing failover a bit more, but you’ll need that if you scale anyway.
If it’s for some back office thing that will never have any load, and must not permanently fail (eg payroll), maybe just slap it on an EC2 VM and enable off-site backup / ransomware protection.
Is there software that works without ECC RAM? I think most popular databases just assume memory never corrupts.
I'm pretty sure they keep internal checksums at various points to make sure the data on disk is intact - so does the filesystem. I think they can catch when memory corruption occurs and can roll back to a consistent state (you still get some data loss).
But imo, systems like these (like the ones handling bank transactions) should have a degree of resiliency to this kind of failure, as any hw or sw problem can cause something similar.
The article mentions Equinix Metal but if you look it up they are shutting down the service https://docs.equinix.com/metal/hardware/standard-servers
Doesn't make me want to be an Equinix customer when they just randomly shut down critical hosting services.
I'm pretty sure that it's just the post-merger name for Packet which was an incredible provider that even had BYO IP with an anycast community. Really a shame that it went away, it was a solid alternative to both AWS and bare metal and prices were pretty good.
There's a missing middle between ultra expensive/weird cloud and cheap junk servers that I would really love to see get filled.
Fwiw, Equinix Metal was an acquisition (Packet). Seems like it didn't go too well.
I have seen multiple startups paying thousands of dollars a month in AWS bills to run a tiny service which could trivially run on an $800 desktop on a residential internet connection. It's absolutely tragic.
That’s like $24K a year. Assuming they have working failover and business continuity plans, it’s actually a really good deal (vs having a 10-20% time employee deal with it).
AWS doesn't get magically expensive just because you put your website there.
You don't get to an overcomplicated AWS madness without having a few engineers already pushing complexity.
And an overcomplicated setup also means it needs maintenance.
Curious to know how the development experience has been post-migration. Was there additional friction due to a lack of tooling on-prem that would otherwise be available in the cloud environment, for example?
They were running for a long time (months? over a year?) on a single rack in a single datacenter. Eventually they scaled out, but the key word is eventually. I think that summarizes both sides of this debate in a nutshell. You can move off of AWS, but unless you invest a lot you will take on increased risk. Maybe you'll get lucky and your one rack won't burn down. Maybe you won't. They did get lucky.
> Maybe you'll get lucky and your one rack won't burn down
Given the rate of fires in DCs, you'd have to be quite unlucky for it to happen to you.
Hm.. I wonder what the risk of a rack going offline is? Maybe 5% in a given year? Less? More?
Compared to all the other things that can and will go wrong, this risk seems pretty small, but I have no data to back that up.
From the story, they seem to have kept the option to fallback on AWS.
Anycast, Argo Rollouts, Aurora Serverless, AWS, BGP, Ceph, ClickHouse, Cloudflare, CloudFront, DWDM, Flux, Frankfurt, Glacier, Helm, Kinesis, Kubernetes, Metabase, MicroK8s, NVMe, OneUptime, OpenTelemetry Collector, Paris, Postgres, Posthog, PXE, Redis, Step Functions, Supermicro, Talos, Terraform, Tinkerbell, VMs.
I wish you started out by telling me how many customers you have to serve, how many transactions they generate, how much I/O there is.
Right! I can’t believe they decided to ditch the OS entirely and maintained availability like that!
Recently I learned that orgs these days want to show software and infrastructure spend as capex, since they can show it as a depreciating asset for tax purposes.
I understand that with AWS you cannot do that, as it is usually seen as opex.
I guess that's a good enough motivation to move out of AWS at scale.
Reason to use AWS from the article:
> You do not have the appetite to build a platform team comfortable with Kubernetes, Ceph, observability, and incident response.
Has work been using AWS wrong? Other than Ceph, all those things add up to onerous half time jobs for rotating software engineers.
Before gp3 came out, working around EBS price/performance terribleness was also on the list.
Talos is great until it's not. We ran into Ceph IO speed bottlenecks and found it was impossible to debug ("talosctl cgroups --preset=io" is a mess) because the devs didn't want to add an SSH escape hatch into their black-box OS. Our Talos nodes would also randomly become unhealthy with no way of knowing why. We switched to PXE-booted Alpine Linux with vanilla k8s and have had a much more stable experience with no surprises, and the ability to SSH in whenever we want has been hugely helpful.
The thing I find counter intuitive about AWS and hyper-scalers in general is, they make so much sense when you are starting out a new project. A few VMs, some gigs of data storage, you are off to the races in a day or two.
As soon as you start talking about any kind of serious data storage and data transfer the costs start piling up like crazy.
Like in my mind, the cost curve should flatten out over time. But that just doesn't seem to be the reality.
Ok so this may be a dumb question... but how do you handle ISP outages due to storms and the like with on-prem solutions? I'd imagine large datacenters have much more sophisticated and reliable internet connections than, say, an Xfinity business customer, but maybe that's wrong.
Much more sophisticated and reliable than Xfinity.
Good datacenters have redundant and physically separated power and communication from different providers.
Also, in case something catastrophic happens at one datacenter, the author mentions they are peered to another datacenter in a different country as another layer of redundancy. Cloudflare handles their ingress, so such a catastrophic event would likely not be noticed by their customers.
Equinix Metal is now EOL, so worth bearing that in mind..
Thank you for the share, this is really good information for making expensive decisions!
Never heard of Talos before now. That looks pretty cool and I might start playing with that on my home lab. Can't use it at work for reasons, but good to keep on top of tech (even if I am a little behind)
This dude did a complete walkthrough setting up a Talos cluster on bare metal: https://datavirke.dk/posts/bare-metal-kubernetes-part-1-talo... It's a nice read. I have my own Talos cluster running in my homelab now for over a year with similar stuff (but no Ceph).
Quite recently I made a TCO analysis between AWS and bare metal Hetzner including salary. https://beuke.org/hetzner-aws/
Ok but what about a dedicated OVH for example? Those are about 70% cheaper than AWS, so is it still worth it to colo?
Did you read the article? The main point of this and the prior article is that YES, colocation/bare metal IS a better option for this company (and, I would argue, the majority of AWS users).
reference : https://news.ycombinator.com/item?id=38294569
Managed DB costs a lot.
Is there a simple safe setup that we can run on an Ubuntu server?
We self-host the Postgres db with frequent backups to s3 but just in case the site takes off, we need an affordable reliable solution.
Does anyone here run their own db servers? Any advice?
Backups, security, upgrades etc
I love the argument that Managed DBs cost a lot, but they're supposedly safer. Meanwhile people can't figure out the IAM permission models so they give the entire world access with root:root.
If you're running a k8s cluster, check out CloudNativePG. That thing is a beast.
We have hosted everything on a tiny Hetzner. The site barely has any users apart from our friends :) :(
Info noted
Worth checking out the different server hosts. You can get a cheap OVH server with 64GB of RAM, 4-6 cores and 2TB of disk space for $30, or better servers for $70 with 1-2 Gbps bandwidth.
Setting up a DB isn't hard, using an LLM to ask questions will guide you to the right places. I'm always talking with Gemini because I switched from Ubuntu to Fedora 42 server and things are slightly different here and there.
But, different server hosts offer DB-ready OS's so all you have to do is load the OS on the server and you'll be ready to go.
The joy of Linux is getting everything _just right_ and so much _just right_ that you can launch a second server and set it up that way _just right_ within minutes.
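To make the backups part of the question upthread concrete, here's a minimal sketch (bucket and database names are placeholders, and it's only one of many reasonable approaches): dump nightly with pg_dump and push the dump to S3-compatible storage, driven by cron or a systemd timer.

```python
import datetime
import subprocess
import boto3

DB_NAME = "app"                  # placeholder database name
BUCKET = "my-db-backups"         # placeholder S3/R2 bucket

def backup():
    stamp = datetime.datetime.utcnow().strftime("%Y%m%dT%H%M%S")
    dump_path = f"/tmp/{DB_NAME}-{stamp}.dump"
    # Custom-format dump; restore later with pg_restore.
    subprocess.run(
        ["pg_dump", "--format=custom", "--file", dump_path, DB_NAME],
        check=True,
    )
    boto3.client("s3").upload_file(dump_path, BUCKET, f"postgres/{DB_NAME}-{stamp}.dump")

if __name__ == "__main__":
    backup()
```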
This is a tech company and it’s adjacent to their core competency. Most companies wouldn’t know MicroK8s from a brand of cereal, they’d only create a mess if they tried this themselves.
Sure, but they also create a mess in AWS
One thing I can say definitively, as someone who is definitely not an AI zealot (more of an AI pragmatist): GPT language models have lowered the barrier to running your own bare metal server. AWS salesfolk have long used the boogeyman of the costs (opportunity, actual, maintenance) of running your own server as the reason you should pick AWS (not realizing you are trading one set of boogeymen for another), but AI has reduced a lot of that burden.
> AWS is extremely expensive.
I really like how people throw around these baseless accusations.
S3 is one of the cheapest storage solutions ever created. Over the last 10 years I have migrated roughly 10-20PB worth of data to AWS S3 and it resulted in significant cost savings every single time.
If you do not know how to use cloud computing, then yes, AWS can be really expensive.
Assuming those 20PB are hot/warm storage, S3 costs roughly $0.015/GB/month (50:50 average of S3 standard/infrequent access). That comes out to roughly $3.6M/year, before taking into account egress/retrieval costs. Does it really cost that much to maintain your own 20PB storage cluster?
If those 20PB are deep archive, the S3 Glacier bill comes out to around $235k/year, which also seems ludicrous: it does not cost six figures a year to maintain your own tape archive. That's the equivalent of a full-time sysadmin (~$150k/year) plus $100k in hardware amortization/overhead.
The real advantage of S3 here is flexibility and ease-of-use. It's trivial to migrate objects between storage classes, and trivial to get efficient access to any S3 object anywhere in the world. Avoiding the headache of rolling this functionality yourself could well be worth $3.6M/year, but if this flexibility is not necessary, I doubt S3 is cheaper in any sense of the word.
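(Reproducing the arithmetic, treating 20 PB as 20 million GB and using the rough per-GB-month rates cited above:)

```python
gb = 20 * 1_000_000            # 20 PB expressed in GB (decimal)

s3_blended   = 0.015           # $/GB-month, ~50:50 standard / infrequent access
glacier_deep = 0.00099         # $/GB-month, Glacier Deep Archive list price

print(f"Hot/warm storage: ${gb * s3_blended * 12:,.0f}/year")    # ~$3.6M/year
print(f"Deep archive:     ${gb * glacier_deep * 12:,.0f}/year")  # ~$238k/year
```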
Like most of AWS, it depends if you need what it provides. A 20PB tape system will have an initial cost in the low to mid 6 figures for the hardware and initial set of tapes. Do the copies need to be replicated geographically? What about completely offline copies? Reminds me of conversations with archivists where there's preservation and then there's real preservation.
How the heck does anyone have that much data? I once built myself a compressed plaintext library from one of those data-hoarder sources that had almost every fiction book in existence, and that was like 4TB compressed (but would've been much less if I bothered hunting for duplicates and dropped non-English).
I suspect the only way you could have 20PB is if you have metrics you don't aggregate or keep ancient logs (why do you need to know your auth service had a transient timeout a year ago?)
Lots of things can get to that much data, especially in aggregate. Off the top of my head: video/image hosting, scientific applications (genomics, high energy physics, the latter of which can generate PBs of data in a single experiment), finance (granular historic market/order data), etc.
In addition to what others have mentioned, before the "AI bubble", there was a "data science bubble" where every little signal about your users/everything had to be saved so that it could be analyzed later.
The implicit claims are more misleading, in my opinion: The claim that self-hosting is free or nearly free in terms of time and engineering brain drain.
The real cost of self-hosting, in my direct experience with multiple startup teams trying it, is the endless small tasks, decisions, debates, and little changes that add up over time to more overhead than anyone expected. Everyone thinks it's going to be as simple as having the colo put the boxes in the rack and then doing some SSH stuff, and then you're free of those AWS bills. In my experience it's a Pandora's box of "one more thing" changes and overhauls that keep draining the team long after the honeymoon period is over.
If you’re a stable business with engineers sitting idle that could be the right choice. For most startups who just need to get a product out there and get customers, pulling limited headcount away from the core product to save pennies (relatively speaking) on a potential AWS bill can be a trap.
> The claim that self-hosting is free or nearly free in terms of time and engineering brain drain.
Free? No, it's not free. It only costs less engineering time than AWS.
Running EKS on AWS was their problem. If they didn't run EKS on AWS, they would've had a considerably simpler setup running Amazon Linux, not having to upgrade Kubernetes every 3 quarters, managing network security using security groups instead of having open internal networking, and running in a single AZ would've eliminated cross-AZ transfer costs. In large regions like us-east-1, an individual AZ is actually internally striped for extra redundancy, and you are much more likely to experience regional downtime than single-AZ downtime, especially if you have a stable workload and do not rely on tech beyond rock-solid basics (EC2, VPC, ELB, S3, EBS). If you're willing to operate a single bare metal rack in a DC, you should be willing to run in a single AWS AZ.
I don't know how much time they spend configuring/dealing with Kubernetes, but I bet it's a large chunk of the 24 engineer-hours per quarter. But this is not a required expense: "EKS had an extra $1,260/month control-plane fee". Running EKS adds a massive IAM policy maintenance overhead, whereas a non-EKS (EC2 w/ golden AMIs) setup results in drastically simpler IAM policies.
NAT gateways are ~$50 a month, plus data transfer. Setting up a gateway VPC endpoint to S3 will avoid having to pay transfer charges to S3.
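(A sketch of that gateway endpoint with boto3, with placeholder VPC and route table IDs; gateway endpoints carry no hourly or per-GB charge, so S3 traffic stops going through the NAT gateway.)

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

ec2.create_vpc_endpoint(
    VpcEndpointType="Gateway",
    VpcId="vpc-0123456789abcdef0",              # placeholder
    ServiceName="com.amazonaws.us-east-1.s3",
    RouteTableIds=["rtb-0123456789abcdef0"],    # placeholder
)
```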
They were at 90% reservation capacity, so they should be using reservations for greater savings and in fact, running stable workloads with reservations is something that AWS excels at. Reservation means that you will be able to terminate and re-launch instances even when there's a spike in demand from other users--your instance capacity is guaranteed.
Running the basics on VMs also effectively avoids vendor lock-in. Every cloud provider supports VMs with a RedHat clone, VPCs, load balancing, networked storage, access controls, object storage and a fixed size fleet with auto-relaunch on instance failure.
With a consistent workload, they would have very likely escaped the downtime from AWS a week ago as well, because, as per AWS, "existing EC2 instances that had been launched prior to the start of the event remained healthy and did not experience any impact for the duration of the event".
With Terraform and automation for building launchable images, you can stand up a cluster quickly in any region with secure networking, including in a separate AWS account, in the same region, for the sake of testing.
With AWS, you can set up automatic EBS backups of all your data to snapshots trivially, and even send them to a 3rd locked-down account, so they can't be accidentally wiped.
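(One way to sketch that with boto3, with placeholder IDs: snapshot the volume, then share the snapshot with the locked-down backup account, which copies it on its own side; encrypted volumes would additionally need the KMS key shared.)

```python
import boto3

ec2 = boto3.client("ec2")
BACKUP_ACCOUNT = "123456789012"                 # placeholder account ID

snap = ec2.create_snapshot(
    VolumeId="vol-0123456789abcdef0",           # placeholder
    Description="nightly backup",
)
ec2.get_waiter("snapshot_completed").wait(SnapshotIds=[snap["SnapshotId"]])

# Allow the backup account to create volumes from (and copy) the snapshot.
ec2.modify_snapshot_attribute(
    SnapshotId=snap["SnapshotId"],
    Attribute="createVolumePermission",
    OperationType="add",
    UserIds=[BACKUP_ACCOUNT],
)
```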
Bare metal is the best metal.
Never ever. True metal it is!
AWS is extremely expensive, and I think I have to agree with DHH's assessment that many developers are afraid of computers. AWS is taking advantage of that fear of actually just setting up linux and configuring a computer.
However, to steelman AWS use: many businesses are STILL running mainframes. Many run terrible setups like Access as a production database. In 2025 there are large companies with no CI/CD platforms or IaC, and some companies where even VC is still a new concept or a dark art. So not every company is in a position to hire competent system administrators and system engineers to set up some bare metal machines and configure Ceph, much less Hadoop or Kubernetes. So AWS lets these companies just buy these capabilities while forcing the software stack to modernize.
I worked at a company like this, I was an intern with wide eyes seeing the migration to git via bitbucket in the year ... 2018? What a sight to see.
That company had its own data center, tape archives, etc. It had been running largely the same way continuously since the 90s. When I left for a better job, the company had split into two camps. The old curmudgeonly on-prem activists and the over-optimistic cloud native AWS/GCP certified evangelist with no real experience in the cloud (because they worked at a company with no cloud presence). I'm humble enough to admit that I was part of the second camp and I didn't know shit, I was cargo culting.
This migration is still not complete as far as I'm aware. Hopefully the teams that resisted this long and never left for the cloud get to settle in for another decade of on-prem superiority lol.
I was a at a company that was doing their SVN/Jenkins migration to Git/Bitbucket/Bamboo around 2016/2018. But they were using source control and a build system already, so you have to hand it to them. But I have an associate that was at one of the large health insurance companies in 2024, complaining that he couldn't get them to use git and stop deploying via FTP to a server. There is danger with being too much on the cargo cult side, but also danger with being too resistant to change. I don't know how you can look at source control, a CICD pipeline, artifacts, IaC, and say "This looks like a bad idea".
Have they done a complete failover to their second data center? It wasn’t clear how committed of a failover it was during the tests.
Microk8s has common, catastrophic performance bugs. There are also catastrophic problems with microk8s Ceph addons. So is this post true? Microk8s, for people who know stuff, is a canary for clusters / applications that don’t really work.
Source? Links?
We haven't found those bugs in our cluster, but we're also moving to Talos (for different reasons).
There is so much hidden cost in maintaining your own bare metal infrastructure. I am always astounded by how people overlook the massive opportunity cost involved in not only setting up, securing, and maintaining your bare metal infrastructure, but also making it state of the art, following best practices, ensuring required uptime, and monitoring and intervening when necessary. I work in a highly regulated market with 700 coworkers; our IT maintains an endless number of VMs. And you cannot imagine how much more work they have to do compared to a setup where you spin up services in AWS or Azure and destroy them when you don't need them. No updates, no patches. No misconfiguration. Not every company uses automation either (Chef, Ansible and whatnot).
I agree, I have a restaurant POS system and I think self-hosting would easily kill the product velocity, and if we screw up bad, even the company.
However, I do get the point about cost-premium and more importantly vendor-risk that's paid when using managed services.
We are hosted on Cloudflare Workers, which is very cheap, but to mitigate the vendor risk we have also set up replicas of our API servers on bunny.net and render.com.
This is a completely meaningless article if they don't provide information about their technical stack, which AWS services they used to use, what TPS they are hitting, what storage size they're using, etc.
The story will be different for every business because every business has different needs.
Given the answer to "How much did migration and ongoing ops really cost?" it seems like they had an incredibly simple infrastructure on AWS, and it was really easy to move out. If you use a wider range of services, the cost savings are much more likely to cancel themselves out.
TFA begins with a link to the original article with those details.
If you called "We used EKS" details, then yeah they provide those details.
Assuming this is indeed all they used, this was admittedly nonsense, they were essentially using cloud-based bare-metal.
Sounds like they did the right thing for their business model.
I think as AWS grows and changes the curve of the target audience is changing too. The value proposition is "You can get Cloud service without having a dedicated Cloud team," but there are caveats:
- AWS is complicated enough that you will still need a team to integrate against it. The abstractions are not free and the ones that are leaky will bite you without dedicated systems engineers to specialize in making it work with your company's goals.
- For small companies with little compute need, AWS is a good option. Beyond a certain scale... It is worth noting that big companies build their own datacenters, they don't rely on someone else's Cloud. Amazon, Google, and Microsoft don't run on each other.
- Recently, the cost model has likely changed if a company pokes their head up and runs the numbers, there's, uh, quite a few engineers with deep knowledge of how to build a scalable cloud infrastructure available to hire now for some reason. In fact, a savvy company keeping its ear to the ground can probably snap up some high-tier talent very soon (https://www.reuters.com/business/world-at-work/amazon-target...).
It really depends on where your company's risk and cost models are. Running on someone else's cloud just isn't the only option.
I really dislike how this industry oscillates between various states of epiphany that things that are overcomplicated and expensive are overcomplicated and expensive. As an industry, we must look like utter clowns to the world. It's really sad that saying "own or control your own servers" seems to be a sword in the stone moment for far more people than it should. Things that used to be a "duh" are now a "wow" and it's deeply unsettling to watch.
For smaller operations I'd still go with a rent-a-server model with AWS. There is a critical mass, though, where rolling your own makes sense.
The long-term model in the market is shifting much more towards buying services vs renting infrastructure. It's here that the AWS case falls apart, with folks now buying PlanetScale vs RDS, buying Databricks over the mess that AWS offers for data lakes, and working with model providers directly vs the headaches of Bedrock. The real long-term threat is that AWS continues to whiff on all the other stuff and gets reduced to a boring rent-a-server shop that market forces will drive to be very low margin.
Yes a lot of those 3rd party services will run on AWS but the future looks like folks renting servers from AWS at 7% gross margin and selling their value-add service on top at 60% gross margin.
This doesn't really explain why you wouldn't just get a Hetzner server. I don't have much experience with either, but if you know how to set up your infra then Hetzner seems like a no-brainer? I do not want to be tied to AWS where I have no idea what my bill will be.
Depending on the use case you very much could just use Hetzner. A simpler and more transparent customer experience than trying to navigate the massive complexity of AWS for basic stuff.
A bunch has been written about this recently by analysts. That is the "bear" outlook on AWS.
With AI making it possible to use natural language to modify code, bare metal can make things easier to use with your own code and customization. Abstractions tend to be harder to reason about and have more limited functionality in exchange for being easier to get started on some standard setup.