Geico repatriates work from the cloud, continues ambitious infra overhaul

(thestack.technology)

44 points | by us0r 16 hours ago

56 comments

  • geicosreyes 15 hours ago

    I've directly participated in this project and all I have to say is this: the same madness that created a super complex and unmanageable environment in the cloud is now in charge of creating a super easy and manageable environment on premises. The PoC had barely been approved and there was already legacy stuff in the new production environment.

    Geico's IT will slow to a crawl in the coming years due to the immense madness of supporting Kubernetes on top of OpenStack on top of Kubernetes (yes, that's what they are doing).

    • stackskipton 15 hours ago

      This article reads more like an advertisement for the VP spearheading all of this.

    • sofixa 15 hours ago

      > Kubernetes on top of OpenStack on top of Kubernetes (yes, that's what they are doing).

      OpenStack's services are running in Kube? And Kube itself is run as an OpenStack thing? Why? Why not use the same tooling used to deploy that initial Kube to deploy as many as needed? Still a massive maintenance burden, but you don't need to add OpenStack into the mix.

      • mrweasel 15 hours ago

        Because you can't necessarily run everything in Kubernetes, or in the same cluster. OpenStack probably provides VMs, private networks, and a bunch of other stuff to run legacy systems, 3rd party software, Windows applications, tons of stuff that can't be containerized.

        You can have a large Kubernetes cluster running OpenStack, because it's probably the easiest way to deploy and maintain OpenStack. You then build smaller, isolated Kubernetes clusters on top of OpenStack, using VMs.

        It's not as crazy as it sounds, but it does feel a little unnecessarily complex.

        • hamandcheese 14 hours ago

          I get why you might want to use OpenStack.

          And I get why you might want to use OpenStack on Kubernetes.

          What I don't get is why you would want Kubernetes on OpenStack on Kubernetes.

          • mrweasel 13 hours ago

            One reason could be that you use Kubernetes as a deployment tool, but you don't actually need the full capacity of three bare metal servers. So you need to slice up the physical servers in some way, and Kubernetes can't do that.

            From experience, most Kubernetes clusters aren't actually large enough, in terms of capacity required, to justify using an entire modern server, and companies are very reluctant to run a mix of various applications on the same cluster. There are very, very few organisations large enough to need bare metal servers as Kubernetes worker nodes. Unless you use them to run OpenStack.
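
            To make the slicing concrete, here is a minimal sketch using the openstacksdk Python client; the cloud name, image, flavor, and network are hypothetical, not anything from the article:

              # Minimal sketch: carve a large physical server's capacity into
              # right-sized VMs that will become Kubernetes worker nodes.
              # All names (cloud, image, flavor, network) are hypothetical.
              import openstack

              conn = openstack.connect(cloud="onprem")  # reads clouds.yaml

              for i in range(3):
                  conn.create_server(
                      name=f"k8s-worker-{i}",
                      image="ubuntu-22.04",
                      flavor="m1.k8s-worker",  # e.g. a 16 vCPU / 64 GB slice
                      network="k8s-private",
                      wait=True,
                  )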

          • fragmede 14 hours ago

            My money's on Conway's law. There's a hardware team that's in charge of the hardware, and they need to orchestrate all the nodes; then there's the OpenStack team, their customer, which is in charge of providing a cloud-like environment to the rest of the company, including Windows VMs; then there's an applications Kube team that provides Kube for services that run on Kube; and finally Kube-ized application teams run on the very top.

      • derefr 14 hours ago

        From what I've seen in other projects, I think that translates to:

        1. we have a management k8s cluster where we deploy app blueprints

        2. the app blueprints contain, among other things, specifications for VMs to allocate, which get allocated through an OpenStack CRD controller

        3. and those VMs then get provisioned as k8s nodes, forming isolated k8s clusters (probably themselves exposed as resource manifests by the CRD controller on the management cluster);

        4. where those k8s nodes can then have "namespaced" (in the Linux kernel namespaces sense) k8s resource manifests bound to them

        5. which, through another CRD controller on the management cluster and a paired CRD agent controller in the isolated cluster, causes equivalent regular resource manifests to be created in the isolated cluster

        6. ...which can then do whatever arbitrary things k8s resource manifests can do. (After all, these manifests might even include deployments of arbitrary other CRD controllers, for other manifests to rely upon.)

        All said, it's not actually that braindead of an architecture. You might better think of it as "k8s, with OpenStack serving as its 'Container Compute-Cluster Interface' driver for allocating new nodes/node pools for itself" (the same way that k8s has Container Storage Interface drivers.) Except that

        1. there isn't a "Container Compute-Cluster Interface" spec like the CSI spec, so this needs to be done ad-hoc right now; and

        2. k8s doesn't have a good multi-tenant security story — so rather than the k8s nodes created in these VMs being part of the cluster that spawned them, their resources isolated from the management-layer resources at a policy level, instead, the created nodes are formed into their own isolated clusters, with an isolated resource-set, and some kind of out-of-band resource replication and rewriting to allow for "passive" resources in the management cluster that control "active" resources in the sandboxed clusters.
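
        As a purely hypothetical sketch of step 2, using the official kubernetes Python client (the CRD group, kind, and field names are made up for illustration, not a specific real operator):

          # Sketch of step 2: the management cluster requests a VM by
          # creating a custom resource; a controller watching this CRD
          # would call the OpenStack API to boot the VM and later join
          # it to an isolated tenant cluster (step 3).
          from kubernetes import client, config

          config.load_kube_config(context="management-cluster")
          api = client.CustomObjectsApi()

          vm = {
              "apiVersion": "infra.example.com/v1alpha1",  # hypothetical CRD
              "kind": "OpenStackMachine",
              "metadata": {"name": "tenant-a-worker-0", "namespace": "tenant-a"},
              "spec": {"flavor": "m1.k8s-worker", "image": "ubuntu-22.04"},
          }

          api.create_namespaced_custom_object(
              group="infra.example.com",
              version="v1alpha1",
              namespace="tenant-a",
              plural="openstackmachines",
              body=vm,
          )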

    • lowbloodsugar 14 hours ago

      First you charge them to put a star on their belly, and then you can charge them to take the star off their belly!

    • RobRivera 14 hours ago

      All the whey down

      Dios mio mayne

    • JohnMakin 14 hours ago

      Thank you for posting this - reading it set off a lot of alarm bells, and there's a loud, growing "on prem" marketing movement that is likely to trumpet this as the downfall of "cloud", which I wasn't particularly looking forward to arguing with.

  • 0xbadcafebee 15 hours ago

    They had an expensive, fractured, hard to maintain on-prem layout. Then they moved to the cloud. And it turned out the cloud was expensive, fractured, and hard to maintain. So they're moving to on-prem.

    Any bets on what's going to happen next?

    • mmcconnell1618 15 hours ago

      The comment about "running legacy applications in the cloud was not any cheaper" stood out to me. Just moving the same legacy design into the cloud is not the optimal way to gain cost and availability improvements.

      If you have ever seen a data center from Azure, GCP or AWS, you will realize how difficult it will be for any company to compete in the long run. Those companies develop new generations of data center infrastructure with power efficiency improvements every single year. They negotiate network and power contracts at a scale that exceeds any typical Fortune 500 company. I'm skeptical that running your own data center will end up a cost saver in the long run.

      • kkielhofner 14 hours ago

        > They negotiate network and power contracts at a scale that exceeds any typical Fortune 500 company.

        ...and then mark it up. AWS overall has a 38% operating margin[0]. Depending on your application this can hit you really hard (cloud egress bandwidth being an especially obscene offender).

        > I'm skeptical that running your own data center will end up a cost saver in the long run.

        It's not cloud -or- your own Azure-scale datacenter. There are any number of approaches in between, including hybrid setups that offload stuff like CDN, storage, edge services, etc. to cloud, but the fact remains many companies can run their entire business from a few beefy machines in co-location facilities. Most companies, solutions, etc. are not actually Google, Snapchat, Geico, etc. scale and never will be.

        Throw in some minor accounting tricks like leasing (with or without Section 179) and these kinds of "creative" approaches are often impossible to beat from a pricing/performance and even uptime standpoint. That's certainly been my experience.

        [0] - https://www.theinformation.com/articles/why-aws-fat-margins-...
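
        As a rough back-of-envelope illustration: the ~$0.09/GB figure is AWS's published first-tier internet egress rate, but the traffic volume and colo port price below are assumptions, not numbers from the article:

          # Back-of-envelope egress comparison. Illustrative numbers only.
          tb_per_month = 100
          aws_egress_per_gb = 0.09  # ~ first-tier AWS internet egress rate
          aws_cost = tb_per_month * 1000 * aws_egress_per_gb  # $9,000/mo

          colo_port_cost = 1500  # flat-rate 10 Gbps commit; hypothetical price
          print(f"AWS egress: ${aws_cost:,.0f}/mo vs colo port: ${colo_port_cost:,}/mo")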

      • HideousKojima 15 hours ago

        Colocation is always an option

    • wnevets 14 hours ago

      > Any bets on what's going to happen next?

      Someone in the c-suite gets a massive bonus before moving to a new company.

    • miyuru 15 hours ago

      According to the blog they started the cloud migration in 2013; there have been a lot of improvements/changes to on-prem since then.

  • whatever1 16 hours ago

    If you don't have strong seasonality and aren't expecting a significant ramp-up of compute demand (true for startups), why bother with the cloud?

    It is not more secure, I read about downtime events every quarter, and more importantly you have zero control over your costs.

    Your company is likely not Amazon; you will do fine with your on-prem computers.

    • oneplane 15 hours ago

      It's not really about cloud vs. on-prem, it's the fact that people cut corners and lack knowledge on-prem, and don't have the budgets to do anything about it.

      What you're referring to is mostly about elasticity, and it's true that if you don't need it, it doesn't make sense to pay for it.

      But that doesn't mean that on-prem (which almost always turns into a virtual machine shitshow with crappy network design -- which will continue as long as nobody implements things like strong IAM and Security Groups in their on-prem setups) is 'the same' as cloud but just in a physical location you control.

      The inverse is also true. If you just run some VMs 'in the cloud', you're doing it wrong. Playing datacenter is just as bad as not moving away from classic virtual machines, cloud or no cloud.

      • whatever1 15 hours ago

        So when they are setting up config files for the cloud, they don't cut corners? It is an insane amount of work to follow safe practices when configuring your cloud.

        I don't see that much difference compared to doing actual admin tasks.

        • oneplane 15 hours ago

          The entire underlying layer of possible misconfigurations is absent in the cloud. Yes, the services on top of that can still be misconfigured, but you don't get access to hosts, SANs, switches, firewalls, or gateways; there is nothing there for you to mess up. The shared responsibility model also allows you to pick even more robust options.

          But even if you were to stick to something simple, say, object storage: a bucket or blob store has no SAN config, no webserver config, no switches, no gateways, no RAID controllers, no striping, mirroring, or parity configuration, no firmware, no BIOS, no BMC, no OS. None of that. It's all eliminated. All that remains is the top layer, where you configure your cost-to-resilience ratio and your access policy. And yes, you could cut corners, but those are orders of magnitude fewer corners than if you include all the stuff below.

          Add to that: almost all of it has good APIs that are well defined, well supported and have an ecosystem to go with it. Try finding anything like that for a crappy NetApp or EMC appliance you find in a datacenter. It either doesn't exist, or it's so bad you might as well run MinIO or a bloody NFS share (not actual object storage) yourself.

          Being bad at cloud is definitely more expensive than being bad at on-prem, I'll give you that. But with cloud, at least you get a bill that you can use to show your peers and higher ups that being bad has a cost. Internal virtual/amortised dollars are much harder to allocate to incompetence. It's often completely ignored, and at best revisited at periodic capacity planning reviews with few to no consequences.

          The only place on-prem still has is locality requirements. That includes latency-sensitive things where sub-1ms is a goal, and air-gapped things. But even in the first case, things like AWS Outposts exist, and those are cheaper than doing it yourself (not much, but enough to save on the hardware and on 2 FTEs).

          • whatever1 15 hours ago

            My friend, some of the biggest data leaks happened because of misconfigured S3 buckets, which is literally one line of code to get right.

            Cloud is not an insurance against incompetence.
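
            For reference, that "one line" (plus boilerplate) as a boto3 sketch; the bucket name is hypothetical:

              # Sketch: block all public access on an S3 bucket with boto3.
              # The bucket name is hypothetical.
              import boto3

              s3 = boto3.client("s3")
              s3.put_public_access_block(
                  Bucket="example-customer-data",
                  PublicAccessBlockConfiguration={
                      "BlockPublicAcls": True,
                      "IgnorePublicAcls": True,
                      "BlockPublicPolicy": True,
                      "RestrictPublicBuckets": True,
                  },
              )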

            • oneplane 15 hours ago

              I didn't say there were no leaks or that there is no incompetence. I wrote about the number of corners that are no longer available to be cut. Corner cutting isn't exclusive to data leaks. It impacts everything, mostly the people actually working on the stuff.

              Taking away responsibility from the people or departments that clearly can't handle it, that is what this means.

              It does not mean that the responsibility that remains no longer ends up with incompetent actors. It just means it is now smaller, and smaller to a degree where it is very much worth it in most cases.

              And just like I wrote earlier, there are cases where that works the other way around as well, and that just reinforces my point.

          • jjav 11 hours ago

            > The entire underlying layer of possible misconfigurations is absent in the cloud.

            This is true.

            Let's not forget there is a whole new, quite different layer of potential (and easy) misconfigurations that exists only in the cloud, so it balances out.

            When you can accidentally expose services with a single mouse click where it used to take someone with access to the server room going in and grabbing a cable and wiring it wrong, this category of problem is a lot more common now.

      • mrweasel 14 hours ago

        That's really what some/most companies want, a platform that can run cheap, fast and easy VMs, like on-prem, but without the hassle of having to deal with the hardware and physical network part, like in the cloud. Sadly that's not the choice being offered.

        I don't know, I've seen the shittiest stuff built on-prem and in cloud, and I've seen completely amazing on-prem infrastructure and cloud stuff that could not possibly be built outside AWS.

    • bluGill 15 hours ago

      If your data center isn't large enough to need at least 5 full-time admins then you should just go cloud. With a part-time person you will see downtime when a machine fails. With 1 person, that person will sometimes be on vacation when a zero day takes you down. With 2 people, 1 will be on vacation when the second gets sick. You end up needing at least 5 people before you have enough redundancy for human issues and the ability to train people in whatever is needed next.

      Of course even in the cloud you still need to apply security patches to everything. However it still saves a lot of issues and thus money in all but the largest setups.

      • munk-a 15 hours ago

        Additionally, as someone who has been part of the interview process for IT people: if you only have two people and you're not an expert yourself, there's a non-negligible chance that neither of the two people you've got is particularly good at their job. I'd advise any company to just accept the premium cost of using cloud services rather than risk getting ransomwared or what-have-you and finding out nobody ever actually tested the backups.

        The costs of getting things wrong with on-prem aren't high on average - but they sure are spiky if you get unlucky.

      • x0x0 15 hours ago

        > With a part time person you will see downtime when a machine fails.

        Many data centers offer remote hands services. And I don't believe this is at all true.

        I worked at a place that managed thousands of boxes in dozens of POPs with 1.5 full-time people. If you design it for this from the beginning, with cattle not pets and netboot everywhere, this is very doable. And a large cost savings vs cloud.

        • bluGill 14 hours ago

          The assertion was about bringing this on-prem, so you don't get that offer of remote hands service. A data center instead of on-prem is a valid option and might be best - check the contract and the services they provide for you carefully.

      • kkielhofner 15 hours ago

        > With a part time person you will see downtime when a machine fails

        If a hardware failure causes downtime you're doing it wrong. Additionally, big cloud scaring people away from hardware with marketing and FUD has been very effective. Modern hardware is insanely reliable and performant - I don't think I've seen a datacenter/enterprise NVMe drive fail yet. It's not 2005, with spinning disks and power supplies blowing up left and right, anymore.

        > With 1 person that person will sometimes be on vacation when a zero day takes you down. With 2 people 1 will be on vacation when the second gets sick. You end up needing at least 5 people before you have enough people that you have redundancy for humans issues and the ability to train people in whatever is the latest needed.

        Hardware vendors (Dell, etc) have highly-discounted warranty services. In the event of a hardware failure you open a ticket and they dispatch someone directly to the facility (often within hours by SLA) and it gets handled.

        Same thing for shipping HW directly to co-lo and they rack/cable/bootstrap for a nominal fee, remote hands for weird edge-cases, etc.

        A lot of takes here and elsewhere seem to be either big-cloud or Meta-level datacenter. I have operated POPs in a dozen co-location ("datacenter") facilities (a cabinet or two each) that no one on staff ever set foot in, with hardware we owned (and/or financed) that no one ever saw or touched. We operated this with two people looking after it as part of their broader roles and responsibilities, and frankly they didn't have much to do.

        There is an entire industry that provides any number of highly flexible and cost-effective approaches for everything in between.

        • stackskipton 14 hours ago

          To me, the downside of on-premises hardware isn't hardware swap-out, it's dealing with hardware in general. All hardware needs updates, which means downtime for that hardware. Also, anyone in this industry long enough has been around for "Oh, we will just replace that broken piece of hardware" that ended up "WHY IS EVERYTHING ON FIRE?" because versions didn't match up, hardware was rejected, or just plain "Actually, THAT failure mode isn't redundant."

          That can happen to Public Cloud as well, but since they work with hardware at much, much larger scale and, most of the time, build the actual hardware and software, they are much more aware of the sharp edges.

          Finally, with the Broadcom acquisition, what virtualization software are you using, and is it really cheaper than the cloud?

          • kkielhofner 14 hours ago

            > Also, anyone in this industry long enough has been around for "Oh, we will just replace that broken piece of hardware" that ended up "WHY IS EVERYTHING ON FIRE?" because versions didn't match up, hardware was rejected

            I've been doing this for 25 years and I'm not sure what this means. Dell isn't going to come back to you and say "sorry but we can't fix this". With the warranty SLA, worst case scenario they'll just replace the entire machine if they have to, although I don't remember ever seeing it come to that.

            > just plain "Actually, THAT failure mode isn't redundant."

            When it comes down to it similar issues exist with clouds - regions, availability zones, etc. Big clouds have had multiple widespread outages just this year[0].

            From that reference you can see that MS and Amazon themselves struggle to design, build, and run solutions for their own products in their own clouds.

            It's always interesting to see marquee household name companies/products/solutions go down when US-East (or whatever) is having a bad day again.

            Cloud can be a lot of things but a silver bullet for reliability and uptime isn't one of them.

            [0] - https://www.forbes.com/sites/emilsayegh/2024/07/31/microsoft...

            • stackskipton 13 hours ago

              >I've been doing this for 25 years and I'm not sure what this means. Dell isn't going to come back to you and say "sorry but we can't fix this".

              Dell/EMC says "Hey, here is the drive replacement." We do it; 2 hours later, the volume is knocked offline. Apparently, there was a mismatch between the backplane version and the drive version, and through some weird edge case, it knocked the volume offline. Yes, they fixed it; no, it wasn't pretty, since a bunch of applications had to be recovered.

              No, public clouds are not 100% reliable either. It's just that their failures tend to be you twiddling your thumbs vs hair on fire on the phone with the vendor trying to get it resolved.

              • kkielhofner 10 hours ago

                > Dell/EMC says "Hey, here is drive replacement." We do it, 2 hours later, the volume is knocked offline. Apparently, there was mismatch between backplane version, drive version and through some weird edge case, it knocked the volume offline. Yes, they fixed it, no it wasn't pretty since a bunch of applications had to be recovered.

                Anecdotal (as is my position). I can theoretically understand this happening, but not only have I never seen it, such an issue would need to be escalated. That's a "this is unacceptable" high-level phone call - and a call someone in actual authority will more than likely answer, because IME unless you have SERIOUS spend with big cloud you'll be lucky to make it a rung or two up sales/support.

                Plus backups and redundancies that should prevent even the failure of a chassis/storage/etc from being a significant critical issue.

                > their failures tend to be you twiddling your thumbs vs hair on fire on phone with the vendor trying to get it resolved

                As a Founder/CTO I have the opposite take - put me and my team in a position to /do something/ vs sitting around waiting for AWS to come back whenever it decides to, while they obscure comms, don't update the fake status dashboards, etc. Meanwhile you're telling your customer "Ummm, we don't know - Amazon has a problem. When it comes back I guess it's back".

                Coming from a background of telecom, healthcare, and nuclear energy I can't believe that even flies.

    • 0xbadcafebee 8 hours ago

      Our company is literally in the 3rd week of waiting for a colo to install some new RAM modules in a server. Before that we waited two weeks to get a new server ordered, delivered, and racked. Before that we had to wait a week for them to tell us whether there were available power and network ports for the new server.

      That server is the main database. And yes, there is a backup server, but for reasons, the backup server isn't working as expected. So if that main server's RAM fails for good, there goes our product, for god knows how long, considering how long it's taken so far to get a second one set up.

      You don't have to deal with any of that shit in the cloud. None. You just spin up a new server in 2 seconds. You don't deal with shitty hardware, or the differences between old and new hardware (besides cpu arch, and some special classes), or incompatibilities, or running out of space, or getting smart hands in your rack, or a million other things.
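
      (For contrast, "spin up a new server" in the cloud really is one API call. A hedged boto3 sketch; the AMI ID and instance type are placeholders:)

        # Sketch: "spin up a new server" as a single API call with boto3.
        # The AMI ID and instance type are placeholders.
        import boto3

        ec2 = boto3.client("ec2", region_name="us-east-1")
        resp = ec2.run_instances(
            ImageId="ami-0123456789abcdef0",  # placeholder AMI
            InstanceType="r6i.4xlarge",
            MinCount=1,
            MaxCount=1,
        )
        print(resp["Instances"][0]["InstanceId"])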

      And that's just the hardware side. The software side of the cloud is the one million unique hosted services they offer that you can just start using immediately. No server set-up, no configuration management, it already has security baked in, it's already integrated with the other million services, etc. You just start using it, immediately, and it just works. It saves you time, complexity, maintenance, and it gives you reliability, compatibility, flexibility, and allows you to ship something earlier.

      I have managed servers on-prem for years, for tiny startups and huge companies, both two decades ago and two years ago. Without a doubt, I would always suggest any kind of hosted, cloud-style vendor over on-prem. Only if somebody truly needs to be on-prem, or is literally a teenager with no money at all and all the time in the world to waste DIYing, would I tell them to go on-prem.

    • milesward 16 hours ago

      Find me a list of customers on cloud who got hacked, vs folks on-prem. I've got 3k+ customers, I know which one I see 99.99% of the time...

      • whatever1 15 hours ago

        I guess you don't count misconfigurations. But deciding between cloud and local is a choice between config and admin.

    • alexjplant 14 hours ago

      Disclaimer: this is anecdotal so n=1. All opinions are my own. No value judgment one way or another is expressed or implied.

      Professional developers these days are primarily concerned with 1) getting their service running 2) as quickly as possible 3) someplace where they have instant access and control of it. Clicking around a cloud console accomplishes all three of these and allows you to write "Delivered the ____ service in 3 months that generates $XX M/year" on a performance review in short order. Having to build, rack, and configure a physical server or deal with "IT" (which has somehow become something separate from software engineering) does not. Because the developers are the ones delivering value they get to decide how it's done. AWS gets it done. A server in a datacenter in Texas that requires an SSH keypair to reach doesn't.

      Your average SDE L4 doesn't know or care about init systems or SANs or colos or 802.1q or any of the myriad things required to run on-prem infra. They write software. Software makes money and so the business makes money - wash, rinse, repeat. Why would you have people on the front lines of your revenue stream worrying about these things when you can have a hyperscaler with a control plane do it for a nominal fee?

      • whatever1 13 hours ago

        If the hyperscaler asks for 200% of my revenue then yes.

        • alexjplant 13 hours ago

          But they don't. They ask for a deterministic usage-based amount.

    • weitendorf 14 hours ago

      Because you're not Amazon, you also probably don't have tech as your core competency, and you don't have the budget to hire people skilled enough to operate your on-prem setup as well as Amazon operates their cloud.

      Because you're not a startup, there is a very good chance that you have a very process-driven (cover-your-ass), slow-moving culture. This very often translates to an IT department where getting even basic things done (like reserving extra compute, changing a network setting, or starting to use a third-party software) takes months of waiting or pleading. Maybe you have never encountered this kind of pathological IT department, but they're very common, and it's a major reason executives bought into cloud to begin with. Of course, many companies like Geico seem to have merely replicated their IT pathologies in the cloud, but at least in the cloud you have fewer sources of problems in areas like physical space management, buying/integrating hardware to grow or change your footprint (and dealing with all the SKUs and supply chain problems therein), or negotiating on-prem licences.

      There are many more moving pieces when operating on-prem: more operations staff across more kinds of roles (yes, you still have e.g. devops people when using the cloud, but you don't need as many building-operations staff (where managing a datacenter is its own specialty), people managing hardware/software vendors and the related supply chain, people skilled in physical networking, or people to plug things in and out and physically operate the machines); managing and acquiring the physical space where your on-prem setup lives; buying and accounting for all the different kinds of hardware you need; licensing and integrating more software to achieve functionality equivalent to e.g. EC2; and licensing all your 3P software to run on-prem. Even if nominally less expensive than the cloud in some cases, there are many more places where things can go wrong. That's not easy to account for in a direct TCO comparison because it manifests as slowing things down - which does introduce very substantial costs - and as distracting management from other opportunities to grow revenue or improve costs.

      Also, cloud downtime is really overstated as a problem in 2024. It makes the news because it has a high blast radius and involves high-profile companies, not because it's more common than on-prem. With the exception of AWS us-east-1 issues (which can break many AWS products at once across the world), most cloud reliability issues these days are isolated to only a few products and only a few regions. I think a lot of small on-prem companies don't realize that they are not actually more reliable; they just operate at a smaller scale where "lucky streaks" of uptime are more common (i.e. if you play roulette for three rounds, you're much more likely to have an abnormally high win rate than someone who plays it for three hundred rounds, even though you both have the same odds). Most companies don't have security/risk operations as mature as cloud providers', and so face an existential risk: the possibility of months of downtime in the event of a fire or natural disaster at their DC, a cryptolocker attack, or a janitor unplugging the server that says "do not unplug". This isn't something people have to worry about with cloud providers to nearly the same extent.
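
      The "lucky streak" effect is easy to see numerically. A quick sketch; the 1%-per-month incident rate is an arbitrary assumption:

        # Chance of observing *zero* incidents over a window, assuming an
        # independent 1%-per-month incident rate (arbitrary assumption).
        p_incident = 0.01

        for months in (3, 36, 300):
            p_clean = (1 - p_incident) ** months
            print(f"{months:>3} months: {p_clean:.1%} chance of a 'perfect' record")
        # ->  3 months: 97.0%, 36 months: 69.6%, 300 months: 4.9%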

    • VirusNewbie 15 hours ago

      > expecting a significant ramp up of compute demand

      Lots of data processing workloads don't need to be run constantly, but do need to be run in a shorter amount of time. Cloud is pretty good for that sort of thing.

  • beaviskhan 15 hours ago

    A company with the size and financial resources of Geico ought to be able to handle on-prem just fine. I am a huge public cloud fan, but it is definitely not a great (or even good) fit for everyone.

  • jnwatson 15 hours ago

    Cloud provides the CIO the same opportunities for advancement that COOs have had for years.

    Staff costs too high? Outsource. Opex too high? Insource.

    You can spend a career jumping among companies swinging the pendulum back and forth.

  • gtirloni 15 hours ago

    I'd gladly pay 2.5x more to not use OpenStack ever again.

  • mullingitover 15 hours ago

    I feel like even in Geico's case, once they've paid salaries for everyone who's going to need to maintain this infra they're bringing in-house, they're probably not saving that much. Then again, maybe they were already paying those salaries redundantly on top of all the services they were spending on, e.g. managed databases.

  • hnburnsy 15 hours ago

    Is building things cloud-provider agnostic a thing? Is building things cloud-or-on-prem agnostic a thing?

  • delusional 14 hours ago

    What a shame that the most interesting thing we can discuss about software now is where the computer it's running on is located.

    I must admit, the computer was never the part of software that interested me.

    • chronid 13 hours ago

      Even software (at least outside academia) eventually has to fight physics, and the thing with the most gravity of all: money.

  • stonethrowaway 14 hours ago

    > In an interview with The Stack she confirmed the shift, saying “we have a lot of data – and it turns out that storage in the cloud is one of the most expensive things you can do in the cloud, followed by AI in the cloud…”

    This has been the story for 20 years now. Not even exaggerating. We all knew it was expensive from the get-go because we all did things on prem.
