Geico repatriates work from the cloud, continues ambitious infra overhaul

(thestack.technology)

44 points | by us0r 16 hours ago

56 comments

  • geicosreyes 15 hours ago

    I've directly participated in this project and all I have to say is this: the same madness that created a super complex and unmanageable environment in the cloud is now in charge of creating a super easy and manageable environment on premises. The PoC had barely been approved and there was already legacy stuff in the new production environment.

    Geico's IT will slow to a crawl in the coming years due to the immense madness of supporting Kubernetes on top of OpenStack on top of Kubernetes (yes, that's what they are doing).

    • stackskipton 15 hours ago

      This article reads more like an advertisement for the VP spearheading all of this.

    • sofixa 15 hours ago

      > Kubernetes on top of OpenStack on top of Kubernetes (yes, that's what they are doing).

      OpenStack's services are running in Kube? And Kube itself is run as an OpenStack thing? Why? Why not use the same tooling used to deploy that initial Kube to deploy as many as needed? Still a massive maintenance burden, but you don't need to add OpenStack into the mix.

      • mrweasel 15 hours ago

        Because you can't necessarily run everything in Kubernetes, or in the same cluster. OpenStack probably provides VMs, private networks, and a bunch of other stuff to run legacy systems, 3rd party software, Windows applications, tons of stuff that can't be containerized.

        You can have a large Kubernetes cluster running OpenStack, because it's probably the easiest way to deploy and maintain OpenStack. You then build smaller, isolated Kubernetes clusters on top of OpenStack, using VMs.

        It's not as crazy as it sounds, but it does feel a little unnecessarily complex.

        • hamandcheese 14 hours ago

          I get why you might want to use OpenStack.

          And I get why you might want to use OpenStack on Kubernetes.

          What I don't get is why you would want Kubernetes on OpenStack on Kubernetes.

          • mrweasel 13 hours ago

            One reason could be that you use Kubernetes as a deployment tool, but you don't actually need the full capacity of three bare metal servers. So you need to slice up the physical servers in some way, and Kubernetes can't do that.

            From experience, most Kubernetes clusters aren't actually large enough, in terms of capacity required, to justify using an entire modern server, and companies are very reluctant to run a mix of various applications on the same cluster. There are very, very few organisations large enough to need bare metal servers as Kubernetes worker nodes. Unless you use them to run OpenStack.
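
            To make the slicing concrete, here is a minimal sketch using the openstacksdk Python client; the cloud name, image, flavor, and network are hypothetical, not anything from the article:

              # Minimal sketch: carve a large physical server's capacity into
              # right-sized VMs that will become Kubernetes worker nodes.
              # All names (cloud, image, flavor, network) are hypothetical.
              import openstack

              conn = openstack.connect(cloud="onprem")  # reads clouds.yaml

              for i in range(3):
                  conn.create_server(
                      name=f"k8s-worker-{i}",
                      image="ubuntu-22.04",
                      flavor="m1.k8s-worker",  # e.g. a 16 vCPU / 64 GB slice
                      network="k8s-private",
                      wait=True,
                  )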

          • fragmede 14 hours ago

            My money's on Conway's law. There's a hardware team that's in charge of the hardware, and they need to orchestrate all the nodes; then there's the OpenStack team, their customer, which is in charge of providing a cloud-like environment to the rest of the company, including Windows VMs; then there's an applications Kube team that provides Kube for services that run on Kube; and finally Kube-ized application teams run on the very top.

      • derefr 14 hours ago

        From what I've seen in other projects, I think that translates to:

        1. we have a management k8s cluster where we deploy app blueprints

        2. the app blueprints contain, among other things, specifications for VMs to allocate, which get allocated through an OpenStack CRD controller

        3. and those VMs then get provisioned as k8s nodes, forming isolated k8s clusters (probably themselves exposed as resource manifests by the CRD controller on the management cluster);

        4. where those k8s nodes can then have "namespaced" (in the Linux kernel namespaces sense) k8s resource manifests bound to them

        5. which, through another CRD controller on the management cluster and a paired CRD agent controller in the isolated cluster, causes equivalent regular resource manifests to be created in the isolated cluster

        6. ...which can then do whatever arbitrary things k8s resource manifests can do. (After all, these manifests might even include deployments of arbitrary other CRD controllers, for other manifests to rely upon.)

        All said, it's not actually that braindead of an architecture. You might better think of it as "k8s, with OpenStack serving as its 'Container Compute-Cluster Interface' driver for allocating new nodes/node pools for itself" (the same way that k8s has Container Storage Interface drivers.) Except that

        1. there isn't a "Container Compute-Cluster Interface" spec like the CSI spec, so this needs to be done ad-hoc right now; and

        2. k8s doesn't have a good multi-tenant security story — so rather than the k8s nodes created in these VMs being part of the cluster that spawned them, their resources isolated from the management-layer resources at a policy level, instead, the created nodes are formed into their own isolated clusters, with an isolated resource-set, and some kind of out-of-band resource replication and rewriting to allow for "passive" resources in the management cluster that control "active" resources in the sandboxed clusters.
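
        As a purely hypothetical sketch of step 2, using the official kubernetes Python client (the CRD group, kind, and field names are made up for illustration, not a specific real operator):

          # Sketch of step 2: the management cluster requests a VM by
          # creating a custom resource; a controller watching this CRD
          # would call the OpenStack API to boot the VM and later join
          # it to an isolated tenant cluster (step 3).
          from kubernetes import client, config

          config.load_kube_config(context="management-cluster")
          api = client.CustomObjectsApi()

          vm = {
              "apiVersion": "infra.example.com/v1alpha1",  # hypothetical CRD
              "kind": "OpenStackMachine",
              "metadata": {"name": "tenant-a-worker-0", "namespace": "tenant-a"},
              "spec": {"flavor": "m1.k8s-worker", "image": "ubuntu-22.04"},
          }

          api.create_namespaced_custom_object(
              group="infra.example.com",
              version="v1alpha1",
              namespace="tenant-a",
              plural="openstackmachines",
              body=vm,
          )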

    • lowbloodsugar 14 hours ago

      First you charge them to put a star on their belly, and then you can charge them to take the star off their belly!

    • RobRivera 14 hours ago

      All the whey down

      Dios mio mayne

    • JohnMakin 14 hours ago

      Thank you for posting this - reading it set off a lot of alarm bells, and there's a loud, growing "on prem" marketing movement that is likely to trumpet this as the downfall of "cloud", which I wasn't particularly looking forward to arguing with.

  • 0xbadcafebee 15 hours ago

    They had an expensive, fractured, hard to maintain on-prem layout. Then they moved to the cloud. And it turned out the cloud was expensive, fractured, and hard to maintain. So they're moving to on-prem.

    Any bets on what's going to happen next?

    • mmcconnell1618 15 hours ago

      The comment about "running legacy applications in the cloud was not any cheaper" stood out to me. Just moving the same legacy design into the cloud is not the optimal way to gain cost and availability improvements.

      If you have ever seen a data center from Azure, GCP or AWS, you will realize how difficult it will be for any company to compete in the long run. Those companies develop new generations of data center infrastructure with power efficiency improvements every single year. They negotiate network and power contracts at a scale that exceeds any typical Fortune 500 company. I'm skeptical that running your own data center will end up a cost saver in the long run.

      • kkielhofner 14 hours ago

        > They negotiate network and power contracts at a scale that exceeds any typical Fortune 500 company.

        ...and then mark it up. AWS overall has a 38% operating margin[0]. Depending on your application this can hit you really hard (cloud egress bandwidth being an especially obscene offender).

        > I'm skeptical that running your own data center will end up a cost saver in the long run.

        It's not cloud -or- your own Azure-scale datacenter. There are any number of approaches in between, including hybrid setups that offload stuff like CDN, storage, edge services, etc. to cloud, but the fact remains many companies can run their entire business from a few beefy machines in co-location facilities. Most companies, solutions, etc. are not actually Google, Snapchat, Geico, etc. scale and never will be.

        Throw in some minor accounting tricks like leasing (with or without Section 179) and these kinds of "creative" approaches are often impossible to beat from a pricing/performance and even uptime standpoint. That's certainly been my experience.

        [0] - https://www.theinformation.com/articles/why-aws-fat-margins-...
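
        As a rough back-of-envelope illustration: the ~$0.09/GB figure is AWS's published first-tier internet egress rate, but the traffic volume and colo port price below are assumptions, not numbers from the article:

          # Back-of-envelope egress comparison. Illustrative numbers only.
          tb_per_month = 100
          aws_egress_per_gb = 0.09  # ~ first-tier AWS internet egress rate
          aws_cost = tb_per_month * 1000 * aws_egress_per_gb  # $9,000/mo

          colo_port_cost = 1500  # flat-rate 10 Gbps commit; hypothetical price
          print(f"AWS egress: ${aws_cost:,.0f}/mo vs colo port: ${colo_port_cost:,}/mo")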

      • HideousKojima 15 hours ago

        Colocation is always an option

    • wnevets 14 hours ago

      > Any bets on what's going to happen next?

      Someone in the c-suite gets a massive bonus before moving to a new company.

    • miyuru 15 hours ago

      According to the blog they started the cloud migration in 2013; there have been a lot of improvements/changes to on-prem since then.

  • whatever1 16 hours ago

    If you don't have strong seasonality and aren't expecting a significant ramp-up of compute demand (true for startups), why bother with the cloud?

    It is not more secure, I read about downtime events every quarter, and more importantly you have zero control over your costs.

    Your company is likely not Amazon; you will do fine with your on-prem computers.

    • oneplane 15 hours ago

      It's not really about cloud vs. on-prem, it's the fact that people cut corners and lack knowledge on-prem, and don't have the budgets to do anything about it.

      What you're referring to is mostly about elasticity, and it's true that if you don't need it, it doesn't make sense to pay for it.

      But that doesn't mean that on-prem (which almost always turns into a virtual machine shitshow with crappy network design -- which will continue as long as nobody implements things like strong IAM and Security Groups in their on-prem setups) is 'the same' as cloud but just in a physical location you control.

      The inverse is also true. If you just run some VMs 'in the cloud', you're doing it wrong. Playing datacenter is just as bad as not moving away from classic virtual machines, cloud or no cloud.

      • whatever1 15 hours ago

        So when they are setting up config files for the cloud, they don't cut corners? It is an insane amount of work to follow safe practices when configuring your cloud.

        I don't see that much difference compared to doing actual admin tasks.

        • oneplane 15 hours ago

          The entire underlying layer of possible misconfigurations is absent in the cloud. Yes, the services on top of that can still be misconfigured, but you don't get access to hosts, SANs, switches, firewalls, or gateways; there is nothing there for you to mess up. The shared responsibility model also allows you to pick even more robust options.

          But even if you were to stick to something simple, say, object storage: a bucket or blob store has no SAN config, no webserver config, no switches, no gateways, no RAID controllers, no striping, mirroring, or parity configuration, no firmware, no BIOS, no BMC, no OS. None of that. It's all eliminated. All that remains is the top layer, where you configure your cost-to-resilience ratio and your access policy. And yes, you could cut corners, but those are orders of magnitude fewer corners than if you include all the stuff below.

          Add to that: almost all of it has good APIs that are well defined, well supported and have an ecosystem to go with it. Try finding anything like that for a crappy NetApp or EMC appliance you find in a datacenter. It either doesn't exist, or it's so bad you might as well run MinIO or a bloody NFS share (not actual object storage) yourself.

          Being bad at cloud is definitely more expensive than being bad at on-prem, I'll give you that. But with cloud, at least you get a bill that you can use to show your peers and higher ups that being bad has a cost. Internal virtual/amortised dollars are much harder to allocate to incompetence. It's often completely ignored, and at best revisited at periodic capacity planning reviews with few to no consequences.

          The only place on-prem still has is locality requirements. That includes latency-sensitive things where sub-1ms is a goal, and air-gapped things. But even in the first case, things like AWS Outposts exist, and those are cheaper than doing it yourself (not much, but enough to save on the hardware and on 2 FTEs).

          • whatever1 15 hours ago

            My friend, some of the biggest data leaks happened because of misconfigured S3 buckets, which is literally one line of code to get right.

            Cloud is not an insurance against incompetence.
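
            For reference, that "one line" (plus boilerplate) as a boto3 sketch; the bucket name is hypothetical:

              # Sketch: block all public access on an S3 bucket with boto3.
              # The bucket name is hypothetical.
              import boto3

              s3 = boto3.client("s3")
              s3.put_public_access_block(
                  Bucket="example-customer-data",
                  PublicAccessBlockConfiguration={
                      "BlockPublicAcls": True,
                      "IgnorePublicAcls": True,
                      "BlockPublicPolicy": True,
                      "RestrictPublicBuckets": True,
                  },
              )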

            • oneplane 15 hours ago

              I didn't say there were no leaks or that there is no incompetence. I wrote about the number of corners that are no longer available to be cut. Corner cutting isn't exclusive to data leaks. It impacts everything, mostly the people actually working on the stuff.

              Taking away responsibility from the people or departments that clearly can't handle it, that is what this means.

              It does not mean that the responsibility that remains no longer ends up with incompetent actors. It just means it is now smaller, and smaller to a degree where it is very much worth it in most cases.

              And just like I wrote earlier, there are cases where that works the other way around as well, and that just reinforces my point.

          • jjav 11 hours ago

            > The entire underlying layer of possible misconfigurations is absent in the cloud.

            This is true.

            Let's not forget there is a whole new, quite different layer of potential (and easy) misconfigurations that exists only in the cloud, so it balances out.

            When you can accidentally expose services with a single mouse click where it used to take someone with access to the server room going in and grabbing a cable and wiring it wrong, this category of problem is a lot more common now.

      • mrweasel 14 hours ago

        That's really what some/most companies want, a platform that can run cheap, fast and easy VMs, like on-prem, but without the hassle of having to deal with the hardware and physical network part, like in the cloud. Sadly that's not the choice being offered.

        I don't know, I've seen the shittiest stuff built on-prem and in cloud, and I've seen completely amazing on-prem infrastructure and cloud stuff that could not possibly be built outside AWS.

    • bluGill 15 hours ago

      If your data center isn't large enough to need at least 5 full-time admins then you should just go cloud. With a part-time person you will see downtime when a machine fails. With 1 person, that person will sometimes be on vacation when a zero day takes you down. With 2 people, 1 will be on vacation when the second gets sick. You end up needing at least 5 people before you have enough redundancy for human issues and the ability to train people in whatever is needed next.

      Of course even in the cloud you still need to apply security patches to everything. However it still saves a lot of issues and thus money in all but the largest setups.

      • munk-a 15 hours ago

        Additionally, as someone who has been part of the interview process for IT people: if you only have two people and you're not an expert yourself, there's a non-negligible chance that neither of the two people you've got is particularly good at their job. I'd advise any company to just accept the premium cost of using cloud services rather than risk getting ransomwared or what-have-you and finding out nobody ever actually tested the backups.

        The costs of getting things wrong with on-prem aren't high on average - but they sure are spiky if you get unlucky.

      • x0x0 15 hours ago

        > With a part time person you will see downtime when a machine fails.

        Many data centers offer remote hands services. And I don't believe this is at all true.

        I worked at a place that managed thousands of boxes in dozens of POPs with 1.5 full-time people. If you design it for this from the beginning, with cattle not pets and netboot everywhere, this is very doable. And a large cost savings vs cloud.

        • bluGill 14 hours ago

          The assertion was about bringing this on-prem, so you don't get that offer of remote hands service. A data center instead of on-prem is a valid option and might be best - check the contract and the services they provide for you carefully.

      • kkielhofner 15 hours ago

        > With a part time person you will see downtime when a machine fails

        If a hardware failure causes downtime you're doing it wrong. Additionally, big cloud scaring people away from hardware with marketing and FUD has been very effective. Modern hardware is insanely reliable and performant - I don't think I've seen a datacenter/enterprise NVMe drive fail yet. It's not 2005, with spinning disks and power supplies blowing up left and right, anymore.

        > With 1 person that person will sometimes be on vacation when a zero day takes you down. With 2 people 1 will be on vacation when the second gets sick. You end up needing at least 5 people before you have enough people that you have redundancy for humans issues and the ability to train people in whatever is the latest needed.

        Hardware vendors (Dell, etc) have highly-discounted warranty services. In the event of a hardware failure you open a ticket and they dispatch someone directly to the facility (often within hours by SLA) and it gets handled.

        Same thing for shipping HW directly to co-lo and they rack/cable/bootstrap for a nominal fee, remote hands for weird edge-cases, etc.

        A lot of takes here and elsewhere seem to be either big-cloud or Meta-level datacenter. I have operated POPs in a dozen co-location ("datacenter") facilities (a cabinet or two each) that no one on staff ever set foot in, with hardware we owned (and/or financed) that no one ever saw or touched. We operated this with two people looking after it as part of their broader roles and responsibilities, and frankly they didn't have much to do.

        There is an entire industry that provides any number of highly flexible and cost-effective approaches for everything in between.

        • stackskipton 14 hours ago

          To me, the downside of on-premises hardware isn't hardware swap-out, it's dealing with hardware in general. All hardware needs updates, which means downtime for that hardware. Also, anyone in this industry long enough has been around for "Oh, we will just replace that broken piece of hardware" that ended up "WHY IS EVERYTHING ON FIRE?" because versions didn't match up, hardware was rejected, or just plain "Actually, THAT failure mode isn't redundant."

          That can happen to Public Cloud as well, but since they work with hardware at much, much larger scale and, most of the time, build the actual hardware and software, they are much more aware of the sharp edges.

          Finally, with the Broadcom acquisition, what virtualization software are you using, and is it really cheaper than the cloud?

          • kkielhofner 14 hours ago

            > Also, anyone in this industry long enough has been around for "Oh, we will just replace that broken piece of hardware" that ended up "WHY IS EVERYTHING ON FIRE?" because versions didn't match up, hardware was rejected

            I've been doing this for 25 years and I'm not sure what this means. Dell isn't going to come back to you and say "sorry but we can't fix this". With the warranty SLA, worst case scenario they'll just replace the entire machine if they have to, although I don't remember ever seeing it come to that.

            > just plain "Actually, THAT failure mode isn't redundant."

            When it comes down to it similar issues exist with clouds - regions, availability zones, etc. Big clouds have had multiple widespread outages just this year[0].

            From that reference you can see that MS and Amazon themselves struggle to design, build, and run solutions for their own products in their own clouds.

            It's always interesting to see marquee household name companies/products/solutions go down when US-East (or whatever) is having a bad day again.

            Cloud can be a lot of things but a silver bullet for reliability and uptime isn't one of them.

            [0] - https://www.forbes.com/sites/emilsayegh/2024/07/31/microsoft...

            • stackskipton 13 hours ago

              >I've been doing this for 25 years and I'm not sure what this means. Dell isn't going to come back to you and say "sorry but we can't fix this".

              Dell/EMC says "Hey, here is the drive replacement." We do it; 2 hours later, the volume is knocked offline. Apparently, there was a mismatch between the backplane version and the drive version, and through some weird edge case, it knocked the volume offline. Yes, they fixed it; no, it wasn't pretty, since a bunch of applications had to be recovered.

              No, public clouds are not 100% reliable either. It's just that their failures tend to be you twiddling your thumbs vs hair on fire on the phone with the vendor trying to get it resolved.

              • kkielhofner 10 hours ago

                > Dell/EMC says "Hey, here is drive replacement." We do it, 2 hours later, the volume is knocked offline. Apparently, there was mismatch between backplane version, drive version and through some weird edge case, it knocked the volume offline. Yes, they fixed it, no it wasn't pretty since a bunch of applications had to be recovered.

                Anecdotal (as is my position). I can theoretically understand this happening, but not only have I never seen it, such an issue would need to be escalated. That's a "this is unacceptable" high-level phone call - and a call someone in actual authority will more than likely answer, because IME unless you have SERIOUS spend with big cloud you'll be lucky to make it a rung or two up sales/support.

                Plus backups and redundancies that should prevent even the failure of a chassis/storage/etc from being a significant critical issue.

                > their failures tend to be you twiddling your thumbs vs hair on fire on phone with the vendor trying to get it resolved

                As a Founder/CTO I have the opposite take - put me and my team in a position to /do something/ vs sitting around waiting for AWS to come back whenever it decides to, while they obscure comms, don't update the fake status dashboards, etc. Meanwhile you're telling your customer "Ummm, we don't know - Amazon has a problem. When it comes back I guess it's back".

                Coming from a background of telecom, healthcare, and nuclear energy I can't believe that even flies.

    • 0xbadcafebee 8 hours ago

      Our company is literally in the 3rd week of waiting for a colo to install some new RAM modules in a server. Before that we waited two weeks to get a new server ordered, delivered, and racked. Before that we had to wait a week for them to tell us whether there were available power and network ports for the new server.

      That server is the main database. And yes, there is a backup server, but for reasons, the backup server isn't working as expected. So if that main server's RAM fails for good, there goes our product, for god knows how long, considering how long it's taken so far to get a second one set up.

      You don't have to deal with any of that shit in the cloud. None. You just spin up a new server in 2 seconds. You don't deal with shitty hardware, or the differences between old and new hardware (besides cpu arch, and some special classes), or incompatibilities, or running out of space, or getting smart hands in your rack, or a million other things.
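
      (For contrast, "spin up a new server" in the cloud really is one API call. A hedged boto3 sketch; the AMI ID and instance type are placeholders:)

        # Sketch: "spin up a new server" as a single API call with boto3.
        # The AMI ID and instance type are placeholders.
        import boto3

        ec2 = boto3.client("ec2", region_name="us-east-1")
        resp = ec2.run_instances(
            ImageId="ami-0123456789abcdef0",  # placeholder AMI
            InstanceType="r6i.4xlarge",
            MinCount=1,
            MaxCount=1,
        )
        print(resp["Instances"][0]["InstanceId"])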

      And that's just the hardware side. The software side of the cloud is the one million unique hosted services they offer that you can just start using immediately. No server set-up, no configuration management, it already has security baked in, it's already integrated with the other million services, etc. You just start using it, immediately, and it just works. It saves you time, complexity, maintenance, and it gives you reliability, compatibility, flexibility, and allows you to ship something earlier.

      I have managed servers on-prem for years, for tiny startups and huge companies, both two decades ago and two years ago. Without a doubt, I would always suggest any kind of hosted, cloud-style vendor over on-prem. Only if somebody truly needs to be on-prem, or is literally a teenager with no money at all and all the time in the world to waste DIYing, would I tell them to go on-prem.

    • milesward 16 hours ago

      Find me a list of customers on cloud who got hacked, vs folks on-prem. I've got 3k+ customers, I know which one I see 99.99% of the time...

      • whatever1 15 hours ago

        I guess you don't count misconfigurations. But deciding between cloud and local is a choice between config and admin.

    • alexjplant 14 hours ago

      Disclaimer: this is anecdotal so n=1. All opinions are my own. No value judgment one way or another is expressed or implied.

      Professional developers these days are primarily concerned with 1) getting their service running 2) as quickly as possible 3) someplace where they have instant access and control of it. Clicking around a cloud console accomplishes all three of these and allows you to write "Delivered the ____ service in 3 months that generates $XX M/year" on a performance review in short order. Having to build, rack, and configure a physical server or deal with "IT" (which has somehow become something separate from software engineering) does not. Because the developers are the ones delivering value they get to decide how it's done. AWS gets it done. A server in a datacenter in Texas that requires an SSH keypair to reach doesn't.

      Your average SDE L4 doesn't know or care about init systems or SANs or colos or 802.1q or any of the myriad things required to run on-prem infra. They write software. Software makes money and so the business makes money - wash, rinse, repeat. Why would you have people on the front lines of your revenue stream worrying about these things when you can have a hyperscaler with a control plane do it for a nominal fee?

      • whatever1 13 hours ago

        If the hyperscaler asks for 200% of my revenue then yes.

        • alexjplant 13 hours ago

          But they don't. They ask for a deterministic usage-based amount.

    • weitendorf 14 hours ago

      Because you're not Amazon, you also probably don't have tech as your core competency, and you don't have the budget to hire people skilled enough to operate your on-prem setup as well as Amazon operates their cloud.

      Because you're not a startup, there is a very good chance that you have a very process-driven (cover-your-ass), slow-moving culture. This very often translates to an IT department where getting even basic things done (like reserving extra compute, changing a network setting, or starting to use a third-party software) takes months of waiting or pleading. Maybe you have never encountered this kind of pathological IT department, but they're very common, and it's a major reason executives bought into cloud to begin with. Of course, many companies like Geico seem to have merely replicated their IT pathologies in the cloud, but at least in the cloud you have fewer sources of problems in areas like physical space management, buying/integrating hardware to grow or change your footprint (and dealing with all the SKUs and supply chain problems therein), or negotiating on-prem licences.

      There are many more moving pieces when operating on-prem: more operations staff across more kinds of roles (yes, you still have e.g. devops people when using the cloud, but you don't need as many building-operations staff (where managing a datacenter is its own specialty), people managing hardware/software vendors and the related supply chain, people skilled in physical networking, or people to plug things in and out and physically operate the machines); managing and acquiring the physical space where your on-prem setup lives; buying and accounting for all the different kinds of hardware you need; licensing and integrating more software to achieve functionality equivalent to e.g. EC2; and licensing all your 3P software to run on-prem. Even if nominally less expensive than the cloud in some cases, there are many more places where things can go wrong. That's not easy to account for in a direct TCO comparison because it manifests as slowing things down - which does introduce very substantial costs - and as distracting management from other opportunities to grow revenue or improve costs.

      Also, cloud downtime is really overstated as a problem in 2024. It makes the news because it has a high blast radius and involves high-profile companies, not because it's more common than on-prem. With the exception of AWS us-east-1 issues (which can break many AWS products at once across the world), most cloud reliability issues these days are isolated to only a few products and only a few regions. I think a lot of small on-prem companies don't realize that they are not actually more reliable; they just operate at a smaller scale where "lucky streaks" of uptime are more common (i.e. if you play roulette for three rounds, you're much more likely to have an abnormally high win rate than someone who plays it for three hundred rounds, even though you both have the same odds). Most companies don't have security/risk operations as mature as cloud providers', and so face an existential risk: the possibility of months of downtime in the event of a fire or natural disaster at their DC, a cryptolocker attack, or a janitor unplugging the server that says "do not unplug". This isn't something people have to worry about with cloud providers to nearly the same extent.
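
      The "lucky streak" effect is easy to see numerically. A quick sketch; the 1%-per-month incident rate is an arbitrary assumption:

        # Chance of observing *zero* incidents over a window, assuming an
        # independent 1%-per-month incident rate (arbitrary assumption).
        p_incident = 0.01

        for months in (3, 36, 300):
            p_clean = (1 - p_incident) ** months
            print(f"{months:>3} months: {p_clean:.1%} chance of a 'perfect' record")
        # ->  3 months: 97.0%, 36 months: 69.6%, 300 months: 4.9%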

    • VirusNewbie 15 hours ago

      > expecting a significant ramp up of compute demand

      Lots of data processing workloads don't need to be run constantly, but do need to be run in a shorter amount of time. Cloud is pretty good for that sort of thing.

  • beaviskhan 15 hours ago

    A company with the size and financial resources of Geico ought to be able to handle on-prem just fine. I am a huge public cloud fan, but it is definitely not a great (or even good) fit for everyone.

  • jnwatson 15 hours ago

    Cloud provides the CIO the same opportunities for advancement that COOs have had for years.

    Staff costs too high? Outsource. Opex too high? Insource.

    You can spend a career jumping among companies swinging the pendulum back and forth.

  • gtirloni 15 hours ago

    I'd gladly pay 2.5x more to not use OpenStack ever again.

  • mullingitover 15 hours ago

    I feel like even in Geico's case, once they've paid salaries for everyone who's going to need to maintain this infra they're bringing in-house, they're probably not saving that much. Then again, maybe they were already paying those salaries redundantly on top of all the services they were spending on, e.g. managed databases.

  • hnburnsy 15 hours ago

    Is building things cloud-provider agnostic a thing? Is building things cloud-or-on-prem agnostic a thing?

  • delusional 14 hours ago

    What a shame that the most interesting thing we can discuss about software now is where the computer it's running on is located.

    I must admit, the computer was never the part of software that interested me.

    • chronid 13 hours ago

      Even software (at least outside academia) eventually has to fight physics, and the thing with the most gravity of all: money.

  • stonethrowaway 14 hours ago

    > In an interview with The Stack she confirmed the shift, saying “we have a lot of data – and it turns out that storage in the cloud is one of the most expensive things you can do in the cloud, followed by AI in the cloud…”

    This has been the story for 20 years now. Not even exaggerating. We all knew it was expensive from the get-go because we all did things on prem.
