Railway Is Having a Major Outage

(status.railway.com)

82 points | by kgraves 3 hours ago

71 comments

  • fjni an hour ago

    Wait… railway runs on GCP? Didn’t they make a whole thing about not “building a cloud on top of another cloud?”

    Or did they just mean that they’re not renting VPSs but only metal from the cloud provider?

    In my mind I was so excited that there was another provider not just paying one of the hyperscalers but at a minimum colocating and owning more of their stack. https://blog.railway.com/p/heroku-walked-railway-run

    • miniman1337 an hour ago

      from the blog linked via Wayback Machine. "From Day 1, we had this notion at the forefront.

      The other notion that we have intuited is that you can’t build a cloud on another cloud. We have devoted years of practice running our own metal (and playing well with other clouds) to make sure that Railway’s business, which invariably becomes your customer’s business, is as rock solid as possible."

      • MrDarcy 38 minutes ago

        That’s strange, when I interviewed with the founder a few years ago he told me they were on AWS wanting to move to firecracker.

    • eoswald an hour ago

      Yep, and this is why I'm pissed. They lied. They're completely dependent on GCP. So, I gotta do some research, I need something a little more stable (and less dependent on one company's whims) than this. This is bad for them, because it really strikes at the heart of their 'big claim,' peaceful software deployments. This is chaos.

      • ndneighbor an hour ago

        Yea, I mean, that's the whole MO of our platform and we failed at that. So yea, that's disappointing and more so for our customers.

        I can provide an explanation about the GCP dependency. Yes, we have host workloads off GCP, and we have been able to build a good business by performing a cloud exit. However, we were worried that we would have a circular dependency on our own cloud. I don't think we expected to get auto-modded out of our own account, hence we left our DB on CloudSQL.

        It was never our intent to deceive people that we didn't own our own destiny with our business. The last GCP issue, we were assured that this scenario wouldn't happen (when we got auto-ratelimited, which was bad, but survivable) - but it seems like we have further work to do. Apologies.

        • fontain an hour ago

          I’m very sympathetic and understand that decisions are easy to criticize in hindsight but leaving your database in GCP while moving everything else to your own data centres seems so backwards I can’t even begin to imagine how that could happen. Was this really an intentional design decision?

          • arjie an hour ago

            I have exactly the same architecture. You can easily administer a postgres/mysql on your own infrastructure, but it's also the one thing where backups and availability are super strict. I can easily support multi-region in Google Cloud or AWS and that's way harder to do on-prem, and it's also hard to handle the replication story as safely as with Google Cloud. The hope is that GCP et al. give you safety and availability for the control plane stuff and you can run your data plane on-prem.

            At $2m/mo spend, this kind of thing is insane. GCP has never been the most reliable of clouds but this is pretty awful. I would never have expected this.

          • ndneighbor an hour ago

            > decisions are easy to criticize in hindsight

            I mean, the pain we have caused our customers ultimately proves you correct. That said, we made our decisions with the information and constraints that we knew in that moment in time. Railway has hosts in AWS/GCP/and co-los, so coordinating those workloads in a fully distributed manner would be ideal, but at the end of the day, we didn't foresee that we would just have our project get deleted just like that.

            (Even if we did get assurances from them in 2024, that it wouldn't happen again, although we just got auto-rate limited the last time.)

            • csw-001 19 minutes ago

              Thanks for getting things back up (genuinely mean that, btw). Upon logging back in I was prompted to promise I'm not deploying naughty things (I'm not). Was this in response to GCP detecting illegal (prohibited) behavior from something deployed via railway?

              • ndneighbor 4 minutes ago

                Actually, when I made the TOS check, I put that in Redis. That + the feature flags got reset.

            • r_lee 42 minutes ago

              could you clarify, did an automated process by Google delete a GCP project/account/resource(s)? like, what exactly were you seeing when trying to get access or see what happened?

              • ndneighbor 26 minutes ago

                They deleted our GCP proj. sans warning. Still working the details, but that's how this whole thing began.

          • yen223 18 minutes ago

            this is easily explained by "database migrations are incredibly difficult and very risky"

  • eoswald 2 hours ago

    Sorry, I have a hard time blaming Google for this, when Railway seems to be having increasing trouble keeping the platform stable. Something like this should NOT take down an ENTIRE service. There should be a backup when literally your business is about being the reliable backend. This just seems like poor planning to me.

    • ryanisnan 2 hours ago

      I don't quite know what you mean. Do you really expect Railway to use a multi-cloud architecture to host all of their clients' projects? I suspect that would lead to a lower availability, all things considered.

      • kgeist 11 minutes ago

        Dunno, we have a multi-datacenter setup (bare metal with petabytes of data) and it's been great. The whole infrastructure is managed by a team of like ~5 (part-time, they have other responsibilities as well). Once the main instance is down, there's a script to switch everything to the standby instance in another data center in like ~5 minutes. The routine is usually: first switch, and investigate/ask questions later. I'm surprised even large companies struggle with this, and you can see those status reports going "yeah, we're still down, still investigating" for hours. Sometimes the data center has networking issues. Sometimes our DB instance gets destabilized under load for some reason. Once there was a funny case when some technician in the data center misunderstood what we wanted from him and suddenly turned off our servers. It happens quite often if you run bare metal. If you have big enterprise clients, I don't understand how you can store all eggs in one basket and sleep well.

      • eoswald an hour ago

        Well, by the same token, is it smart to base your ENTIRE architecture on a single cloud architecture? Isn't that why some of us build in fallbacks for AWS-hosted services? I mean, their entire platform, both public and private facing, is running on the same thing. One error, one problem, takes out the entire service.

      • impulser_ an hour ago

        They literally own their own data centers. That's what's surprising about this. They are lying to their customers when they say they operate their own data center because obviously they don't if everyone's apps are down with GCP blocking their account.

        • brookst an hour ago

          Is it not possible that they own their own data center and have an unfortunate Google dependency?

          Obviously a fiasco but I’m not prepared to call them liars when it could be an honest mistake.

          • impulser_ 23 minutes ago

            Then don't say you're not a "Cloud on top of a cloud" provider.

            They even made fun of cloud providers being down when AWS was down.

          • Terr_ an hour ago

            I imagine there's also an important difference between:

            1. We depend on X but could gracefully migrate to an alternate in a week if we really needed to.

            2. All data is mirrored instantly so that we can do seamless fail-over in case X has its own outage.

        • ryanisnan an hour ago

          Oh, I see what you mean. Eh, it's possibly the same reason that AWS essentially goes down when us-east-1 goes down.

    • cactusplant7374 2 hours ago

      Disaster recovery is pretty expensive, right? Especially for their size.

  • whh an hour ago

    This could kill a startup. I really don't like Google's automated and silent account murder functionality.

    • MrDarcy 35 minutes ago

      There’s no way this was automated or silent.

      The only reasonable explanation is Railway lost control of their estate and something was happening that warranted a group of humans to decide flipping the kill switch was the best of a set of bad alternatives.

      • macintux 29 minutes ago

        You’re giving Google far more credit than they’ve earned.

  • Avicebron an hour ago

    Isn't Railway the "the API key to delete the backups is in the prod database, because that's where the backups live duh" guys?

  • faangguyindia 2 hours ago

    Google Cloud also locked out a Korean Government Organization recently. The guy posted on GCP subreddit.

    Google really need to improve their support team. It's strange such a big corp can't even afford to have proper support team.

    • danpalmer an hour ago

      > It's strange such a big corp can't even afford to have proper support team

      Railway say they are in touch with that support team.

    • choilive 42 minutes ago

      Not strange, Google has never had a proper support team unless you are an "Enterprise" level customer.

    • benwoodward an hour ago

      pretty sure their support team is a flaky ML model that is haplessly flagging random accounts

    • King-Aaron 2 hours ago

      > It's strange such a big corp can't even afford to have proper support team

      This seems to be by design.

      • ndneighbor an hour ago

        We have a CSM, Head of Customer Support contact, and further contacts with GCP. Despite that, we still had this issue.

    • add-sub-mul-div an hour ago

      Automating support, automating everything is the key to their whole deal. Tech giants leapfrogged the rest of the economy by innovating a company that can scale its customers without having to scale itself proportionally.

  • enahs-sf an hour ago

    I respect what railway is doing but also would never run my business on such a platform.

    • eoswald an hour ago

      Today changed my opinion on them completely. Was willing to give them the benefit of the doubt that they're growing fast, but now seeing that they've failed to scale properly, and are missing little things that become big things later. I can't take that risk.

    • dpark an hour ago

      That kind of sounds like you don’t respect what they are doing.

  • TheTaytay an hour ago

    I’ve seen a few smug “all your eggs in one basket” comments here.

    I’m aware of some companies hosting their own metal and infra, but I’m not aware of large companies mitigating risk by hosting on separate cloud providers as a fallback mechanism. We might disagree with cloud provider choice, or think they should have been hosting their own metal, but that’s still an “all your eggs in one basket” choice, right?

    Heck, they might even have multi-region fallback with GCP, but if GCP bans your account, that doesn’t matter.

    Are there good examples of running a company of railway’s size so redundantly that their host could nuke one of their accounts and they’d just keep on trucking?

    • fontain an hour ago

      They do run their own metal. That’s their entire ethos. Railway is their own cloud.

    • chradams an hour ago

      Just google multi-cloud. Yes. It's a thing.

      • wmf 34 minutes ago

        99% of multi-cloud is fake though. True multi-cloud is incredibly rare.

  • Mengkudulangsat 2 hours ago

    That explains why all my vibe-coded hobby projects are down.

    Thank God I'm not dealing with any public-facing sites! Would have been an expensive lesson for a newbie coder if my job depended on this.

  • dwa3592 an hour ago

    Wait, I thought railway was a cloud provider like AWS, GCP but better and more agile. At least that's the impression I got from their website.

  • throwaranay4933 3 hours ago

    This screenshot from Discord suggests the outage was caused by an automated GCP account ban: https://x.com/acgfbr/status/2056866780866351323

  • brokenodo 2 hours ago

    I’m a new customer and have been falling in love with Railway over the last 2 weeks, but this is quite the wake up call.

    • choilive 41 minutes ago

      Been a customer with them for over a year now, small incidents here and there but never anything this major.

    • csw-001 2 hours ago

      Literally in the same boat. I've been really happy with it, but this is a major eye opener.... It's been down for a looooong time by provider standards.

    • TheAtomic an hour ago

      same same

  • bshack0 an hour ago

    so....what are we switching to y'all? cloud-run ? ;P

    • auxiliarymoose an hour ago

      federated hardware (a bunch of raspberry pis networked into a high availability kubernetes cluster, hidden across various local coffee shops for free power and bandwidth)

    • throwatdem12311 an hour ago

      raspberry-pi cluster in my closet

      • frio 29 minutes ago

        16GiB Raspberry Pi 5s in my country are now going for ~$450USD, so I've gotta say that's out of reach for me now :(.

  • ryanisnan 2 hours ago

    Yikes. I was wondering why my TLS certs were coming up as invalid.

  • Drew-Aetherwave an hour ago

    It is killing me...

  • Osborn_Ojure 44 minutes ago

    compute recovered, get ready boys!

  • mcontrerazCL 2 hours ago

    all my fkn postgres DBs are in Railway! what do i do now?

    • eoswald an hour ago

      Hahah at least you're not getting called every five minutes because you can't shut off the alerts, because it's apparently deployed SOMEWHERE but good luck finding how to access it. Can't wait to see the bill from Twilio because of this lol

    • cactusplant7374 2 hours ago

      Take a walk. Breathe in the fresh air. It feels good.

  • iloveplants 3 hours ago

    seems like it's every day

  • rekabis an hour ago

    TL;DR: putting all your eggs into one basket is bad, man.

    • lfx an hour ago

      That’s true, however having only a few eggs and shopping for several baskets does not make sense in the early days. Not sure how big railway is, but usually you start small with one egg.

      • christophilus an hour ago

        You’d think they wouldn’t have started with GCP. There are plenty of datacenters where you can buy racks and racks of servers, and talk to a human when something goes wrong, and even walk in and access your servers. That’s what I’d be using if I were to build a Rackspace today.

        • tomschlick an hour ago

          They started on GCP and have been migrating to their own "Metal" DC doing exactly what you're describing. But GCP is still their overflow given how rapidly they are growing and holds some amount of networking that routes to their DC.

        • wmf 30 minutes ago

          Colo is worse than cloud when you're getting started. Sure, you can talk to a person but everything else is much lower quality. People are obsessed with having someone to yell at but yelling does not fix outages.
