Personal experience between Fly.io and Railway.com - Railway wins for me hands down. I have used both, and Railway's support is stellar too, in comparison. Fly.io has never responded to my query about data deletion, despite my emailing their support address.
My Railway app has also stayed online to this day without any major downtime. I recommend anyone looking for a decent replacement give them a try.
How does it compare in terms of price?
My fly.io-hosted website went down for 5 minutes (6 hours ago), but then came right back up, and has been up ever since. I use a free monitoring service that checks it every 5 minutes, so it's possible it missed another short bit of downtime. But fly.io has been pretty reliable overall for me!
Would be fascinated to see your data over a period of months.
Application uptime is flaky, but what was worse was fly deploys failing for no clear reason. Sometimes layers would just hang and eventually fail; I'd run the same command an hour or two later without any changes and it would work as expected.
I'd love to make a monitoring service to deploy a basic app (i.e. run the fly deploy command) every 5 minutes and see how often those deploys fail or hang. I'd guess ~5% inexplicably fail, which is frustrating unless you've got a lot of spare time.
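If anyone wants to hack that together, here's a rough sketch of the idea (my own, not an existing service; it assumes flyctl is installed and the working directory contains a fly.toml):

    import subprocess
    import time

    INTERVAL = 5 * 60    # seconds between deploy attempts
    TIMEOUT = 15 * 60    # anything slower than this counts as "hung"
    attempts = failures = 0

    while True:
        attempts += 1
        try:
            result = subprocess.run(
                ["flyctl", "deploy", "--remote-only"],
                capture_output=True, text=True, timeout=TIMEOUT,
            )
            ok = result.returncode == 0
        except subprocess.TimeoutExpired:
            ok = False  # deploy hung past the timeout
        if not ok:
            failures += 1
        print(f"deploys: {attempts}, failed or hung: {failures} "
              f"({100 * failures / attempts:.1f}%)")
        time.sleep(INTERVAL)

Run it against a throwaway app for a few weeks and you'd have an actual failure rate instead of a guess.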
This may be of interest to you: https://news.ycombinator.com/item?id=42243282
My downtimes from Fly are pretty rare but generally global when they happen. In this outage we had no downtime but couldn't deploy for a few hours. I have issues with deploying about once per quarter (and I deploy most days across a few apps).
If that's the case, I suspect Fly is getting a lot more reliable. I stopped using them about a year ago, so I haven't kept up on their reliability since. Glad to hear it; it's good for a competitive market to have many providers, and Fly might have issues but hopefully has a bright future.
They are definitely getting more reliable. I was an early user and moved off them to self-hosting for quite a while because of the frequent downtime in the early days.
Their support still leaves a lot to be desired, even as someone who pays for it, but the ease of running and deploying a distributed front end keeps bringing me back.
I externally monitor fly.io and its docs here: https://flyio.onlineornot.com/
Looks like it lasted 16 minutes for them.
Do you mind if I ask what monitoring service that is?
Sure, it's UptimeRobot: https://uptimerobot.com/
Contrary to the title of the post, the Fly.io API remains inaccessible, meaning users still cannot access deploys, databases, etc.
For accurate updates, follow https://community.fly.io/t/fly-io-site-is-currently-inaccess...
Suspiciously, Turso started having issues around the same time. Their CEO confirmed on Discord it's due to the Fly outage:
> Ok.I caught up with our oncall and This seems related to the Fly.io incident that is reported in our status page. Our login does call things in the Fly.io API
> we are already in touch with Fly and will see if we can speed this up
Not the first time Turso has gone down because of Fly issues. It must suck to have built a DB service and have this downtime.
Apparently Turso are going to offer an AWS tier at some point.
Last month Turso released AWS-hosted databases to the public (still in Beta): https://turso.tech/blog/turso-aws-beta
fly.io publishes their post-mortems here: https://fly.io/infra-log/
The last post-mortem they wrote is very interesting and full of details. Basically, back in 2016 the heart or keystone component of fly.io's production infrastructure was Consul, a TLS-secured server that tracks shared state and requires that both the server certificate and the client certificate be authenticated. Since it was centralized, it had scaling issues, so in 2020 fly.io wrote a replacement for it called Corrosion, quickly forgot about Consul, but didn't have the heart to kill it. Then in October 2024 Consul's root signing key expired, which brought down all connectivity, and since it uses bidirectional authentication, they couldn't bring it back online until they deployed new TLS certificates to every machine in their fleet. Somehow they did this in half an hour, but the chain of dominoes had already been set in motion, revealing other weaknesses in their infrastructure that they could then eliminate.
There was another internal service whose own independent set of TLS keys had also expired long ago, but they didn't notice until they tried rebooting it as part of the Consul rekey, since doing so severed the TCP connections it had established back when its certificate was still valid. And the whole time this was happening, their logging tools were DDoSing their network provider. It took some real heroes to save the company, and all their customers too, when that many things explode at once.
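The recurring theme in that write-up is certificates quietly expiring. Purely as an illustration (not something from the post-mortem itself), a daily check along these lines against the endpoints you care about would flag that class of failure long before a key actually lapses:

    import socket
    import ssl
    from datetime import datetime, timezone

    def days_until_expiry(host: str, port: int = 443) -> float:
        """Fetch the TLS certificate presented by host:port and return
        how many days remain before it expires."""
        ctx = ssl.create_default_context()
        with socket.create_connection((host, port), timeout=10) as sock:
            with ctx.wrap_socket(sock, server_hostname=host) as tls:
                cert = tls.getpeercert()
        # 'notAfter' looks like 'Jun  1 12:00:00 2025 GMT'
        not_after = datetime.strptime(cert["notAfter"], "%b %d %H:%M:%S %Y %Z")
        not_after = not_after.replace(tzinfo=timezone.utc)
        return (not_after - datetime.now(timezone.utc)).total_seconds() / 86400

    if __name__ == "__main__":
        for host in ["fly.io"]:  # replace with your own watch list
            remaining = days_until_expiry(host)
            status = "OK" if remaining > 30 else "RENEW SOON"
            print(f"{host}: {remaining:.0f} days left [{status}]")

This only covers certificates presented to the public; for an internal mutual-TLS mesh like the one in the post-mortem, you'd point the same kind of check at each internal endpoint and at the CA material itself.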
On that Consul outage, Fly Infra concludes, "The moral of the story is, no more half-measures."
On their careers page [1], the Fly team goes, "We're not big believers in tech debt."
As an outsider, reads like a cacophony of contradictions?
[1] https://fly.io/docs/hiring/working/#we-re-ruthless-about-doi...
No one actually lives up to their principles, but it's still important that we have them.
If you actually do live up to yours, then you need to adopt better principles.
No principle is in itself beyond critique, agreed, but it's still the choice to pick this specific principle that tells the whole story. There are so many principles to pick from, and the tech-debt one is followed up with: "We have a 3-month “no refactoring” rule for new hires. This isn’t everyone’s preferred work style! We try to be up front about stuff." That sounds a bit like an additional perform-or-else principle that just delays ownership of the stuff you're supposed to work with. In the best case that sounds like naive optimism, and in the worst case it's gross negligence... neither one says "engineering" to me.
No surprise. About a year ago, I looked at fly.io because of its low pricing, and I wondered where they were cutting corners to still make some money. Ultimately, I found the answer in their tech docs, where it was spelled out clearly that a fly instance is hardwired to one physical server and thus cannot fail over in case that server dies. Not sure if that part is still in the official documentation.
In practice, that means if a server goes down, they have to load the last snapshot of that instance from backup, push it onto a new server, update the network path, and pray to god that no more servers fail than spare capacity allows. Otherwise you have to wait for a restore until the datacenter has mounted a few more boxes in the rack.
That explains quite a bit of the randomness in those outage reports, i.e. my app is down while another is fine, and mine came back in 5 minutes while another took forever.
As a business on a budget, I think almost anything else, e.g. a small Civo cluster, serves you better.
Fly.io can migrate vm+volume now: https://fly.io/docs/reference/machine-migration/ / https://archive.md/rAK0V
> a fly instance is hardwired to one physical server and thus cannot fail over
I'm having trouble understanding how else this is supposed to be? I understand that live migration is a thing, but even in those cases, a VM is "hardwired" to some physical server, no?
> I'm having trouble understanding how else this is supposed to be? I understand that live migration is a thing, but even in those cases, a VM is "hardwired" to some physical server, no?
You can run your workload (in this case a VM) on top of a scheduler, so if one node goes down the workload is just spun up on another available node.
You will have downtime, but it will be limited.
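To make the idea concrete, here's a toy sketch (purely illustrative; the node and VM names are made up, and this is not how Fly or any real orchestrator works internally): a reconcile loop notices a dead node and re-places its workloads onto a healthy one.

    import random

    # Toy cluster state: node health and where each workload currently runs.
    nodes = {"node-a": True, "node-b": True, "node-c": True}   # name -> healthy?
    placements = {"vm-1": "node-a", "vm-2": "node-b", "vm-3": "node-a"}

    def reconcile():
        """Restart any workload whose node is unhealthy on a healthy node.
        A real scheduler also weighs capacity, volumes, regions, and so on."""
        healthy = [n for n, ok in nodes.items() if ok]
        for vm, node in placements.items():
            if not nodes[node]:
                target = random.choice(healthy)
                print(f"{vm}: {node} is down, restarting on {target}")
                placements[vm] = target

    # Simulate a node failure and one reconcile pass.
    nodes["node-a"] = False
    reconcile()

The catch, as the rest of the thread points out, is state: rescheduling only helps if the workload's data is replicated or on shared storage; with a volume pinned to the dead host, there's nothing healthy to move onto until a snapshot is restored.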
> Ultimately, I found the answer in their tech docs where it was spelled out clearly that an fly instance is hardwired to one physical server and thus cannot fail over in case that server dies.
The majority of EC2 instance types did not have live migration until very recently. Some probably still don't (they don't really spell out how and when it's supposed to work). It is also not free - there's a noticeable brown-out when your VM gets migrated on GCP, for example.
Can you shed some more light on this "browning out" phenomenon?
Here's the GCP doc [1]. Other live migration products are similar.
Generally, you have worse performance while in the preparing to move state, an actual pause, then worse performance as the move finishes up. Depending on the networking setup, some inbound packets may be lost or delayed.
[1] https://cloud.google.com/compute/docs/instances/live-migrati...
If you want HA on Fly you need to deploy an app to multiple regions (multiple machines).
Fly might still go down completely if their proxy layer fails but it's much less common.
The status page tells a story about a high-availability/clustering system failure, so I think in this case the problem is rather the complexity of the HA machinery hurting the system's availability, versus something like a simple VPS.
The series of outages early in 2023 also had some Corrosion-related pain: https://community.fly.io/t/reliability-its-not-great/11253
Seems like rolling their own datastore turned out to be a bad bet.
I'm not super familiar with their constraints, but ScyllaDB can do eventual consistency and is generally quite flexible. CouchDB is also an option for multi-leader replication.
Oof, hugops to the team.
A recurring pattern I notice is that outages tend to occur the week of major holidays in the US.
- MS 365/Teams/Exchange had a blip in the morning
- Fly.io with complete outage
- then a handful of sites and services impacted due to those outages
I usually advocate against “change freezes”, but I think a change freeze around major holidays makes sense. Give all teams a recharge/pause/whatever.
Don’t put too much pressure on the B-squads that were unfortunate to draw the short stick.
Bad code rarely causes outages at this scale. The culprit is always configuration changes.
Sure you can try and reduce those as well during the holiday season, but what if a certificate has to be renewed? What if a critical security patch needs to be applied? What if a set of servers need to be reprovisioned? What if a hard disk is running out of space?
You cannot plan your way out of operational challenges, regardless of what time of year it is.
> Sure you can try and reduce those as well during the holiday season, but what if a certificate has to be renewed? What if a critical security patch needs to be applied? What if a set of servers need to be reprovisioned? What if a hard disk is running out of space?
Reading this, I see two routine operational issues, one security issue and one hardware issue.
You can't plan your way around security issues or hardware failures, but operational issues you both can and should plan around. Holiday schedules like this are fixed points in time, so there's absolutely no reason why you can't plan all routine work to be completed either a week in advance of, or a week after, the holiday period.
Certificates don’t need to be near the point of expiry to be renewed. Capacity doesn’t need to be at critical levels to be expanded. Ultimately, this is a risk management question (as a sibling has also commented). Is the organisation willing to take on increased risk in exchange for deferring operational expenses?
If the operational expense is inevitable (the certificate will need renewing), that seems like an easy answer when it comes to risk management over holidays.
If the operational expense is not inevitable (will we really need to expand capacity?), it then becomes a game of probabilities and financials - likelihood of expense being incurred, amount of expense incurred if done ahead of time, impact to business if something goes wrong during a holiday.
I think a good way of looking at it is risk. Is the change (whether it is code or configuration, etc.) worth the risk it brings on?
For example, if it's a small feature, then it probably makes sense to wait and keep things stable. But if not acting poses a larger, imminent danger, like an unpatched security hole or a disk running out of space, then it's worth taking on the risk of the change to mitigate the risk of not making it.
At the end of the day no system is perfect and it ends up being judgment calls, but I think viewing it as a risk tradeoff is a helpful frame.
This is a good observation. Do you have any resources I can read up on to make this safer?
I think you can't avoid the fact that these holiday weeks are different from regular weeks. If you "change freeze" then you also freeze out the little fixes and perf tuning that usually happens across these systems, because they're not "critical".
And then inevitably it turns out that there's a special marketing/product push, with special pricing logic that needs new code, and new UI widgets, causing a huge traffic/load surge, and it needs to go out NOW during the freeze, and this is revenue, so it is critical to the business leaders. Most of eng, and all of infra, didn't know about it, because the product team was cramming until the last minute, and it was kinda secret. So it turns out you can freeze the high-quality little fixes, but you can't really freeze the flaky brand-new features ...
It's just a struggle, and I still advise to forget the freeze, and try to be reasonable and not rush things (before, during, or after the freeze).
Some shops conduct game days as the freeze approaches.
https://wa.aws.amazon.com/wellarchitected/2020-07-02T19-33-2... / https://archive.md/uaJlR
Then you just get devs rushing out changes before the freeze…
As a developer I don't see why I would rush out a change before the freeze when I could just wait until after. Maybe a stakeholder that really wants it would press for it to get out but personally I'd rather wait until after so I'm not fixing a bug during my holiday.
Congrats on not working for the product team I work for
And stampeding changes in after the thaw, also leading to downtime. So it depends on the org, but doing a freeze is still a reasonable policy. Downtime on December 15th is less expensive than on Black Friday or Cyber Monday for most retailers, so it's just a business decision at that point.
Blip? 365 has had an ongoing incident since yesterday morning, European timezone. The reason I know is because I use their compliance tools to secure information in a rather large bankruptcy.
What do "Freezes" mean? Like, do you stop renewing your certificates? Do you stop taking in security updates for your software?
Sure maybe "unnecessary" changes, but the line gets very gray very fast.
No unnecessary code deployments.
Certs shouldn't still be done by hand at this point; if another Heartbleed comes out in the next 7 days, then the risk can be examined, escalated, and the CISO can overrule the freeze. If it's a patch for remote root via Bluetooth drivers on a server that has no Bluetooth hardware, it's gonna wait.
You're right that there's a grey line, but crossing that line involves waking up several people, and the on-call person makes a judgement call. If it's not important enough to wake up several people over, then things stay frozen.
Right, that's basically what I mean. There are a lot of automated changes happening in the background for services. I guess the whole thing I'm saying is that not every breakage is happening because of a code change.
It's not very grey: treat prod as if you told everyone but your ops team to go home and then sent your ops team on a cruise with pagers. If it's not important enough to merit interrupting their vacation, you don't do it.
Yep... can confirm my self-hosted Bitwarden there is completely FUBAR connection-wise, even though it's in EA, so it should be a worldwide outage... lemme guess: some internal tooling error, a consensus split-brain, or does it look like someone leaked BGP routes again?
It was a consensus split-brain (“database replication failure”) it seems
Mine is in Asia and it's still accessible.
DNS. It's always DNS. /s
https://github.com/jart/cosmopolitan/blob/master/third_party...
Might be! Shameless plug: a DNS tool I wrote years ago, for anyone this pushes to learn more about DNS:
https://dug.unfrl.com/
fly.io just has the weirdest outages. It has issues so regularly we don't even need to run mock outages to make sure our system failovers work.
When I worked for a company that worked with big banks and financial institutions, we used to run disaster recovery tests. Effectively a simulated outage where the company would try to run off their backup sites. They ran everything from those sites; it was impressive.
Once in a while we'd have a real outage that matched the test we ran as recently as the weekend before.
I was helping a bank switch over to the DR site(s) one day during such a real outage and I left my mic open when someone asked me what the commotion was on the upper floors of our HQ. I said "super happy fun surprise disaster recovery test for company X".
VP of BIG bank was on the line monitoring and laughed "I'm using that one on the executive call in 15, thanks!" Supposedly it got picked up at the bank internally after the VP made the joke and was an unofficial code for such an outage for a long time.
In most BIG banks, "Vice President" is almost an entry-level title. Easily have 1000s of them. For example, this article points out that Goldman Sachs had ~12K VPs out of more than 30K employees: https://web.archive.org/web/20150311012855/https://www.wsj.c...
VP at Goldman is equivalent to Senior SWE according to levels.fyi and their entry level is Analyst. I'm surprised by the compensation though. I would have thought people working at a place with gold in the name would be making more. Also apparently Morgan Stanley pays their VPs $67k/year.
Thankfully your comment was positive!
In fairness to the fly.io folks (who are extremely serious hackers), they’re standing up a whole cloud provider and they’ve priced it attractively and they’re much customer-friendlier than most alternatives.
I don’t envy the difficulty of doing this, but I’m quite confident they’ll iron the bugs out.
The tech is impressive and the pricing is attractive which is why we use them. I just wish there was less black magic.
I don’t always agree with @tptacek on social/political issues, and I don’t always agree with @xe on the direction of Nix, but these are legends on the technical side of things. And they’re trying to build an equitable relationship between the user of cloud services and the provider, not fund a private space program.
If I were in the market for cloud services I’d highly prize a long-term relationship on mutual benefit and fair dealings over a short-term nuisance of being an early adopter.
I strongly suspect your investment in fly is going to pay off.
Xe here. As a sibling comment said, I didn't survive layoffs. If you're looking for someone like me, I'm on the market!
Hiring people is above my pay grade, but I can vouch to my lords and masters and anyone else who cares what I think that a legend is up for grabs.
b7r6@b7r6.net
I'd email but I'm about to pass out in bed. Please see https://xeiaso.net/contact/ in case I don't get back to you in the morning.
FWIW Xe was let go from Fly earlier this year during a round of layoffs.
Unfortunate. Xe rocks.
I want to believe, but in the meantime they’re killing the product I’ve been working hard to build trust with my own customers though. There is a limit to my idealism, and it’s well and truly in the past.
It is not reflected on their status page, but fly.io itself is not even loading.
https://fly.io/ loading for me
Confirmation ;)
I'm grateful to HN for keeping me well aware of Fly's issues. I'll never use them.
It's still 99.99+% SLA? Would you really pay 100% more for <0.01% more uptime?
No dog in this fight, all props to the Fly.io team for having the gumption to do what they are doing, I genuinely hope they are successful...
> It's still 99.99+% SLA
But this is simply not accurate. 99.99% uptime allows < 52m 9.8s of downtime annually. They apparently blew well through that today. Looks like they burned through the equivalent of roughly 4 years' worth of 99.99% downtime budget this evening.
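For anyone who wants to sanity-check the arithmetic, a quick back-of-the-envelope (the exact seconds depend on the day-count convention a given SLA calculator uses, which is why quoted figures differ slightly):

    # Annual downtime budget for a given availability target, using a 365-day year.
    SECONDS_PER_YEAR = 365 * 24 * 60 * 60

    for label, availability in [("two nines", 0.99), ("three nines", 0.999),
                                ("four nines", 0.9999), ("five nines", 0.99999)]:
        budget_s = SECONDS_PER_YEAR * (1 - availability)
        print(f"{label:>11}: {budget_s / 60:8.1f} min/year ({budget_s:10.1f} s)")

    # Four nines comes out to roughly 52.6 minutes per year, so a multi-hour
    # outage burns several years of that budget in one evening.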
Four nines is so unforgiving that it's almost the case that if people are required to be in the loop at any point during an incident, you will blow the fourth nine for the whole year in a single incident.
Again, I know it's hard. I would not want to be in the space. That fourth nine is really difficult to earn.
In the meanwhile, <hugops> to the Fly team as they work to resolve this (and hopefully get some rest).
A 99.99+% SLA typically means you get some billing credits for downtime exceeding the 99.99+% availability target. So technically you do get a "99.99+% SLA", but you don't get 99.99+% availability.
Other circles use "SLO" (where the O stands for objective).
(Anyone know what the details in fly.io SLA are?)
You are correct in the legal/technical sense!
Technically, anyone could offer five- or six-nines and just depend on most customers not to claim the credits :-D
Actually hitting/exceeding four nines is still tough.
I think what a lot of people fail to understand is that there are certain categories of apps that simply “can never go down”
Examples include basically any PaaS, IaaS, or any company that provides a mission-critical service to another company (B2B SaaS).
If you run a basic B2C CRUD app, maybe it's not a big deal if your service goes down for 5 minutes. Unfortunately there are quite a few categories of companies where downtime simply isn't tolerated by customers. (I operate a company with a “zero downtime” expectation from customers - it's no joke, and I would never use any infrastructure abstraction layer other than AWS, GCP or Azure - preferably AWS us-east-1 because, well, if you know the joke…)
All of your examples have had multiple cases of going down, some for multiple days (2011 AWS was the first really long one I think) - or potentially worse, just deleting all customer data permanently and irretrievably.
Meaning empirically, downtime seems to be tolerated by their customers up to some point?
> I think what a lot of people fail to understand is that there are certain categories of apps that simply “can never go down”
I refuse to believe that this category still exists, when I need to keep my county's alternate number for 911 in my address book, because CenturyLink had a 6 hour outage in 2014 and a two day outage in 2018. If the phone company can't manage to keep 911 running anymore, I'd be very surprised what does have zero downtime over a ten year period.
Personally, nine nines is too hard, so I shoot for eight eights.
My experience with very large scale B2B SaaS and PaaS has been that customers like to get money, if allowed by contract, by complaining about outages, but that overall, B2B SaaS is actually very forgiving.
Most B2B SaaS solutions have very long sales cycles and a high total cost to implement, so there is a lot of inertia to switching that “a few annoying hours of downtime a year” isn’t going to cover. Also, the metric that will drive churn isn’t actually zero downtime, it’s “nearest competitor’s downtime,” which is usually a very different number.
Every PaaS and IaaS I’ve ever used has had some amount of downtime, often considerably more than 5 minutes, and I’ve run production services on many of them. Plenty of random issues on major cloud providers as well. Certainly plenty of situations with dozens of Twitter posts happening but never any acknowledgement on the AWS status page. Nothing’s perfect.
Yea, when running services where 5 minutes of downtime results in lots of support tickets, you learn to accept that incidents will happen and learn to manage the incident, rather than relying on it never occurring.
You realize all of those services you mention can't give you zero downtime; they would never even advertise that. They have quite good reliability, certainly, but on long enough time horizons absolutely no one has zero downtime.
If your app cannot ever go down, then you cannot use a cloud provider either (because even AWS and Azure do fail sometimes; just search for “Azure down” on HN).
But the truth is everybody can afford some level of outage, simply because nobody has the budget to provision an infra that can never fail.
I’ve seen a team try and be truly “multi-cloud” but then ended up with this Frankenstein architecture where instead of being able to weather one cloud going down, their app would die if _any_ cloud had an issue. It was also surprisingly hard to convince people it doesn’t matter how many globally distributed clusters you have if all your data is in us-east.
This is not my experience at all, as a former paying customer.
I can't even log in to my old account. The password reset is timing out, yet I still receive the password reset e-mail. And the password reset link is broken, returning a 500 status code.
HUGOPS
Everything is going to be 200 OK!
My apps on Fly have not gone down this time.
Kinda funny that they've named their global state store "Corrosion"... not really a word I'd associate with stability and persistence.
It's an internal project written in Rust, not a product, so I don't think it matters too much what they name it. It's open source, which is great, but still not a product that they need to market.
And to be fair, it's a bit of a cute meme to name Rust projects things that relate to rust. Oxide, etc.
I stored important data in mnesia, so who would I be to talk. :p
amnesia means forget, so mnesia means remember, I would guess?
I take your point but corrosion-resistant metals such as Aluminum, Titanium, Weathering Steel and Stainless Steel don’t avoid corrosion entirely but form a thin and extremely stable corrosion layer (under the right conditions).
Gold and platinum really are corrosion resistant though (but have questionable mechanical properties…)
https://community.fly.io/t/reliability-its-not-great/11253
https://github.com/superfly/corrosion
I tried Fly early. I was very excited about this service, but I've never had a worse hosting experience. So I left. Coincidentally I tried it again a few days ago. Surely things must be better. Nope. Auth issues in the CLI, frustrations deploying a Docker app to a Fly machine. I wouldn't recommend it to anyone.
I find their user experience to be exceptional. The only flake I’ve encountered is in uptime and general reliability of services I don’t interface with directly. They’ve done a stellar job on the stuff you actually deal with, but the glue holding your services together seems pretty wobbly.
What exactly does flyio.net do?
If you mean specifically flyio.net and not just fly.io the company, I'm guessing they host their status page on a separate domain in case of DNS/registrar issues with their primary domain.
IIRC their value prop is that they let you rapidly spin up deployments/machines in regions that are closest to your users, the idea being that it will be lower latency and thus better UX.
It’s basically what Heroku used to be but with CDN-like presence.
Hosting service that has a lot of interesting distributed features.
WEB 2.0. SEE. TOLD YA! THEY SHOULDA UPGRADED TO THAT NEWFANGLED 3.0! ;)
This is probably the 5th or 6th major outage from Fly.io that I have personally seen. Pretty sure there were many others, and some just went unnoticed. I recommended the service to a friend, and within two days he faced two outages.
Fly.io seriously needs to get it together. Why it hasn't happened yet is a mystery to me. They have a good product, but stability needs to be an absolute top priority for a hosting service. Everything else is secondary.
I get this, but I think if people can give GitHub a pass for shitting the bed every two weeks, maybe Fly should get a bit of goodwill here. I am not affiliated with Fly at all, but I do think people should temper their expectations when even a megacorp can't get it right.
I guess the secret is to be the incumbent with no suitable replacement. Then you can be complete garbage in terms of reliability and everyone will just hand wave away your poor ops story
The biggest difference is GitHub in your infrastructure is (nearly always) internal. Fly in your infrastructure is external. Users generally don't see when you have issues with GitHub, but they do generally see when you have issues with Fly.
That's the core difference.
Who's giving GitHub a pass on shitting the bed? They go down often enough that if you don't have an internal git server set up for your CI/CD to hit, that's on you.
My point is made by your very post - getting off GitHub onto alternatives is not seriously discussed as an option - instead it’s “well, why didn’t you prepare better to deal with your vendor’s poor ops story”
I wasn't going to bring up being on an internally hosted gitlab instead of github, but that would be the "not giving them a pass" part.
We left it about a year ago due to reliability issues. We now use DigitalOcean Apps and it's working like a charm. Zero downtime with DO.
You mean their App Platform right? How does the pricing compare to fly?
Yes, App Platform. Pricing is a little higher, but way lower than AWS, and it is fully justified. Zero downtime in the last year.
With Fly, we had 3-4 downtimes in 2023 in a span of 4 months.
Reliability is hard when your volume is (presumably) scaling geometrically.
Can't use the "reliability is hard" excuse when you are quite literally in the business of selling reliability.
It’s just not that big of a mystery. It’s not an excuse; it’s just true. Also, they’re not especially selling reliability as much as they’re selling small geo-distributed deployments.
Does anyone use them beyond the free tier? Same with Vercel for example.
Vercel has revenue of over $100M. So yes at least a few companies use them beyond the free tier.
Which company? GitHub? As far as I know fly.io does not have a free tier.
We switched from Fly to CF workers a while ago, and never looked back
They are fundamentally different. If Cloudflare provided a way to host docker containers with volumes though, that would be game over for so many paas platforms.
Can't wait: https://blog.cloudflare.com/container-platform-preview/
wow, this will be huge
I switched from apples to oranges and never looked back.
Our stuff on CF Workers has been working non stop for years now.
About 6 months ago we migrated our most critical stuff from Fly to CF and boy every time Fly has issues I'm so glad we did.
How are they equivalent?
Congrats on not developing a playbook for the time you have to 'look back'.
Providers will fail. Good contingencies won't.
...hears faint sound...I SAID GOOD, QUIET YOU!
Color me not surprised. My few interactions with people there just gave off the impression of them being in a bit over their heads. I don't know how well that translated to their actual ops, but it's difficult to not connect the two when they continue to have major outage after major outage for a product that 'should' be their customer's bedrock upon which they build everything else.
I was considering these guys the other day until I saw their pricing page: https://fly.io/pricing/
(There's not a single price on there, why even create the page?)
There's a link to what appears to be the actual pricing page https://fly.io/docs/about/pricing/
There's also a link to the pricing calculator https://fly.io/calculator
Is that calculator hourly or monthly?
Literally says "Monthly Costs" in the green panel on the right that calculates the total.
It's right there: "Monthly Cost"
The prices are just one click deeper. Hardly a nefarious dark pattern.