84 comments

  • xyst 9 hours ago

    Underlying tech is “Openpubkey”.

    https://github.com/openpubkey/openpubkey

    BastionZero just builds on top of that to provide a “seamless” UX for ssh sessions and some auditing/fedramp certification.

    Personally, not a fan of relying on CF. Need less centralization/consolidation into a few companies. It’s bad enough with MS dominating the OS (consumer) space. AWS dominating cloud computing. And CF filling the gaps between the stack.

  • edelbitter 10 hours ago

    Why does the title say "Zero Trust" when the article explains that this only works as long as every involved component of the Cloudflare MitM keylogger and its CA can be trusted? If host keys are worthless because you do not know in advance what key the proxy will have... then this scheme is back to trusting servers merely because they are in Cloudflare address space, no?

    • hedora 6 hours ago

      Every zero trust architecture ends up trusting an unbounded set of machines. Like most marketing terms, it’s probably easier to assume it does the inverse of what it claims.

      My mental model:

      With 1-trust (the default) any trusted machine with credentials is provided access and therefore gets one unit of access. With 2-trust, we’d need at least two units of trust, so two machines. Equivalently, each credential-bearing machine is half trusted (think ssh bastion hosts or 2FA / mobikeys for 2-trust).

      This generalizes to 1/N, so for zero trust, we place 1/0 = infinite units of trust in every machine that has a credential. In other words, if we provision any one machine for access, we necessarily provision an unbounded number of other machines for the same level of access.

      As snarky as this math is, I’ve yet to see a more accurate formulation of what zero trust architectures actually provide.

      YMMV.

      • choeger 3 hours ago

        I think your model is absolutely right. But there's a catch: Zero Trust (TM) is about not giving any machine any particular kind of access. So it's an infinite number of machines with zero access.

        The point of Zero Trust (TM) is to authenticate and authorize the human being behind the machine, not the machine itself.

        (Clearly, that doesn't work for all kinds of automated access, and it comes with a lot of questions in terms of implementation details (e.g., do we trust the 2FA device?), but that's the gist.)

    • varenc 8 hours ago

      https://www.cloudflare.com/learning/security/glossary/what-i...

      Zero Trust just means you stop inherently trusting your private network and verify every user/device/request regardless. If you opt in to using Cloudflare to do this then it requires running Cloudflare software.

      • PLG88 8 hours ago

        That's one interpretation... ZT also posits assuming the network is compromised and hostile, which also applies to CF and their cloud/network. It blows my mind that so many solutions claim ZT while mandating TLS termination at their infra/cloud (so you have to trust their decryption of your data) and, worst of all IMHO, MITMing your OIDC/SAML key to ensure the endpoint can authenticate and access services... that is a hell of a lot of implicit trust in them, not least because they can be served a court order to decrypt your data.

        Zero trust done correctly does not have those same drawbacks.

        • sshine 7 hours ago

          One element is buzzword inflation, and another is raising the bar.

          On the one hand, entirely trusting Cloudflare isn't really zero trust.

          On the other hand, not trusting any network is one narrow definition.

          I'll give you SSH keys when you pry them from my cold, dead FDE SSDs.

      • bdd8f1df777b 8 hours ago

        But with public key auth I'm already distrusting everyone on my private network.

        • resoluteteeth 8 hours ago

          Technically I guess that's "zero trust" in the sense of meeting the requirement of not trusting internal connections more than external ones, but in practice I guess "zero trust" also typically entails making every connection go through the same user-based authentication system, which uploading specific keys to specific servers manually definitely doesn't achieve.

    • ozim 3 hours ago

      “Zero Trust” means not assuming a user has access or is somehow trusted just because they are in a trusted context. So you always check the user's access rights.

      TLS having a trusted CA cert publisher is not what “Zero Trust” is about.

  • mdaniel 11 hours ago

    I really enjoyed my time with Vault's ssh-ca (back when it had a sane license) but have now grown up and believe that any ssh access is an antipattern. For context, I'm also one of those "immutable OS or GTFO" chaps because in my experience the next thing that happens after some rando ssh-es into a machine is they launch vi or apt-get or whatever and now it's a snowflake with zero auditing of the actions taken to it

    I don't mean to detract from this, because short-lived creds are always better, but for my money I hope I never have sshd running on any machine again

    • akira2501 4 hours ago

      > any ssh access is an antipattern.

      Not generally. In one particular class of deployments, allowing ssh access to root-enabled accounts without auditing may be... but that is an exceptionally narrow definition.

      > I hope I never have sshd running on any machine again

      Sounds great for production and ridiculous for development and testing.

    • advael 6 hours ago

      Principle of least privilege trivially prevents updating system packages. Like if you don't want people using apt, don't give people root on your servers?

    • ozim 10 hours ago

      How do you handle the DB?

      The stuff I work on is write-heavy, so spawning dozens of app copies doesn’t make sense if I just hog the DB with write locks.

      • mdaniel 10 hours ago

        I must resist the urge to write "users can access the DB via the APIs in front of it" :-D

        But, seriously, Teleport (back before they did a licensing rug-pull) is great at that and no SSH required. I'm super positive there are a bazillion other "don't use ssh as a poor person's VPN" solutions

        • zavec 9 hours ago

          This led me to google "teleport license," which sounds like a search from a much more interesting world.

    • ashconnor 6 hours ago

      You can audit if you put something like hoop.dev, Tailscale, Teleport or Boundary in between the client and server.

      Disclaimer: I work at Hashicorp.

    • namxam 11 hours ago

      But what is the alternative?

      • mdaniel 10 hours ago

        There's not one answer to your question, but here's mine: kubelet and AWS SSM (which, to the best of my knowledge, will work on non-AWS infra; it just needs to be provided creds). Bottlerocket <https://github.com/bottlerocket-os/bottlerocket#setup> comes batteries-included with both of those things, and is cheaply provisioned with (ahem) TOML user-data <https://github.com/bottlerocket-os/bottlerocket#description-...>

        In that specific case, one can also have "systemd for normal people" via its support for static Pod definitions, so one can run containerized toys on boot even without being a formal member of a kubernetes cluster

        AWS SSM provides auditing of what a person might normally type via ssh, and kubelet similarly, just at a different abstraction level. For clarity, I am aware that it's possible via some sshd trickery one could get similar audit and log egress, but I haven't seen one of those in practice whereas kubelet and AWS SSM provide it out of the box
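
        For anyone curious what "audited commands instead of ssh" looks like in practice, a minimal sketch with boto3's SSM client is below (the instance ID and commands are made up; Run Command keeps a record of the invocation and its output):

            # Minimal sketch: run a command on an instance via SSM Run Command
            # instead of ssh-ing in. Assumes boto3 credentials and an instance
            # registered with Systems Manager; the instance ID is hypothetical.
            import time
            import boto3

            ssm = boto3.client("ssm", region_name="us-east-1")
            instance_id = "i-0123456789abcdef0"

            resp = ssm.send_command(
                InstanceIds=[instance_id],
                DocumentName="AWS-RunShellScript",   # stock SSM document
                Parameters={"commands": ["uptime", "df -h /"]},
                Comment="audited replacement for an interactive ssh session",
            )
            command_id = resp["Command"]["CommandId"]

            time.sleep(2)   # crude wait; real code would poll or use a waiter
            result = ssm.get_command_invocation(
                CommandId=command_id, InstanceId=instance_id
            )
            print(result["Status"])
            print(result["StandardOutputContent"])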

      • ndndjdueej 7 hours ago

        IaC, send out logs to Splunk, health checks, slow rollouts, feature flags etc?

        Allow SSH in non prod environments and reproduce issue there?

        In prod you are aiming for "not broken" rather than "do whatever I want as admin".

      • candiddevmike 8 hours ago

        I built a config management tool, Etcha, that uses short lived JWTs. I extended it to offer a full shell over HTTP using JWTs:

        https://etcha.dev/docs/guides/shell-access/

        It works well and I can "expose" servers using reverse proxies since the entire shell session is over HTTP using SSE.
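
        (Not Etcha's actual code, but a rough Python/PyJWT illustration of the short-lived token idea: each token expires a few seconds after issuance, so anything captured on the wire goes stale almost immediately.)

            # Rough sketch of a short-lived signed token; the shared secret and
            # claim names are made up, and a real deployment would likely use
            # asymmetric keys rather than an HMAC secret.
            import time
            import jwt  # PyJWT

            SECRET = "demo-shared-secret"

            def mint_token(subject: str, ttl_seconds: int = 5) -> str:
                now = int(time.time())
                return jwt.encode(
                    {"sub": subject, "iat": now, "exp": now + ttl_seconds},
                    SECRET,
                    algorithm="HS256",
                )

            def verify_token(token: str) -> dict:
                # raises jwt.ExpiredSignatureError once the short window passes
                return jwt.decode(token, SECRET, algorithms=["HS256"])

            print(verify_token(mint_token("shell-session"))["sub"])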

        • g-b-r 7 hours ago

          “All JWTs are sent with low expirations (5 seconds) to limit replability”

          Do you know how many times a few packets can be replayed in 5 seconds?

          • candiddevmike 7 hours ago

            Sure, but this is all happening over HTTPS (Etcha only listens on HTTPS), it's just an added form of protection/expiration.

        • artificialLimbs 5 hours ago

          I don’t understand why this is more secure than limiting SSH to local network only and doing ‘normal’ ssh hardening.

    • riddley 11 hours ago

      How do you troubleshoot?

      • bigiain 10 hours ago

        I think ssh-ing into production is a sign of not fully mature devops practices.

        We are still stuck there, but we're striving to get to the place where we can turn off sshd on Prod and rely on the CI/CD pipeline to blow away and reprovision instances, and be 100% confident we can test and troubleshoot in dev and stage and by looking at off-instance logs from Prod.

        How important it is to get there, and what my motivations are, is something I ponder - it's clearly not worthwhile if your project is one or 2 prod servers perhaps running something like HA WordPress, but it's obvious that at Netflix-type scale nobody is sshing into individual instances to troubleshoot. We are a long way (a long long long long way) from Netflix scale, and are unlikely to ever get there. But somewhere between dozens and hundreds of instances is about where I reckon the work required to get close to there starts paying off.

        • xorcist 40 minutes ago

          > at Netflix type scale that nobody is sshing into individual instances to troubleshoot

          Have you worked at Netflix?

          I haven't, but I have worked with large scale operations, and I wouldn't hesitate to say that the ability to ssh (or other ways to run commands remotely, which are all either built on ssh or likely not as secure and well tested) is absolutely crucial to running at scale.

          The more complex and heterogeneous your environments, the more likely you are to encounter strange flukes: handshakes that only fail a fraction of a percent of the time, multiple products and providers interacting, and so on. Tools like tcpdump and eBPF become essential.

          Why would you want to deploy on a mature operating system such as Linux and not use tools such as eBPF? I know the modern way is just to yolo it and restart stuff that crashes, but as a startup or small scale you have other things to worry about. When you are at scale you really want to understand your performance profile and iron out all the kinks.

        • imiric 10 hours ago

          Right. The answer is having systems that are resilient to failure, and if they do fail being able to quickly replace any node, hopefully automatically, along with solid observability to give you insight into what failed and how to fix it. The process of logging into a machine to troubleshoot it in real-time while the system is on fire is so antiquated, not to mention stressful. On-call shouldn't really be a major part of our industry. Systems should be self-healing, and troubleshooting done during working hours.

          Achieving this is difficult, but we have the tools to do it. The hurdles are often organizational rather than technical.

          • bigiain 9 hours ago

            > The hurdles are often organizational rather than technical.

            Yeah. And in my opinion "organizational" reasons can (and should) include "we are just not at the scale where achieving that makes sense".

            If you have single digit numbers of machines, the whole solid observability/ automated node replacement/self-healing setup overhead is unlikely to pay off. Especially if the SLAs don't require 2am weekend hair-on-fire platform recovery. For a _lot_ things, you can almost completely avoid on-call incidents with straightforward redundant (over provisioned) HA architectures, no single points of failure, and sensible office hours only deployment rules (and never _ever_ deploy to Prod on a Friday afternoon).

            Scrappy startups, and web/mobile platforms for anything where a few hours of downtime is not going to be an existential threat to the money flow or a big story in the tech press - probably have more important things to be doing than setting up log aggregation and request tracing. Work towards that, sure, but probably prioritise the dev productivity parts first. Get your CI/CD pipeline rock solid. Get some decent monitoring of the redundant components of your HA setup (as well as the Prod load balancer monitoring) so you know when you're degraded but not down (giving you some breathing space to troubleshoot).

            And aspire to fully resilient systems, and have a plan for what they might look like in the future, to avoid painting yourself into a corner that makes it harder than necessary to get there one day.

            But if you've got a guy spending 6 months setting up chaos monkey and chaos doctor for your WordPress site that's only getting a few thousand visits a day, you're definitely doing it wrong. Five nines are expensive. If your users are gonna be "happy enough" with three nines or even two nines, you've probably got way better things to do with that budget.

            • Aeolun 8 hours ago

              > For a _lot_ things, you can almost completely avoid on-call incidents with straightforward redundant (over provisioned) HA architectures, no single points of failure, and sensible office hours only deployment rules (and never _ever_ deploy to Prod on a Friday afternoon).

              For a lot of things the lack of complexity inherent in a single VPS server will mean you have better availability than any of those bizarrely complex autoscaling/recovery setups

            • imiric 2 hours ago

              I'm not so sure about all of that.

              The thing is that all companies regardless of their scale would benefit from these good practices. Scrappy startups definitely have more important things to do than maintaining their infra, whether that involves setting up observability and automation or manually troubleshooting and deploying. Both involve resources and trade-offs, but one of them eventually leads to a reduction of required resources and stability/reliability improvements, while the other leads to a hole of technical debt that is difficult to get out of if you ever want to improve stability/reliability.

              What I find more harmful is the prevailing notion that "complexity" must be avoided at smaller scales, and that somehow copying a binary to a single VPS is the correct way to deploy at this stage. You see this in the sibling comment from Aeolun here.

              The reality is that doing all of this right is an inherently complex problem. There's no getting around that. It's true that at smaller scales some of these practices can be ignored, and determining which is a skill on its own. But what usually happens is that companies build their own hodgepodge solutions to these problems as they run into them, which accumulate over time, and they end up having to maintain their Rube Goldberg machines in perpetuity because of sunk costs. This means that they never achieve the benefits they would have had they just adopted good practices and tooling from the start.

              I'm not saying that starting with k8s and such is always a good idea, especially if the company is not well established yet, but we have tools and services nowadays that handle these problems for us. Shunning cloud providers, containers, k8s, or any other technology out of an irrational fear of complexity is more harmful than beneficial.

        • otabdeveloper4 2 hours ago

          A whole lot of words to say "we don't troubleshoot and just live with bugs, #yolo".

      • mdaniel 10 hours ago

        In my world, if a developer needs access to the Node upon which their app is deployed to troubleshoot, that's 100% a bug in their application. I am cognizant that being whole-hog on 12 Factor apps is a journey, but for my money get on the train because "let me just ssh in and edit this one config file" is the road to ruin when no one knows who edited what to set it to what new value. Running $(kubectl edit) allows $(kubectl rollout undo) to put it back, and also shows what was changed from what to what

        • yjftsjthsd-h 10 hours ago

          How do you debug the worker itself?

          • mdaniel 10 hours ago

            Separate from my sibling comment about AWS SSM, I also believe that if one cannot know that a Node is sick by the metrics or log egress from it, that's a deployment bug. I'm firmly in the "Cattle" camp, and am getting closer and closer to the "Reverse Uptime" camp - made easier by ASG's newfound "Instance Lifespan" setting to make it basically one-click to get onboard that train

            Even as I type all these answers out, I'm super cognizant that there's not one hammer for all nails, and I am for sure guilty of yanking Nodes out of the ASG in order to figure out what the hell has gone wrong with them, but I try very very hard not to place my Nodes in a precarious situation to begin with so that such extreme troubleshooting becomes a minor severity incident and not Situation Normal

            • __turbobrew__ 7 hours ago

              If accidentally nuking a single node while debugging causes issues you have bigger problems. Especially if you are running kubernetes any node should be able to fall off the earth at any time without issues.

              I agree that you should set a maximum lifetime for a node on the order of a few weeks.

              I also agree that you shouldn’t be giving randos access to production infra, but at the end of the day there need to be some people at the company who have the keys to the kingdom, because you don’t know what you don’t know and you need to be able to deal with unexpected faults or outages of the telemetry and logging systems.

              I once bootstrapped an entire datacenter with tens of thousands of nodes from an SSH terminal after an abrupt power failure. It turns out infrastructure has lots of circular dependencies and we had to manually break that dependency.

              • ramzyo 4 hours ago

                Exactly this. Have heard it referred to as "break glass access". Some form of remote access, be it SSH or otherwise, in case of serious emergency.

            • viraptor 5 hours ago

              Passive metrics/logs won't let you debug all the issues. At some point you either need a system for automatic memory dumps and submitting bpf scripts to live nodes... or you need SSH access to do that.

              • otabdeveloper4 2 hours ago

                This "system for automatic dumps" 100 percent uses ssh under the hood. Probably with some eternal sudo administrator key.

                Personal ssh access is always better (from a security standpoint) than bot tokens and keys.

          • from-nibly 7 hours ago

            You don't. You shoot it in the head and get a new one. If you need logging / telemetry, bake it into the image.

            • otabdeveloper4 33 minutes ago

              Are you from techsupport?

              Actually not every problem is solved with the "have you tried turning it off and back on again" trick.

  • tptacek 9 hours ago

    I'm a fan of SSH certificates and cannot understand why anyone would set up certificate authentication with an external third-party CA. When I'm selling people on SSH CA's, the first thing I usually have to convince them of is that I'm not saying they should trust some third party. You know where all your servers are. External CAs exist to solve the counterparty introduction problem, which is a problem SSH servers do not have.
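
    For anyone who hasn't tried it: a self-run CA is a couple of ssh-keygen invocations plus one sshd_config line. A minimal sketch, wrapped in Python purely for illustration (the file names, identity, and principals are made up):

        # Minimal sketch of a self-run SSH user CA with stock OpenSSH.
        # Assumes ssh-keygen is on PATH; all names here are hypothetical.
        import subprocess

        def run(*args):
            subprocess.run(args, check=True)

        # One-time: generate the CA keypair (keep ca_key offline/secured).
        run("ssh-keygen", "-t", "ed25519", "-f", "ca_key", "-N", "", "-C", "example-ssh-ca")

        # Per user: sign their existing public key with a short validity window.
        run(
            "ssh-keygen", "-s", "ca_key",   # sign with the CA key
            "-I", "alice@example.com",      # certificate identity (shows up in sshd logs)
            "-n", "alice,deploy",           # principals = accounts she may log in as
            "-V", "+8h",                    # certificate expires in 8 hours
            "id_alice.pub",                 # the user's pubkey; writes id_alice-cert.pub
        )

        # Server side, one line in sshd_config:
        #   TrustedUserCAKeys /etc/ssh/ca_key.pub
        # after which any unexpired cert signed by ca_key is accepted for the
        # principals it names, with no per-server authorized_keys to manage.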

    • kevin_nisbet 5 hours ago

      I'm with you. I imagine it's mostly people just drawing parallels: they can figure out how to get a web certificate, so they think SSH is the same thing.

      The second-order problem I've found is that when you dig in, there are plenty of people who ask for certs but, when push comes to shove, really want functionality where all active sessions get torn down immediately when a user's access is cancelled.

    • xyst 9 hours ago

      Same reason companies still buy “CrowdStrike” and install that crapware. It’s all for regulatory checkboxes (i.e., FedRAMP certification).

      • tptacek 8 hours ago

        I do not believe you in fact need any kind of SSH CA, let alone one run by a third party, to be FedRAMP-compliant.

  • antoniomika 10 hours ago

    I wrote a system that did this >5 years ago (luckily was able to open source it before the startup went under[0]). The bastion would record ssh sessions in asciicast v2 format and store those for later playback directly from a control panel. The main issue that still isn't solved by a solution like this is user management on the remote (ssh server) side. In a more recent implementation, integration with LDAP made the most sense and allows for separation of user and login credentials. A single integrated solution is likely the holy grail in this space.

    [0] https://github.com/notion/bastion
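
    (For context, asciicast v2 is just newline-delimited JSON: a header object on the first line, then one [time, event-type, data] array per terminal event. A rough sketch of a writer - not the linked bastion's code, just the file layout it records into:)

        # Rough sketch of recording terminal output in asciicast v2 format.
        import json
        import time

        class AsciicastWriter:
            def __init__(self, path, width=80, height=24):
                self.f = open(path, "w")
                self.start = time.time()
                header = {"version": 2, "width": width, "height": height,
                          "timestamp": int(self.start)}
                self.f.write(json.dumps(header) + "\n")

            def output(self, data: str):
                # "o" events are terminal output; "i" would be keyboard input
                event = [round(time.time() - self.start, 6), "o", data]
                self.f.write(json.dumps(event) + "\n")

            def close(self):
                self.f.close()

        w = AsciicastWriter("session.cast")   # playable with `asciinema play`
        w.output("$ uptime\r\n")
        w.output(" 12:00:00 up 1 day,  2 users,  load average: 0.01\r\n")
        w.close()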

    • mdaniel 10 hours ago

      Out of curiosity, why ignore this PR? https://github.com/notion/bastion/pull/13

      I would think even a simple "sorry, this change does not align with the project's goals" -> closed would help the submitter (and others) have some clarity versus the PR limbo it's currently in

      That aside, thanks so much for pointing this out: it looks like good fun, especially the Asciicast support!

      • antoniomika 9 hours ago

        Honestly, I never had a chance to review or merge it. Once the company wound down, I had to move on to other things (find a new job, work on other priorities, etc.) and lost access to be able to do anything with it afterwards. I thought about forking it and modernizing it, but that never came to fruition.

  • INTPenis an hour ago

    Properly set up IaC that treats Linux as an appliance instead could get rid of SSH altogether.

    I'm only saying this because after 20+ years as a sysadmin I feel like there have been no decent solutions presented. On the other hand, to protect my IaC and Gitops I have seen very decent and mature solutions.

    • otabdeveloper4 13 minutes ago

      I don't know what exactly you mean by "IaC" here, but the ones I know use SSH under the hood somewhere. (Except with some sort of "bot admin" key now, which is strictly worse.)

  • shermantanktop 9 hours ago

    I didn’t understand the marketing term “zero trust” and I still don’t.

    In practice, I get it - a network zone shouldn’t require a lower authn/z bar on the implicit assumption that admission to that zone must have required a higher bar.

    But all these systems are built on trust, and if it isn’t based on network zoning, it’s based on something else. Maybe that other thing is better, maybe not. But it exists and it needs to be understood.

    An actual zero trust system is the proverbial unpowered computer in a bunker.

    • wmf 8 hours ago

      The something else is specifically user/service identity. Not machine identity, not IP address. It is somewhat silly to have a buzzword that means "no, actually authenticate users" but here we are.

    • athorax 6 hours ago

      It means there is zero trust of a device/service/user on your network until they have been fully authenticated. It is about having zero trust in something just because it is inside your network perimeter.

    • ngneer 9 hours ago

      With you there. The marketing term makes Zero Sense to me.

  • EthanHeilman 11 hours ago

    I'm a member of the team that worked on this; happy to answer any questions.

    We (BastionZero) recently got bought by Cloudflare and it is exciting bringing our SSH ideas to Cloudflare.

  • johnklos 9 hours ago

    So... don't trust long lived ssh keys, but trust Cloudflare's CA. Why? What has Cloudflare done to earn trust?

    If that alone weren't reason enough to dismiss this, the article has marketing BS throughout. For instance, "SSH access to a server often comes with elevated privileges". Ummm... every authentication system ever comes with whatever privileges it was configured to grant. This is the kind of bull you say / write when you want to snow someone who doesn't know any better. To those of us who do understand this, it's almost AI-level bullshit.

    The same is true of their supposed selling points:

    > Author fine-grained policy to govern who can SSH to your servers and through which SSH user(s) they can log in as.

    That's exactly what ssh does. You set up precisely which authentication methods you accept, you set up keys for exactly that purpose, and you set up individual accounts. Do Cloudflare really think we're setting up a single user account and giving access to lots of different people, and we need them to save us? (now that I think about it, I bet some people do this, but this is still a ridiculous selling point)

    > Monitor infrastructure access with Access and SSH command logs

    So they're MITMing all of our connections? We're supposed to trust them, even though they have a long history of not only working with scammers and malicious actors, but protecting them?

    I suppose there's a sucker born every minute, so Cloudflare will undoubtedly sell some people on this silliness, but to me it just looks like yet another way that Cloudflare wants to recentralize the Internet around them. If they had their way, then in a few years, were they to go down, a majority of the Internet would literally stop working. That should scare everyone.

  • curben 6 hours ago

    Cloudflare has been offering SSH CA-based authentication for more than 2 years [1], I wrote a guide back in feb '23 [2]. The announcement is more about offering new features, such as more granular user control.

    [1]:https://web.archive.org/web/20210418143636/https://developer...

    [2]: https://mdleom.com/blog/2023/02/13/ssh-certificate-cloudflar...

  • keepamovin 3 hours ago

    Does this give CloudFlare a backdoor to all your servers? That would not strictly be ZT, as some identify in the comments here.

    • knallfrosch an hour ago

      Everything rests on CloudFlare's key.

  • cyberax 9 hours ago

    Hah. I did pretty much all the same stuff in my previous company.

    One thing that we did a bit better: we used AWS SSM to provision our SSH-CA certificates onto the running AWS EC2 instances during the first connection.

    It would be even better if AWS allowed using SSH CA certs as keys, but alas...

    • pugz 5 hours ago

      FYI I love your work with Gimlet, etc.

      I too would love "native" support for SSH CAs in EC2. What I ended up doing is adding a line to every EC2 userdata script that would rewrite the /home/ec2-user/.ssh/authorized_keys file to treat the provided EC2 keypair as a CA instead of a regular pubkey.
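
      Roughly, that rewrite just prefixes the injected public key with OpenSSH's cert-authority option; something along these lines (the script itself is hypothetical, the authorized_keys option is real OpenSSH):

          # Sketch of the userdata rewrite described above: treat the provided
          # EC2 public key as a CA rather than as a directly authorized key.
          from pathlib import Path

          auth_keys = Path("/home/ec2-user/.ssh/authorized_keys")

          rewritten = []
          for line in auth_keys.read_text().splitlines():
              line = line.strip()
              if line and not line.startswith("cert-authority "):
                  # Any user certificate signed by this key is now accepted,
                  # instead of only this exact key itself.
                  line = "cert-authority " + line
              rewritten.append(line)

          auth_keys.write_text("\n".join(rewritten) + "\n")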

  • arianvanp 8 hours ago

    Zero trust. But they don't solve the more interesting problem: host key authentication.

    Would be nice if they could replace TOFU access with an SSH CA as well. Ideally based on the device posture of the server (e.g. TPM2 attestation).
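
    For what it's worth, stock OpenSSH can already do the client side of this: trust a host CA via an @cert-authority line in known_hosts and sign host keys with ssh-keygen -h. A small sketch (the host pattern and paths are made up):

        # Sketch of replacing TOFU with a host CA on the client side.
        from pathlib import Path

        ca_pub = Path("ca_key.pub").read_text().strip()   # host CA public key
        entry = f"@cert-authority *.internal.example {ca_pub}\n"

        known_hosts = Path.home() / ".ssh" / "known_hosts"
        with known_hosts.open("a") as f:
            f.write(entry)

        # Server side: sign the host key (ssh-keygen -s ca_key -h ...) and
        # point sshd's HostCertificate at the resulting *-cert.pub file.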

  • anilakar an hour ago

    Every now and then a new SSH key management solution emerges and every time it is yet another connection-terminating proxy and not a real PKI solution.

  • advael 7 hours ago

    You know you can just do this with keyauth and a cron job, right?

    • wmf 6 hours ago

      And Dropbox is a wrapper around rsync.

      • advael 6 hours ago

        Generally speaking a lot of "essential tools" in "cloud computing" are available as free, boring operating system utilities.

        • kkielhofner 4 hours ago

          It’s a joke from a famous moment in HN history:

          https://news.ycombinator.com/item?id=9224

          • advael 2 hours ago

            That is pretty funny, and the whole idea that you can't make money packaging open-source software in a way that's more appealing to people is definitely funny given that this is the business model of a lot of successful companies

            I do however think this leads to a lot of problems when those companies try to protect their business models, as we are seeing a lot of today