How do you deploy in 10 seconds?

(paravoce.bearblog.dev)

62 points | by mpweiher 4 days ago

57 comments

  • notwhereyouare 3 days ago

    100% this is how my company used to deploy. We had multiple servers, rsync'd code to each one, and cycled IIS. Worked well. Deploys to our farm took just a minute or two, because it would do a double deploy just to be extra sure everything went out.
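
    As a minimal sketch of that kind of loop (hostnames, paths, and the recycle command are illustrative, not the actual setup):

        for host in web01 web02; do
            # ship the new build, then recycle the IIS app pool to pick it up
            rsync -az --delete ./build/ "$host:/inetpub/app/"
            ssh "$host" 'appcmd recycle apppool /apppool.name:App'
        done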

    Then the BORG came and assimilated us. Our deploys now easily take 45+ minutes to really start shifting traffic.

    • valbaca 3 days ago

      what's IIS?

      • AndrewDucker 3 days ago

        Internet Information Server - the MS web server.

        • pier25 2 days ago

          Is it still in use? I thought it had been replaced by Kestrel.

      • notwhereyouare 3 days ago

        As Andrew mentioned, Microsoft's web server. We are a .NET shop.

  • 0xbadcafebee 3 days ago

    I've regularly gotten CI/CD deploys down to <30 seconds without a ton of fancy caching. You just need to look at what's taking a lot of time, and optimize.

    - On commit/push, your build runs once and is stored as an artifact. If nothing has changed, don't rebuild; reuse (a sketch of this check follows the list).

    - Your build gets packed into a Docker container once and pushed to a remote registry. If nothing has changed, don't rebuild, reuse.

    - Every test and subsequent stage uses the same build artifact and/or container. Again, this is as simple as pulling a binary or image. Within the same pipeline workspace, it's a file on disk shared between jobs.

    - Using a self-hosted CI/CD runner, on the same network and provider as your artifact/container registry, means extremely low-latency, high-bandwidth file transfers. And because it's self-hosted, you don't have to wait for a runner to be allocated; it's waiting for you, and you just connect to it and use it immediately. K8s runners on autoscaling clusters make it easy to scale jobs in parallel.

    - Having each pipeline step use a prebuilt Docker container, and not having each step do a bunch of repetitive stuff (like installing tools, downloading deps...) when they don't need to, is essential. If every single job is doing the same network transfer and same tool install every time, optimize it.

    - A kubernetes deploy to production should absolutely take its time to cycle an old and new pod. Half of the point of K8s is to prevent interrupting production traffic and rely on it to automatically resolve issues and prevent larger problems. This means leaning on health checks and ramping traffic for safety. But actually running the deploy part should be nearly instantaneous, a `kubectl apply` or `helm upgrade` should take seconds.
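
    A minimal sketch of the "don't rebuild, reuse" check from the list above (registry and tag scheme are illustrative):

        # tag images by commit; if the registry already has one, skip the build
        IMAGE="registry.example.com/app:$(git rev-parse HEAD)"
        if docker manifest inspect "$IMAGE" >/dev/null 2>&1; then
            echo "image exists, reusing"
        else
            docker build -t "$IMAGE" . && docker push "$IMAGE"
        fi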

    The only exception to all this is if you (rightly) have a very large test suite that takes a while to go through. You can still optimize the hell out of tests for speed and parallelize them a lot, though.

  • mikeocool 3 days ago

    In my experience, actually getting the code to the prod servers and restarting the app is rarely the slow part of CD. These days it mostly seems to be: 1) building all the JavaScript and 2) running the tests.

    • hellcow 3 days ago

      Can't help you with the JS compile times, sadly. I think that bed is made. :)

      I prefer to run tests locally whenever possible, for instance as a git hook, rather than in a CI instance. If you need auditability for something like PCI, that approach probably won't work, but I think the small web (i.e. most of the web) can do just fine with it.
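
      As a minimal sketch of the hook approach (the test command is whatever your project uses):

          #!/bin/sh
          # .git/hooks/pre-push -- make it executable with chmod +x
          # aborts the push if the test suite fails
          set -e
          make test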

  • syndicatedjelly 3 days ago

    > Every developer knows how to compile locally. Only a few at my company understand Docker's layer caching + mounts and Github Actions' yaml-based workflow configuration in any detail. So when something doesn't work compiling locally, it's also a lot easier for a typical developer to figure out why.

    How is this a good excuse? Is it really that difficult for a developer to spend an afternoon understanding GitHub Actions and Docker, at least at a superficial level so they can understand what they're looking at?

    • HL33tibCe7 3 days ago

      Understanding these things at a level where you can optimise a CI pipeline is actually quite difficult. The evidence for this is that almost every CI pipeline I've ever seen, in the many companies I have been at, has been horrendously unoptimised and slow.

      • syndicatedjelly 3 days ago

        Understanding something and optimizing it are not the same skill set. The author of the article seems to imply that even looking at a pipeline workflow is too daunting for developers, which is what I'm challenging. I personally am working on a horrendously slow pipeline right now, and I agree that it's frustrating to troubleshoot. But ignoring the problem is certainly not the solution - I've slowly but surely chipped away at various aspects and have a much better understanding of our CI/CD now.

        In the words of Jake the Dog, "Sucking at something is the first step to getting good at something."

  • BadBadJellyBean 3 days ago

    We used to do similar things. Then devs pushed stuff to prod without committing it. Then things broke when pushing the version that was in git. Then I forced everything into Docker. Things got better. If you want to do things fast, invest in local test environments. CI/CD is more than just a way to deploy things. And a little friction before pushing things might nudge a developer to write a test for the functionality.

    • hellcow 3 days ago

      I solved this a different way -- only very senior engineers were allowed to access/deploy to production. Senior engineers (by experience, not title!) had a much better understanding of the full system; they better understood the risks during a deploy and what to watch. They were doing the PR reviews as well.

      Many ways to skin the cat. This is just one of them.

      • BadBadJellyBean 2 days ago

        Senior engineers made the mess. Because "I just need to fix this now. I'll commit it later". CI/CD gives you stronger guarantees. You can know which code is now in prod. That is so much harder to ensure when there is human intervention.

  • greggyb 3 days ago

    Why did this get flagged? Seems very on topic for HN and is not blogspam or clickbait.

    • MortyWaves 2 days ago

      I've been noticing quite poor moderation in the last few days, and over the last year or so a general lowering of tone and quality.

  • qudat 3 days ago

    Here's my one-weird-trick:

    On my VM, keep this running:

        # block until a "deploy-app" message arrives, then pull and restart
        while true; do
            ssh pipe.pico.sh sub deploy-app
            docker compose pull && docker compose up -d
        done
    
    On my local machine:

        # publish the deploy event only after a successful build and push
        docker buildx build --push -t ghcr.io/abc/app . && ssh pipe.pico.sh pub deploy-app -e
    
    https://pipe.pico.sh/

    • mdaniel 2 days ago

      Heh, another feather in the "all you need is PostgreSQL" hat: if you already use PG, then LISTEN/NOTIFY can do what that external host does, while also acting as a health check for the instance, since a service that can't reach PG is likely in a bad way anyway (situationally, of course, like all solutions coming from Internet commentary).
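
      A rough sketch of that with plain psql (channel name and deploy commands are illustrative; psql only reports notifications between commands, so the pg_sleep bounds the wake-up latency):

          # subscriber loop on the VM; events arriving between iterations are missed
          while true; do
              psql "$DATABASE_URL" -qc "LISTEN deploy_app;" -c "SELECT pg_sleep(30);" 2>&1 \
                  | grep -q 'Asynchronous notification' \
                  && docker compose pull && docker compose up -d
          done
          # publisher, from anywhere with DB access:
          #   psql "$DATABASE_URL" -c "NOTIFY deploy_app;"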

    • yjftsjthsd-h 3 days ago

      Why not just (from the local machine) run

        docker buildx build --push -t ghcr.io/abc/app . && ssh myvm 'cd /my/app/path && docker compose pull && docker compose up -d'
      
      ?
      • qudat 3 days ago

        That's a really good point!

        That might work if you have a single VM but it's a little more complicated when you have an app on multiple instances.

        pipe is a multicast pubsub which means you can have many subscribers.

  • hellcow 3 days ago

    Oh wow, my blog post. Hi all! I was wondering why I had a huge surge of readership today. I'll be in the comments.

    • ledgerdev 3 days ago

      Thanks so much for this post and the other about provisioning. I'm going to try this exactly. Great suggestion about having caddy just use try_duration to minimize downtime.
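
      For reference, a minimal sketch of that Caddy bit (assuming Caddy v2, where the reverse_proxy directive spells it lb_try_duration; site and upstream address are illustrative):

          example.com {
              reverse_proxy 127.0.0.1:8080 {
                  # keep retrying the upstream briefly while the app restarts
                  lb_try_duration 5s
              }
          }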

    • easterncalculus 3 days ago

      Curious, how did you follow your readership? I thought bearblog didn't have analytics, but I guess you must be running something.

      • hellcow 3 days ago

        They actually do have analytics, but you need to subscribe (it's a small service; if you can, pay a few bucks to support good people!).

        I can see a count of readers per day on each post. It also shows counts of devices, browsers, countries, and referrers. Here's what it looks like: https://herman.bearblog.dev/public-analytics/

  • torvald 3 days ago

    This is a great approach if you have an idea or startup and just want to get things done -- one I would choose 10 out of 10 times when starting something new. You'll know when it’s time to move on, and you can likely postpone that step a bit longer as well.

  • oneplane 3 days ago

    Or (on K8S) you set your drain time to 0, the surge to 9999999% and the PDB to "screw everything". Now your deployments take 2 seconds (the time to pull down your change and run it).

    You also just lost all your guardrails and collaborative controls, as well as created a dependency on all engineers being equally capable.

    In other words, unless you are DHH and don't have to scale (both in terms of workload and terms of company), this scenario doesn't apply in the real world.

    • EatFlamingDeath 3 days ago

      Exactly. I mean, I understand that 45+ minutes to deploy something that takes less than a minute to build is obnoxious, but the pipeline is not always there to only build the app. Deploying in 10 seconds means no safeguards and that you can send broken code to production. And pipelines are about automation too. Having a sane pipeline that will check formatting, linting, test, build and deploy quickly to a server is not that hard. Well, if you don't care for production being down for a couple of minutes, fine, do the "10 second deployment". But, at least for me, even in really small projects, it doesn't make any sense.
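
      As a sketch, such a pipeline can be a single fail-fast script (the individual commands are illustrative; swap in your own toolchain):

          #!/bin/sh
          set -e                       # stop at the first failing gate
          test -z "$(gofmt -l .)"      # formatting
          go vet ./...                 # linting
          go test ./...                # tests
          go build -o app .            # build
          ./deploy.sh                  # deploy, e.g. rsync + restart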

      • oneplane 3 days ago

        Indeed. Same goes for perceived 'slowness'. At the point where you're deploying to production, it should already be fire-and-forget; your local development, tests, acceptance or whatever else you have should already have been done. There is no "gee I wonder what this looks like once deployed". Or at least, there shouldn't be... (another red flag for the 10 second crowd)

  • hilti 3 days ago

    I'm using rsync too. Works great. Unfortunately, my managed server at Hetzner does not allow running Go apps as services. That's the last step left for me to figure out.

  • from-nibly 3 days ago

    DevOps is about putting friction in the right places, not removing it entirely.

  • n_ary 3 days ago

    While the author only briefly mentions Ansible for the next post, it was a dramatic improvement for doing fairly medium-scale deployment and maintenance. The playbook dry run was godsend magic that cured a lot of headaches.

  • indulona 3 days ago

    I too prefer simplicity and getting things done over wasted time and money disguised as a service by some company that tries to get between me and what I want to do.

  • whatever1 3 days ago

    I splurge on a fancier server and just git pull and build on the server. Less than 10 seconds for sure.
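
    A minimal sketch of that (host, path, and build command are illustrative):

        ssh server 'cd /srv/app && git pull --ff-only && make build && systemctl restart app'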

  • BobbyTables2 3 days ago

    Easy - just don’t run any tests!

  • krick 3 days ago

    I mean, isn't this how everybody used to do it? Some trigger to git pull on one node and rsync to the others, plus some reverse-proxy configuration to make it smooth. Then came CI/CD, Docker, k8s, ArgoCD. Honestly, I'm still not convinced that the benefits outweigh the costs, but the choice seems to have been made consciously. So the "secret sauce" is a bit banal here.

    • vrosas 3 days ago

      Meh. The original places that created all this tech had great reasons to do so. Some places today still do, but most that implement these practices or pieces of infra aren't doing it to solve a specific problem; they're just doing it because that's what everyone else does. Not OP, but I worked at a billion-dollar company whose whole deployment process was a very simple Makefile.

  • someothherguyy 2 days ago

    If you don't drain connections, you are slapping users in the face. If your deploy commands fail, you are shipping downtime.

  • renewiltord 2 days ago

    Skip compilation. Just edit in prod. Zero second deploy.

  • deathanatos 3 days ago

    > Gradual rollouts are slow too, taking 5-10 minutes as k8s rolls out updates to pods across the fleet.

    If you're hitting this, you need to look at the service as the problem, not blame the infra layer.

    k8s can absolutely roll out a deployment in <60s, if not <10s. The bottleneck I see, when it is slow, is slow app termination. If your service takes 5 minutes to terminate, it isn't going to matter what the infrastructure layer does. Sometimes this is failing to handle SIGTERM (resulting in k8s having to fall back on timing out & SIGKILL'ing) … but sometimes it's just that the app is slow to terminate, 'cause bugs. But it's those bugs that should get fixed.
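
    As a sketch of the SIGTERM point, a common container pitfall is a shell entrypoint running as PID 1 that never forwards the signal ("app" is a placeholder for the real binary):

        #!/bin/sh
        # without the trap, a shell PID 1 ignores SIGTERM and k8s waits out
        # terminationGracePeriodSeconds before sending SIGKILL
        app &
        pid=$!
        trap 'kill -TERM "$pid"' TERM INT
        wait "$pid"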

    You can also somewhat work around slow termination by setting the surge to 100%. (And … even if you have a fast app, 100% surge might be a good idea, too. As always, it depends. If surging is going to eat up all the available RAM or CPU … maybe not.)
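
    For reference, a sketch of what that looks like in a Deployment spec (the rollingUpdate fields are standard; the values are the illustrative part):

        strategy:
          type: RollingUpdate
          rollingUpdate:
            maxSurge: 100%       # bring up a full replacement set at once
            maxUnavailable: 0    # never dip below the desired replica count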

    And most importantly: the underlying principles guiding k8s's behavior are going to apply just as equally to a shell script. app.service doesn't respond to SIGTERM[1]? You're going to have to decide what to do. Surge or not? Same thing. Potential for surge to result in resource depletion…?

    > bash script

    A service's program/code should generally be owned by root:. A service (generally) does not need the ability to re-write its own code.
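
    A sketch of that split (paths, names, and hardening options are illustrative):

        # binary owned by root; the service account cannot rewrite its own code
        install -o root -g root -m 0755 ./app /usr/local/bin/app
        # in the unit file, run the service as an unprivileged user, e.g.:
        #   User=app-runner
        #   ProtectSystem=strict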

    > Only a few at my company understand Docker's layer caching + mounts and Github Actions' yaml-based workflow configuration in any detail.

    … the working knowledge of either of those two things is not rocket science. The Docker caching is probably the worse of the two, but you only need to understand it to speed up builds.

    While GHA's YAML isn't pretty … it's also hardly complex. And for the most part, if your action simply defers to a script in the repository (e.g., I keep these in ci/), then it's mostly reproducible locally, too. (And there are some tools out there to run a GHA workflow locally, too, if you need more completeness than "just run the same script as the workflow".)
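
    A minimal sketch of the defer-to-a-script pattern (the workflow keys are standard GitHub Actions; the script path mirrors the ci/ convention above):

        name: ci
        on: push
        jobs:
          test:
            runs-on: ubuntu-latest
            steps:
              - uses: actions/checkout@v4
              # all the real logic lives in the repo, runnable locally too
              - run: ./ci/test.sh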

    > Show you how I provision a Debian server using Ansible

    I have spent enough years with Ansible to know all the problems with it, and I'd rather not go back to it.

    ([1] although a vanilla systemd service is going to have an "advantage" in that the default SIGTERM handling is different from a container. So it might look faster, in the case of buggy-app-with-no-SIGTERM-handler will die instantly … but it's probably still a bug, as ax'ing the service is probably also just dropping requests on the floor.)

  • jraph 3 days ago

    Since nobody has dared to state the obvious for the entire hour this post has been on HN, I'll do it, even if it costs me, so nobody else has to.

    > How do you deploy in 10 seconds?

    By editing in production, of course.

    Thank you for your attention.

    • toast0 3 days ago

      If you use something to send your input to multiple terminals, you can edit in production on your whole cluster at once. It can be a little tricky if the servers have diverged a bit.
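
      As a sketch, one way to do that with tmux (assuming each pane is already ssh'd into a different server):

          # mirror keystrokes to every pane in the current window
          tmux setw synchronize-panes on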

      • jraph 3 days ago

        At DevOops Ltd, we love KISS, we move fast, and we know how to administrate your servers in no time and at a bargain. We spare you the Kubernetes anxiety. We've got you covered. Don't worry about it.

      • athenot 3 days ago

        This isn't really that far from using Ansible to deploy. Effectively logging into X servers all at once and editing the things. This can be a valid strategy.

    • kodama-lens 3 days ago

      At my old company we used to git pull and do make install on the prod server.

      Now I have to file an exception for a buffer overflow vulnerability found in libfdisk1, identified in my minimal container image running in a locked-down, read-only container context. Because ITSec has processes for it.

    • eterm 3 days ago

      Once, in my much younger days, I was at a company where the development team worked at a different office from the "head office" product team.

      We were a PoS app, and had no SaaS web offering.

      The head office was keen to get one, so someone there prototyped one. That prototype impressed the board so much that they got a couple of contractors in (without the knowledge of the actual product development team) to flesh out the prototype.

      Eventually the product team admitted what they were doing and brought in the developers to take a look at how they were working.

      They were using git, but not really using it. Because their actual mode of working was to SSH into the production machine, and use vim to edit the code.

      Multiple users. SSHing into the same space, editing the same files. Sure, they had a git log, but it wasn't exactly best practices.

      It was also all PHP, but we were a VB6/.NET shop, so there was also some friction around whether we should all learn PHP and embrace that going forward.

      I was half impressed they managed to get so far with their prototype, half horrified by what I found.

      There was not just no security consideration; they didn't know what an IDOR was.

      I was incredibly impressed by the professionalism of the security auditor they got to review the code. I learned a heck of a lot from him in the few days he was with them. I regret to this day not writing down his name.

      But yeah, those deployments were probably the fastest they would ever deploy.

      Despite being impressed by the forward thinking of the product team to force the issue and get some kind of SaaS web presence, I sharpened up my CV and got a job which didn't involve either VB6 or PHP.

    • samtheprogram 3 days ago

      If you want some sanity, bare git repos, of course :)
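
      A minimal sketch of the bare-repo push-to-deploy setup (paths, branch, and restart command are illustrative):

          # on the server
          git init --bare /srv/app.git

          # /srv/app.git/hooks/post-receive (make it executable with chmod +x):
          #!/bin/sh
          git --work-tree=/srv/app --git-dir=/srv/app.git checkout -f main
          systemctl restart app

          # locally:
          git remote add prod server:/srv/app.git
          git push prod main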

    • fasteo 3 days ago

      John 8:7

    • 3 days ago
      [deleted]
  • phendrenad2 3 days ago

    How does flagging work on HN? People flag something and it's immediately removed from the homepage, and then presumably later the admin looks at it and decides if it should be flagged?

    Because this story being flagged is WILD, y'all.

    • 3 days ago
      [deleted]
  • breck 3 days ago

    10 seconds, why not 1s?

    We deploy in <1s. New sites, mods, small sites, large.

    • crummy 3 days ago

      Perhaps you could talk about how you do it?

      • breck 3 days ago

        Let's take a conservative estimate of 200MB/s write speed on a Digital Ocean Droplet.

        If your site is 100MB: write the new site to disk and read it into memory, kill the old server, and bind the port on the new server, all in under 1 second.
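
        As a sketch of that flow (host, paths, and service name are illustrative):

            # stage the new build, swap a symlink, restart; downtime is roughly
            # the write + bind time, well under a second for a 100MB site
            rsync -az ./dist/ host:/srv/site-new/
            ssh host 'ln -sfn /srv/site-new /srv/site-current && systemctl restart site'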

  • catlover76 2 days ago

    [dead]