How do you deploy in 10 seconds?

(paravoce.bearblog.dev)

62 points | by mpweiher 8 months ago ago

57 comments

100% this is how my company used to deploy. we had multiple servers. rsync'd code to each server and cycled IIS. Worked good. Deploys to our farm took just a minute or two, because it would do a double deploy just to be extra sure everything went out.

Then the BORG came and assimilated us. Our deploys take easily 45+ minutes to really start shifting traffic

[-]

valbaca 8 months ago

what's IIS?

[-]

AndrewDucker 8 months ago

Internet Information Server - the MS web server.

[-]

pier25 8 months ago

Is it still in use? I thought it had been replaced by Kestrel.

[-]

AndrewDucker 8 months ago

Kestrel does ASP.Net very quickly, because that's all it does. If you want a full web server, that does things like static files, SSL, Reverse Proxying, etc. then you want IIS at the very least sitting in front of it.

See https://stackify.com/kestrel-web-server-asp-net-core-kestrel... - comparison table 3/4 of the way down.

notwhereyouare 8 months ago

as andrew mentioned, microsoft's server. we are a .net shop

0xbadcafebee 8 months ago

I've regularly gotten CI/CD deploys down to <30 seconds without a ton of fancy caching. You just need to look at what's taking a lot of time, and optimize.

- On commit/push, your build runs once, and stores in an artifact. If nothing has changed, don't rebuild, reuse.

- Your build gets packed into a Docker container once and pushed to a remote registry. If nothing has changed, don't rebuild, reuse.

- Every test and subsequent stage uses the same build artifact and/or container. Again, this is as simple as pulling a binary or image. Within the same pipeline workspace, it's a file on disk shared between jobs.

- Using a self-hosted CI/CD runner, on the same network and provider as your artifact/container registry, means extremely low-latency, high-bandwidth file transfers. And because it's self-hosted, you don't have to wait for a runner to be allocated, it's waiting for you; it's just connected to remotely and immediately used. K8s runners on autoscaling clusters make it easy to scale jobs in parallel.

- Having each pipeline step use a prebuilt Docker container, and not having each step do a bunch of repetitive stuff (like installing tools, downloading deps...) when they don't need to, is essential. If every single job is doing the same network transfer and same tool install every time, optimize it.

- A kubernetes deploy to production should absolutely take its time to cycle an old and new pod. Half of the point of K8s is to prevent interrupting production traffic and rely on it to automatically resolve issues and prevent larger problems. This means leaning on health checks and ramping traffic for safety. But actually running the deploy part should be nearly instantaneous, a `kubectl apply` or `helm upgrade` should take seconds.

The only exception to all this is if you (rightly) have a very large test suite that takes a while to go through. You can still optimize the hell out of tests for speed and parallelize them a lot, though.

mikeocool 8 months ago

In my experience actually getting the code to the prod servers and restarting the app is rarely the slow part of CD. These days it mostly seems to be: 1) building all the javascript and 2) running the tests.

[-]

hellcow 8 months ago

Can't help you with the JS compile times, sadly. I think that bed is made. :)

I prefer to run tests locally whenever possible, for instance as a git hook, rather than in a CI instance. If you need auditability for something like PCI, that approach probably won't work, but I think the small web (i.e. most of the web) can do just fine with it.

syndicatedjelly 8 months ago

> Every developer knows how to compile locally. Only a few at my company understand Docker's layer caching + mounts and Github Actions' yaml-based workflow configuration in any detail. So when something doesn't work compiling locally, it's also a lot easier for a typical developer to figure out why.

How is this a good excuse? Is it really that difficult for a developer to spend an afternoon understanding GitHub Actions and Docker, at least at a superficial level so they can understand what they're looking at?

[-]

HL33tibCe7 8 months ago

Understanding these things at a level where you can optimise a CI pipeline is actually quite difficult. The evidence for this is that almost every CI pipeline I’ve ever seen, in the many companies have been at, has been horrendously unoptimised and slow.

[-]

syndicatedjelly 8 months ago

Understanding something and optimizing it are not the same skill set. The author of the article seems to imply that even looking at a pipeline workflow is too daunting for developers, which is what I'm challenging. I personally am working on a horrendously slow pipeline right now, and I agree that it's frustrating to troubleshoot. But ignoring the problem is certainly not the solution - I've slowly but surely chipped away at various aspects and have a much better understanding of our CI/CD now.

In the words of Jake the Dog, "Sucking at something is the first step to getting good at something."

BadBadJellyBean 8 months ago

We used to do similar things. Then devs pushed stuff to prod without committing them. Then things broke when pushing the version that was in git. Then I forced everything into docker. Things got better. If you want to do things fast invest into local test invironments. CI/CD is more than just a way to deploy things. And maybe a little friction to just push things might push a developer to write a test for the functionality.

[-]

hellcow 8 months ago

I solved this a different way -- only very senior engineers were allowed to access/deploy to production. Senior engineers (by experience, not title!) had a much better understanding of the full system; they better understood the risks during a deploy and what to watch. They were doing the PR reviews as well.

Many ways to skin the cat. This is just one of them.

[-]

BadBadJellyBean 8 months ago

Senior engineers made the mess. Because "I just need to fix this now. I'll commit it later". CI/CD gives you stronger guarantees. You can know which code is now in prod. That is so much harder to ensure when there is human intervention.

greggyb 8 months ago

Why did this get flagged? Seems very on topic for HN and is not blogspam or clickbait.

[-]

MortyWaves 8 months ago

I've been noticing quite poor moderation in the last few days but generally over the last year or so a general lowering of tone and quality.

qudat 8 months ago

Here's my one-weird-trick:

On my VM, keep this running:

    while true; do ssh pipe.pico.sh sub deploy-app; docker compose pull && docker compose up -d; done

On my local machine:

    docker buildx build --push -t ghcr.io/abc/app .; ssh pipe.pico.sh pub deploy-app -e

https://pipe.pico.sh/

[-]

mdaniel 8 months ago

Heh, another feather in the "all you need is PostgreSQL" hat: if you already use PG, then LISTEN/NOTIFY can do what that external host does as well as acting like a health check for the instance since if it can't access PG it likely is in a bad way (situationally, of course, like all solutions coming from Internet commentary)

yjftsjthsd-h 8 months ago

Why not just (from the local machine) run

  docker buildx build --push -t ghcr.io/abc/app . && ssh myvm 'cd /my/app/path && docker compose pull && docker compose up -d'

?

[-]

qudat 8 months ago

That's a really good point!

That might work if you have a single VM but it's a little more complicated when you have an app on multiple instances.

pipe is a multicast pubsub which means you can have many subscribers.

hellcow 8 months ago

Oh wow, my blog post. Hi all! I was wondering why I had a huge surge of readership today. I'll be in the comments.

[-]

ledgerdev 8 months ago

Thanks so much for this post and the other about provisioning. I'm going to try this exactly. Great suggestion about having caddy just use try_duration to minimize downtime.

easterncalculus 8 months ago

Curious, how did you follow your readership? I thought bearblog didn't have analytics, but I guess you must be running something.

[-]

hellcow 8 months ago

They actually do have analytics, but you need to subscribe (it's a small service, if you can, pay a few bucks to support good people!).

I can see a count of readers per day on each post. It also shows counts of devices, browsers, countries, and referrers. Here's what it looks like: https://herman.bearblog.dev/public-analytics/

[-]

easterncalculus 8 months ago

Thanks!

torvald 8 months ago

This is a great approach if you have an idea or startup and just want to get things done -- one I would choose 10 out of 10 times when starting something new. You'll know when it’s time to move on, and you can likely postpone that step a bit longer as well.

oneplane 8 months ago

Or (on K8S) you set your drain time to 0, the surge to 9999999% and the PDB to "screw everything". Now your deployments take 2 seconds (the time to pull down your change and run it).

You also just lost all your guardrails and collaborative controls, as well as created a dependency on all engineers being equally capable.

In other words, unless you are DHH and don't have to scale (both in terms of workload and terms of company), this scenario doesn't apply in the real world.

[-]

EatFlamingDeath 8 months ago

Exactly. I mean, I understand that 45+ minutes to deploy something that takes less than a minute to build is obnoxious, but the pipeline is not always there to only build the app. Deploying in 10 seconds means no safeguards and that you can send broken code to production. And pipelines are about automation too. Having a sane pipeline that will check formatting, linting, test, build and deploy quickly to a server is not that hard. Well, if you don't care for production being down for a couple of minutes, fine, do the "10 second deployment". But, at least for me, even in really small projects, it doesn't make any sense.

[-]

oneplane 8 months ago

Indeed. Same goes for perceived 'slowness'. At the point where you're deploying to production, it should already be fire-and-forget; your local development, tests, acceptance or whatever else you have should already have been done. There is no "gee I wonder what this looks like once deployed". Or at least, there shouldn't be... (another red flag for the 10 second crowd)

hilti 8 months ago

I‘m using rsync too. Works great. Unfortunately my managed server at Hetzner does not allow to run Go apps as services. That‘s the last step to figure out for me.

samuli 8 months ago

Here's how Twitter did it 15 years ago: https://blog.x.com/engineering/en_us/a/2010/murder-fast-data...

n_ary 8 months ago

While the author briefly mention Ansible for the next post, it was a dramatic improvement for doing fairly medium scale deployment and maintenance. The playbook dry run was godsend magic that cured a lot of headache.

from-nibly 8 months ago

DevOps is about putting friction in the right places, not removing it entirely.

indulona 8 months ago

I too prefer simplicity and getting things done over wasted time and money disguised as a service by some company that tries to get in between me and what i want to do.

whatever1 8 months ago

I splurge on a fancier server and just git pull and build on the server. Less than 10 seconds for sure.

BobbyTables2 8 months ago

Easy - just don’t run any tests!

krick 8 months ago

I mean, isn't it like everybody used to do it? Some trigger to git pull to one node and rsync to others. Plus some reverse-proxy configuration to make it smooth. Then came CI/CD, Docker, k8s, ArgoCD. Honestly, I'm still not convinced that benefits outweight the costs, but the choice seems to have been made consciously. So the "secret source" is bit banal here.

[-]

vrosas 8 months ago

Meh. The original places that created all this tech had great reasons to do so. Somes places today still do, but most that implement these practices or pieces of infra aren't do it to solve a specific problem, they're just doing it because that's what everyone else does. Not OP, but I worked at a billion dollar company whose whole deployment process was very simple Makefile.

someothherguyy 8 months ago

if you don't drain connections, you are slapping users in the face. if your deploy commands fail, you are shipping downtime.

deathanatos 8 months ago

> Gradual rollouts are slow too, taking 5-10 minutes as k8s rolls out updates to pods across the fleet.

If you're hitting this, you need to take a look into the service as the problem, not blame the infra layer.

k8s can absolute roll out a deployment in <60s, if not <10s. The bottleneck I see, when it is slow, is slow app termination. If your service takes 5 minutes to terminate, it isn't going to matter what the infrastructure layer does. Sometimes this is failing to handle SIGTERM (resulting in k8s having to fall back on timing out & SIGKILL'ing) … but sometimes it's just the app is slow to terminate, 'cause bugs. But it's those bugs that should get fixed.

You can somewhat workaround it by setting the surge to 100%. (And … even if you have a fast app, 100% surge might be a good idea, too. As always, it depends. If surging is going to eat up all the available RAM or CPU … maybe not.)

And most importantly: the underlying principles guiding k8s's behavior are going to apply just as equally to a shell script. app.service doesn't respond to SIGTERM[1]? You're going to have to decide what to do. Surge or not? Same thing. Potential for surge to result in resource depletion…?

> bash script

A service's program/code should generally be owned by root:. A service (generally) does not need the ability to re-write its own code.

> Only a few at my company understand Docker's layer caching + mounts and Github Actions' yaml-based workflow configuration in any detail.

… the working knowledge of either of those two things is not rocket science. The docker caching is probably the worst of the two; but you only need to understand it to speed up builds.

While GHA's YAML isn't pretty … it's also hardly complex. And for the most part, if your action simply defers to a script in the repository (e.g., I keep these in ci/), then it's mostly reproducible locally, too. (And there are some tools out there to run a GHA workflow locally, too, if you need more completeness than "just run the same script as the workflow".)

> Show you how I provision a Debian server using Ansible

I have spent enough years with Ansible to know all the problems with it, and I'd rather not go back to it.

([1] although a vanilla systemd service is going to have an "advantage" in that the default SIGTERM handling is different from a container. So it might look faster, in the case of buggy-app-with-no-SIGTERM-handler will die instantly … but it's probably still a bug, as ax'ing the service is probably also just dropping requests on the floor.)

renewiltord 8 months ago

Skip compilation. Just edit in prod. Zero second deploy.

jraph 8 months ago

Since nobody has dared to state the obvious for the entire hour this post has been posted on HN, I'll do it, even if it costs me, so nobody else has to.