Personally - just let the developer own the machine they use for development.
If you really need consistency for the environment, let them own the machine, then give them a stable base VM image and pay for decent virtualization tooling that they run... on their own machine.
I have seen several attempts to move dev environments to a remote host. They invariably suck.
Yes - that means you need to pay for decent hardware for your devs, it's usually cheaper than remote resources (for a lot of reasons).
Yes - that means you need to support running your stack locally. This is a good constraint (and a place where containers are your friend for consistency).
Yes - that means you need data generation tooling to populate a local env. This can be automated relatively well, and it's something you need with a remote env anyways.
---
The only real downside is data control (i.e., the company has less control over how a developer manages assets like source code). In my experience, the vast majority of companies should worry less about this - your value as a company isn't your source code in 99.5% of cases, it's the team that executes that source code in production.
If you're in the 0.5% of other cases... you know it and you should be in an air-gapped closed room anyways (and I've worked in those too...)
And the reason they suck is that the feedback loop is just too slow compared to running locally. You have to jump through hoops to debug/troubleshoot your code or any issue you come across between your code and its output. And it's almost impossible to work on things when you have spotty internet.
I haven't worked on extremely sensitive data but for PII data from prod to dev, scrubbing is a good practice to follow. This will vary based on the project/team you're on of course.
We tried this approach at a former company with ~600 engineers at the time.
Booting the full service on a single machine required every single developer in the company to install ~50ish microservices on their machine for things to work correctly. It became totally intractable.
I guess one can grumble about bad architecture all day, but this had to be solved. We had to move to remote development environments, which restored everyone's sanity.
Both FAANG companies I’ve worked at had remote dev environments that were built in house.
On most teams/products I have been involved in, the stack eventually grows to the point that a dev can no longer test it on their own machine, regardless of how big the machine is. And having a development machine different from production leads to completely predictable and unavoidable problems. Devs need to create the software tooling to make remote dev less painful. I mean, they're devs... making software is kind of their whole thing.
I have used remote dev machines just fine, but my workflow vastly differs from many of my coworkers: terminal-only spacemacs + tmux + mosh. I have a lot of CLI and TUI tools, and I do not use VScode at all. The main GUI app I run is a browser, and that runs locally.
I have worked on developing VMs for other developers that rely on a local IDE. The main sticking point is syncing and schlepping source code (something my setup avoids because the source code and editor are on the remote machine). I have tried a number of approaches, and I sympathize with the article author. So, in response to "Devs need to create the software tooling to make remote dev less painful. I mean, they're devs... making software is kind of their whole thing." <-- syncing and schlepping source code is by no means a solved problem.
I can also say that, my spacemacs config is very vanilla. Like my phone, I don't want to be messing with it when I want to code. Writing tooling for my editor environment is a sideshow for the work I am trying to finish.
I am hardly a dev but occasionally have had to do some scripting or web stuff, and I have really loved VSCode and its remote SSH support - it basically feels like I'm coding locally. Does that not work for your devs?
We have a project which spawns around 80 Docker containers and runs pretty OK on a 5 year old Dell laptop with 16GB RAM. The fans run crazy and the laptop is always very hot but I haven't noticed considerable lags, even with IntelliJ running. Most services are written in Go though and are pretty lightweight.
"Most teams/products I have been involved in, the stack always grows to the point that a dev can no longer test it on their own machine"
Isn't this problem solved by CI/CD? When the developer is ready to test, they make a commit, and the pipeline deploys the code to a dev/test environment. That's how my teams have been doing it.
This turns a 1 hour task into a 1 day task. Fast feedback cycles are critical to software development.
I don't quite understand how people get into the situation where their work can't fit on their workstation. I've worked on huge projects at huge tech companies, and I could run everything on my workstation. I've worked at startups where CI was passing 5% of the time and required 3 hours to run; now it runs on your workstation in seconds. What you do is fix the stuff that doesn't fit.
The most insidious source of slowness I've encountered is tests that use test databases set to fsync = on. This severely limits parallelism and speed in a way that's difficult to diagnose; you have plenty of CPU and memory available, but the tests just aren't going very fast. (I don't remember how I stumbled upon this insight. I think I must have straced Postgres and been like "ohhhhhhhhh, of course".)
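For anyone who hits this: the standard trick for disposable test databases (never production) is to turn durability off in postgresql.conf. All three settings below are real Postgres knobs:

    fsync = off                  # skip flushing WAL to disk
    synchronous_commit = off     # don't wait for WAL writes at commit
    full_page_writes = off       # fine only because we don't care about crash recovery here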
It's likely you haven't come across these use cases in your professional career, but I assure you it's very common. My entire career has only seen projects where you need dozens to hundreds of CPUs to have a short feedback loop to verify the system works. I saw this with everything from simple automotive algorithms to Advanced Driver Assistance Systems and machine learning applications.
When you are working on a software project that has 1,000 active developers checking in code daily and requires a stable system build, you need lots of compute.
> the stack always grows to the point that a dev can no longer test it on their own machine
So the solution here is to not have that kind of "stack".
I mean, if it's all so big and complex that it can't be run on a laptop then you've almost certainly got a lot of problems regardless. What typically happens is tons of interconnected services without clear abstractions or interfaces; no one really understands this spaghetti mess, and people just keep piling crap on top of it.
This leads to all sorts of problems. Everywhere I've seen this happen they had real problems running stuff in production too, because it was a complex spaghetti mess. The abstracted "easy" dev env (in whatever form that came) is then also incredibly complex, finicky, and brittle. Never mind running tests, which is typically even worse. It's not uncommon for it all to be broken for every other new person who joins, because changes somewhere broke the setup steps that are only run for new people. Everyone else is afraid to do anything with their machine "because it now works".
There are some exceptions where you really need a big beefy machine for a dev env and tests, maybe, but they're few and far between.
Really? I can't imagine not running the code locally. Honestly, my company has a microservices architecture, and I will just comment out the docker-compose pieces that I am not using. If I am developing/testing a particular component then I will enable it.
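(For what it's worth, Compose has a built-in mechanism for this so you don't have to comment things out. A rough sketch - service names and images made up:)

    services:
      api:
        build: ./api
        ports: ["8080:8080"]
      db:
        image: postgres:16
      analytics:                                  # heavy, rarely-needed service
        image: clickhouse/clickhouse-server:24.8
        profiles: ["analytics"]                   # skipped by a plain `docker compose up`
    # docker compose up                     -> api + db only
    # docker compose --profile analytics up -> everything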
In my last role as a director of engineering at a startup, I found that a project `flake.nix` file (coupled with simply asking people to use https://determinate.systems/posts/determinate-nix-installer/ to install Nix) led to the fastest "new-hire-to-able-to-contribute" time of anything I've seen.
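For concreteness, the kind of flake.nix I mean is tiny; the package set here is illustrative, not what we actually shipped:

    {
      inputs.nixpkgs.url = "github:NixOS/nixpkgs/nixos-24.05";
      outputs = { self, nixpkgs }:
        let pkgs = nixpkgs.legacyPackages.x86_64-linux; in {
          # `nix develop` drops every dev into the same toolchain
          devShells.x86_64-linux.default = pkgs.mkShell {
            packages = [ pkgs.go pkgs.nodejs pkgs.postgresql ];
          };
        };
    }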
Unfortunately, after a few hires (hand-picked by me), this is what happened:
1) People didn't want to learn Nix, didn't want to ask me how to make something work with Nix, and didn't tell me they didn't want to learn Nix. In essence, I told them to set the project up with it, which they'd do (and which would be successful, at least initially), but I forgot that I also had to sell them on it. In one case, a developer spent all weekend (of HIS time) uninstalling Nix and making things work using the "usual crap" (as I would call it), all because of an issue I could have fixed in probably 5 minutes if he had just reached out to me (which he did not, to my chagrin). The first time I heard them voice their true feelings on it was when I pushed back on this, because I would have gladly helped... I've mentioned this on various Slacks to get feedback and people have basically said "you either insist on it and say it's the only supported developer-environment-defining framework, or you will lose control over it" /shrug
2) Developers really like to have control over their own machines (but I failed to assume they'd also want this control over the project dependencies, since, after all, I was the one who decided to control mine with the flake.nix in the first place!)
3) At a startup, execution is everything and time is possibly too short (especially if you have kids) to learn new things that aren't simple, even if better... that unfortunately may include Nix.
4) Nix would also be perfect for deployments... except that there is no (to my knowledge) general-purpose, broadly-accepted way to deploy via Nix, except to convert it to a Docker image and deploy that, which (almost) defeats most of the purpose of Nix.
I still believe in Nix but actually trying to use it to "perfectly control" a team's project dependencies (which I will insist it does do, pretty much, better than anything else) has been a mixed bag. And I will still insist that for every 5 minutes spent wrestling with Nix trying to get it to do what you need it to do, you are saving at least an order of magnitude more time spent debugging non-deterministic dependency issues that (as it turns out) were only "accidentally" working in the first place.
After my personal 2-year experiment with NixOS, I'd avoid anything Nix like the plague, and would be looking for a new job if anyone instituted a Nix-only policy.
It's not the learning new things that's a problem, but rather the fact that every little issue turns into a 2-day marathon that's eventually solved with a 1-line fix. And that's because the feedback loop and general UX is just awful - I really started to feel like I needed a sacrificial chicken.
Docker may be a dumpster fire, but at least it's generally easy to see what you did wrong and fix it.
People just straight-up don’t want to learn. There are always exceptions, of course, but IME the majority of people in tech are incurious. They want to do their job, and get paid. Reading man pages is sadly not in that list.
Where "job" is defined in a narrowest way possible to assume minimum responsibility. Still want to get 200k+ salaries though...
This may sound extreme (it really isn't), but as Dr of Eng, TP's job was to suss those folks out as early as possible and part ways (the kind where they go work for someone else). Some folks are completely irrational about their setups, and no amount of appeasement in the form of "whys" and training is usually sufficient.
This has always made me sad, but I think you're right in a lot of cases. What I've always tried to do is focus on basic productivity: make sure everyone has everything they need to do their work, and that most people do it the same way, so you can make progress on the learning journey together. Whenever people ask me for help and want to set up a meeting (not just "please answer this on Slack and I'll leave you alone"), I record the meeting, try to touch on all the areas related to their problem, and then review the recording for things that would be interesting to write about. If any of the digressions are interesting, I go into Notion, create a new page, and write up a couple of paragraphs. Then I give my team "ever wonder what dynamic linking is and how to debug it?" and they can read it and know as much as I know.
I really, really struggle to deal with the fact that people don't know as much as I do (I wrote my first program when I was 4 and I'm 39 now), but I have accepted that it's not a weakness on their part, it's a weakness on my part. I wouldn't lower my standards (as a manager once suggested), but I do feel like it's my obligation to lead them on a journey of learning. That is to say, people don't learn without teaching, so be a teacher.
I think this, or something of equal complexity, is probably the right choice. I have spent a lot of time helping people with their dev environments, and the same problems keep coming up; "no, you need this version of kubectl", "no, you need this version of jq", "no, the Makefile expects THIS version of The Silver Searcher". A mass of shell scripts and random utilities was a consistent drag on the entire team and everyone that interacted with the team.
I ended up going with Bazel, not because of this particular problem alone (though it was part of it; people we hired spent WEEKS trying to get a happy edit/test/debug cycle going), but because proper dependency-based test caching was sorely needed. Using Bazel and Buildbuddy brought CI down from about 17 minutes per run to 3-4 minutes for a typical change, which meant that even if people didn't want to get a local setup going, they could at least be slightly productive. I also made sure that every dependency / tool useful for developing the product was versioned in the repository, so if something needs `psql` you can `bazel run //tools/postgres/psql` and have it just work. (Hate that Postgres can't be statically linked, though.)
It was a lot of work for me, and people do gripe about some things ("I liked `go test ./...`, I can't adjust to `bazel test ...`"), but all in all, it does work well. I would do it again. Day 1 at the company; git clone our thing, install bazelisk, and your environment setup is done. All the tests pass. You can run the app locally with a simple `bazel run`. I'm pretty happy with the outcome.
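To give a flavor of the tool pinning: each tool gets a tiny wrapper target. This BUILD file is a hypothetical sketch (the external @postgres repo and its labels would be pinned in the workspace config, and are made up here):

    # tools/postgres/BUILD.bazel
    sh_binary(
        name = "psql",
        srcs = ["psql.sh"],              # thin wrapper script that execs the pinned binary
        data = ["@postgres//:bin/psql"], # hypothetical external repo label
    )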
Nix is something I looked into for our container images, but they just end up being too big. I never figured out why; I think a lot of things are dynamically linked and they include their own /usr/lib tree with the entire transitive dependency chain for that particular app, even if other things you have installed have some overlap with that dependency chain. I prefer the approach of statically linking everything and only including what you need. I compromised by basing things on Debian and rules_distroless, which at least lets you build a container image with the exact same sha256 on two different machines. (We previously just did "FROM scratch; COPY <statically linked binary> /app; ENTRYPOINT /app", but then started needing things like pg_dump in our image. If you can just have a single statically-linked binary be your entire app, great. Sometimes you can't, and then you need some sort of reasonable solution. Also everything ends up growing a dependency on ca-certificates...)
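(For the single-binary case, the relevant bit is building without cgo so the Go binary is truly static. A sketch:)

    # fully static Go binary, safe for FROM scratch
    CGO_ENABLED=0 go build -trimpath -ldflags='-s -w' -o app ./cmd/app
    # TLS needs CA roots even in scratch images; the usual fix is copying them in:
    #   COPY --from=build /etc/ssl/certs/ca-certificates.crt /etc/ssl/certs/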
From my perspective, installing Nix seems pretty invasive. I can understand if someone doesn't want to mess with their system "unnecessarily", especially if the tool and its workings are foreign. And I can't really remember the last time I had issues with non-deterministic dependencies either. Dependency versions are locked. Maybe I'm missing something?
Typically when you start a new dev job the company will provide you with a pre-provisioned laptop that has their security stuff set up and maybe dev tools already installed, e.g. source code, compilers, VMs, Nix, and a supported editor, so it's not exactly a personal machine that they're messing with.
I think if you take about 80% of your comment and replace "Nix" with "Haskell/Lisp" and a few other techs, you'd basically have the same thing. Especially point #1.
Too true. I think there's a lot of people who don't want control; freedom is responsibility, as the saying goes, and responsibility can be stressful, even if it's liberating also.
Heh, yeah. You gotta put in writing that only userlands defined in Nix will be eligible to enter any environment beyond "dev". And (also put in writing) that their performance in the role will be partly evaluated on their ability to reach out for help with Nix when they need it.
OP here. There definitely is a place for running things on your local machine. Exactly as you say: one can get a great deal of consistency using VMs.
One of the benefits of moving away from Kubernetes to a runner-based architecture is that we can now seamlessly support cloud-based and local environments (https://www.gitpod.io/blog/introducing-gitpod-desktop).
What's really nice about this is that with this kind of integration there's very little difference in setting up a dev env in the cloud or locally. The behaviour and qualities of those environments can differ vastly though (network bandwidth, latency, GPU, RAM, CPUs, ARM/x86).
"Hm, why does my Go service on a pod with 2.2 cpu's think it has 6k? Oh, it thinks it has the whole cluster. Nice; that is why scheduling has been an issue"
Hi Christian. We just deployed Gitpod EKS at our company in NY. Can we get some details on the replacement architecture? I’m sure it’s great but the devil is always in the details.
> I have seen several attempts to move dev environments to a remote host. They invariably suck.
It's a short leap from that to "therefore they will always suck and have no benefits and nobody should ever use them ever". Apologies for the hyperbole, but I'm making a point that comments like these tend to shut down interesting explorations of the state of the art of remote computing and what the pros/cons are.
Edit: In a world where users demand that companies implement excellent security, we must allow those same companies to limit physical access to their machines as much as possible.
But they don't suck because of lack of effort - they suck because there are real physical constraints.
Ex - even on a VERY good connection, RTT on the network is going to exceed your frame latency for a computer sitting in front of you (before we even get into the latency of the actual frame rendering of that remote computer). There's just not a solution for "make the light go faster".
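To put rough numbers on it: a 120 Hz display draws a frame every ~8 ms, so even an excellent 30 ms RTT to a remote dev box costs several frames per round trip before the remote machine has done any work at all.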
Then we get into the issues the author actually laid out quite compellingly - Shared resources are unpredictable. Is my code running slowly right now because I just introduced an issue, or is it because I'm sharing an env and my neighbor just ate 99% of the CPU/IO, or my network provider has picked a different route and my latency just went up 500ms?
And that's before we even touch the "My machine is down/unreachable, I don't know why and I have no visibility into resolving the issue, when was my last commit again?" style problems...
> Edit: In a world where users demand that companies implement excellent security, we must allow those same companies to limit physical access to their machines as much as possible.
And this... is just bogus. We're not talking about machines running production data. We're talking about a developer environment. Sure - limit access to prod machines all you like; while you're at it, don't give me any production user data either - I sure as hell don't want it for local dev. What I do want is a fast system that I control so that I can actually tweak it as needed to develop and debug the system. It is almost impossible to give a developer "the least access needed" to do development locally, because if you knew exactly what access was needed, you wouldn't still be developing.
> But they don't suck because of lack of effort - they suck because there are real physical constraints.
They do suck due to lack of effort or investment. FAANG companies have remote dev experiences that are decent - or even great - because they invest obscene amounts into dev tooling.
There are physical constraints on the flip side, too: gigantic codebases or datasets that don't fit on dev laptops, or workloads that need lower latency to other services in the DC.
Added bonus: smaller attack surface area for adversaries who want to gain access to your code.
At least with Google, they also have a data center near where most developers work, so that they have much lower latency.
They can't make the light go faster, but they can make it so it doesn't go as far. Smaller companies usually don't have a lot of flexibility with that though.
> Personally - just let the developer own the machine they use for development.
Overall I agree with you that this is how it should be, but as DevOps working with so many development teams, I can tell you that too many developers know a language or two but beyond that barely know how to use a computer. Most developers (yes, even most of the ones in Silicon Valley or the larger Bay Area) with Macbooks will smile and nod when you tell them that Docker Desktop runs a virtual machine running a copy of Linux to run OCI images, and then not too much later reveal themselves to have been clueless.
Commenters on this site are generally expected to be in a different category. Just wanted to share that, as a seasoned DevOps pro, I can tell you it's pretty rough out there.
This is an unpopular take, but entirely true. Skill at a programming language, other than maybe C, does not in any way translate to general skill with system administration, or even knowing how to correctly operate a computer. I once had to explain to a dev that their Mac was out of disk space because (a) they had never removed dangling containers or old image versions and (b) they had never emptied the Trash.
I think nowadays source code is rarely a more valuable asset than the data being processed. Also, I would prefer to give my devs a second machine to run workloads, and to pull in data or mock the data so they get moving more easily.
> The only real downside is data control (i.e., the company has less control over how a developer manages assets like source code). In my experience, the vast majority of companies should worry less about this [...]
I once had to burn a ton of political capital (including some on credit), because someone who didn't understand software thought that cutting-edge tech startup software developers, even including systems programmers working close to metal, could work effectively using only virtual remote desktops... with a terrible VM configuration... from servers literally halfway around the world... through a very dodgy firewall and VPN... of 10Mb/s total bandwidth... for the entire office of dozens of developers.
(And no other Internet access from the VMs. Administrators would copy whatever files from the Internet that are needed for work. And there was a bureaucratic form for a human process, if you wanted to request any code/data to go in or out. And the laptops/workstations used only as thin-clients for the remote VMs would have to be Windows and run this ridiculous obscure 'endpoint security' software that had changed hands from its ancient developer, and hadn't even updated the marketing materials (e.g., a top bulletpoint was keeping your employees from wasting time on a Web site that famously was wiped out over a decade earlier), and presumably was littered with introduced vulnerabilities and instabilities.)
Note that this was not something like DoD, nor HIPAA, nor finance. Just cutting-edge tech on which (ironically) we wanted first-mover advantage.
This escalated to the other top-titled software engineer and I together doing a presentation to the C-suite, on why not only would this kill working productivity (especially in a startup that needed to do creative work fast!), but the bad actors someone was paranoid about could easily circumvent it anyway to exfiltrate data (using methods obvious to skilled software people like the ones they hired, some undetectable by any security product or human monitoring they imagined), and all the good rule-following people would quit in incredulous frustration.
Unfortunately, it might not have been even the CEO's call, but a crazy investor.
If your app fits on one machine, I agree with you: you absolutely should not use cloud dev environments in my opinion (and I've worked on large dev infra teams, that shipped cloud dev environments). The performance and latency of a Macbook Pro (or Framework 13, or whatever) is going to destroy cloud perf for development purposes.
If it doesn't fit on one machine, though, you don't have another option: Meta, for example, will never have a local dev env for Instagram or Blue. Then you need to make some hard choices.
Personally, my ideal cloud dev env is:
1. Local checkout of the code you're working on. You can use whatever IDE or text editor you prefer. For large monorepos, you'll need some special tooling to make sure it's easy to only check out slices of the repo.
2. Sync the code to the remote execution environment automatically, with hot-reloading (a rough sketch follows this list).
3. Auto-port-forward from your local machine to the remote.
4. Optionally be able to run dependent services on your personal remote to debug/test their interactions with each other, and optionally be able to connect to a well-maintained shared environment for dependencies you aren't working on. If you have a shared environment, it can't be viewed as less-important than production: if it's broken, it's a SEV and the team that broke it needs to drop everything and fix it immediately. (Otherwise the shared env will be broken all the time, and your shipping speed will either drop, or you'll constantly be shipping bugs to prod due to lack of dev care.)
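A deliberately naive sketch of steps (2) and (3) - real tooling like mutagen or watchman-based syncers does this far better, and "devbox" and the paths here are made up:

    # one-way sync loop + port-forward
    while true; do
      rsync -az --delete --exclude='.git' ./ devbox:~/src/myapp/
      sleep 1
    done &
    ssh -N -L 8080:localhost:8080 devbox   # remote service now reachable on localhost:8080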
At Meta we didn't have (1): everyone had to use VSCode, with special in-house plugins that synced to the remote environment. It was okay but honestly a little soul-sucking; I think customizing your tooling is part of a lot of people's craft and helps maintain their flow state. Thankfully we had the rest, so it was tolerable if not enjoyable. At Airbnb we didn't have the political will to enforce (4), so the dev env was always broken. I think (4) is actually the most critical part: it doesn't matter how good the rest of it is, if the org doesn't care about it working.
But yeah — if you don't need it, that's a lot of work and politics. Use local environments as long as you possibly can.
> Personally - just let the developer own the machine they use for development.
It'll work if the company can offer something similar to EC2. Unfortunately, most companies can't do that if they're not on the cloud.
It is sorta like Vagrant, but instead of VirtualBox virtual machines you use podman containers. This way you get to use OCI images for your "dev environment" that integrate directly into your desktop.
There are some challenges related to usermode networking for non-root-managed controllers, and desktop integration has some additional complications. But besides that it has almost no overhead, and you can have unfettered access to things like GPUs.
Also it is usually pretty easy to convert your normal docker or kubernetes containers over to something you can run on your desktop.
Also, it is possible to use things like Kubernetes pod definitions to deploy sets of containers with podman and manage them with systemd and such. So you can have "clouds of containers" that your dev container needs access to, locally.
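On reasonably recent podman that looks something like this (the manifest name is illustrative):

    podman kube play dev-stack.yaml                           # start containers from a k8s Pod spec
    podman generate systemd --new --files --name dev-stack    # emit unit files so systemd manages it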
If there is a corporate need for Windows-specific applications, then running Windows VMs or doing remote applications over RDP is a possible workaround.
If everything you are targeting as a deployment is going to be Linux-everything, then it doesn't make a lot of sense to jump through a bunch of hoops and cause a bunch of headaches just to avoid having it as the workstation OS.
If you're doing this, there are many cases where you might as well just spin up a decent Linux server and give your developers accounts on that? With some pretty basic setup everyone can just run their own stuff within their own user account.
You'll run into occasional issues (e.g. if everyone is trying to run default Node.js on the default port), but with some basic guardrails it feels like it should be OK?
I'm remembering back to when my old company ran a lot of PHP projects. Each user just had their own development environment and their own Apache vhost. They wrote their code and tested it in their own vhost. Then we'd merge to a single separate vhost for further testing.
I am trying to remember anything about what was painful about it but it all basically Just Worked. Everyone had remote access via VPN; the worst case scenario for them was they'd have to work from home with a bit of extra latency.
That's fine for some, but it's not always that simple. I wrote an entire site on my iPad in my spare time with Gitpod. Maybe you are at a small company with a small team, so if things get critical you are likely to get a call. Do you say F' it, do you carry your laptop, or do you carry your iPad like you already are, knowing you can still at least do triage if needed because you have a perfectly configured Gitpod to use?
laughs in "Here's a VDI with 2vCPUs and 32GB of RAM but the cluster is overloaded, also you get to budget which IDEs you have installed because you have only a couple hundred GB of storage for everything including what we install on the base image that you will never use"
Sometimes I don't even use virtual envs when developing locally in Python. I just install everything that I need with pip --user and be done with it. Never had any conflicts with system packages whatsoever. If I somehow break my --user environment, I simply delete it and start again. Never had any major version mismatch in dependencies between my machine and what was running in production. At least not anything that would impact the actual task that I was working on.
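Concretely, the workflow is just (paths assume CPython on Linux; adjust for your version and platform):

    python -m pip install --user -r requirements.txt   # lands in ~/.local, not system site-packages
    # nuke-and-restart when something breaks:
    rm -rf ~/.local/lib/python3.12/site-packages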
I'm not recommending this as a best practice. I just believe that we, as developers, end up creating some myths to ourselves of what works and what doesn't. It's good to re-evaluate these beliefs now and then.
When doing this re-evaluation, please consider that others might be quietly working very hard to discover and recreate locally whatever secret sauce you and production share.
The only time I've had version issues running Python code is when someone prior was referencing a deprecated library API or using an obscure package that shouldn't see the light of day in a long-lived project.
If you stick to the tried-and-true libs and update your function kwargs or method names when you get deprecation warnings, you get pretty rock-steady reproducibility even with an un-versioned "python -m pip install -r requirements.txt".
I could also be a slob or just not working at the bleeding edge of python lib deployment tho so take it with a grain of salt.
Yeah, I know. But then you have to make sure that your IDE is using the correct environment, that the notebook is using the correct environment, that the debugger is using the correct environment.
It's trivial to setup a venv, but sometimes it's just not worth it for me.
This is one of the main reasons I tell people not to use VSCode. The people most likely to use it are juniors and people new to Python specifically, and they're the most likely to fall victim to "but my 'IDE' says it's running 3.8 with everything installed, but when I run it from my terminal it's a different Python 3.8".
I watched it last week, with 4 (I hope junior) devs in a "pair programming" session that forced me to figure out how VSCode does virtual envs. I still had to tell them like 3 times: "stop opening a damn new terminal, it's obviously not set up with our Python version; run the command inside the one that has the virtual env activated".
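For the record, the two commands in question - and VSCode will normally auto-activate the environment in new terminals once the interpreter is pointed at .venv, which is the part they kept missing:

    python -m venv .venv
    source .venv/bin/activate   # must be re-run in each plain new terminal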
Weird, in my experience vscode makes it very clear by making you explicitly choose a .venv when running or debugging.
When it comes to opening a new terminal, you would have the exact same problem by... running commands in a terminal. Can't see how VSCode-related that is.
> This is not a story of whether or not to use Kubernetes for production workloads - that's a whole separate conversation. As is the topic of how to build a comprehensive soup-to-nuts developer experience for shipping applications on Kubernetes.
> This is the story of how (not) to build development environments in the cloud.
I'd like to request that the comment thread not turn into a bunch of generic k8s complaints. This is a legitimately interesting article about complicated engineering trade-offs faced by an organization with a very unique workload. Let's talk about that instead of talking about the title!
Agreed. It's actually a very interesting use case and I can easily see that K8s wouldn't be the answer. My dev env is very definitely my "pet", thank you very much!
It'd be nice to editorialize the title a bit with "... (for dev envs)" for clarity.
Super useful negative example, and impressive the lengths they went to trying to make it fit! And no knock on the initial choice or impressive engineering, as many of the k8s problems they hit likely weren't understood gaps at the time they chose k8s.
Which makes sense, given k8s's roots in (a) not being a security isolation tool and (b) targeting up-front configurability over runtime flexibility.
Neither of which mesh well with the co-hosted dev environment use case.
Can someone clarify whether they mean their own development environments, or a service they sell that's related to development environments?
Because I don't understand most of the article if it's the former. How are things like performance a concern for internal development environments? And why are so many things stateful - ideally there should be some kind of configuration/secret management solution so that deployments are consistent.
If it's the latter, then this is incredibly niche and maybe interesting, but unlikely to be applicable to anyone else.
> This is not a story of whether or not to use Kubernetes for production workloads - that's a whole separate conversation. As is the topic of how to build a comprehensive soup-to-nuts developer experience for shipping applications on Kubernetes.
> This is the story of how (not) to build development environments in the cloud.
The article does a great job of explaining the challenges they ran into with Kubernetes, and some of the things they tried... but I feel like it drops the ball at the end by not telling us at least a little about what they chose instead. The article mentions they call their new solution "Gitpod Flex", but there is nothing about what Gitpod Flex is. They said they tried microVMs and decided against them, and of course Kubernetes, the focus of the article. So is Gitpod Flex based on full VMs? Docker? Some other container runtime?
Perhaps a followup article will go into detail about their replacement.
Yeah, that's fair. The blog was getting quite long, so we need to do some deeper dives in follow-ups.
Gitpod Flex is runner-based. The runner interface is intentionally generic so that we can support different clouds, on-prem or just Linux in future.
The first implemented runner is built around AWS primitives like EC2, EBS and ECS. But because of the more generic interface Gitpod now supports local / desktop environments on MacOS. And again, future OS support will come.
There’s a bit more information in the docs, but we will do some follow ups!
Awesome, looking forward to hearing more. I only recently began testing out Theia and OpenVSCodeServer, I really appreciate Gitpod's contributions to open source!
And that they're desperate to tell customers that they've fixed their problems.
Kubernetes is absolutely the wrong tool for this use case, and I argue that this should be obvious to someone in a CTO-level position, or their immediate advisors.
Kubernetes excels as a microservices platform, running reasonably trustworthy workloads. The key features of Kubernetes are rollout (highly available upgrades), elasticity (horizontal scaleout), bin packing (resource limits), CSI (dynamically mounted block storage), and so on. All this relates to a highly dynamic environment.
This is not at all what Gitpod needs. They need high performance disks, ballooning memory, live migrations, and isolated workloads.
Kubernetes does not provide you sufficient security boundaries for untrusted workloads. You need virtualization for that, and ideally physically separate machines.
Another major mistake they made was trying to build this on public cloud infrastructure. Of course the performance will be ridiculous.
However, one major reason for using Kubernetes is sharing the GPU. That is, to my knowledge, not possible with virtualization. But again, do you want to risk sharing your data, on a shared GPU?
I agree on the cloud thing. I don't agree that "high performance disks, ballooning memory, live migrations, and isolated workloads" precludes using k8s - you can still run it as the base layer. You get some central configuration storage, machine management, and some other niceties for free, and you can push your VM-specific features into your application pod. In fact, that's how Google Cloud is designed (except they use Borg, not k8s, but same idea).
True! I love the idea of using K8s to orchestrate the running of VMs. With graceful shutdown and distributed storage, it makes it even more trivial to semi-live migrate VMs.
Are you aware of the limits? It must run as root and privileged?
In this scenario k8s is orchestrating the hypervisor, not the VMs themselves. The hypervisor then orchestrates VMs + network (e.g. OVS) + other supporting functions (log shipping, etc.) on each individual "worker" node. The VM scheduling/migration component needs to be completely decoupled from the k8s apiserver (though it can itself still run as a normal k8s deployment), because scaling the kube API with unbounded users is challenging. And yes, the hypervisor will need to run privileged, but you can limit it to worker nodes only.
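A rough sketch of what that privileged, worker-only hypervisor agent could look like (names, image, and node label are all made up):

    apiVersion: apps/v1
    kind: DaemonSet
    metadata:
      name: hypervisor-agent
    spec:
      selector: { matchLabels: { app: hypervisor-agent } }
      template:
        metadata: { labels: { app: hypervisor-agent } }
        spec:
          nodeSelector:
            node-role/vm-worker: "true"   # privileged pods confined to VM workers
          containers:
          - name: agent
            image: example.com/hypervisor-agent:latest
            securityContext:
              privileged: true            # needed for /dev/kvm and host network setup
            volumeMounts:
            - { name: kvm, mountPath: /dev/kvm }
          volumes:
          - name: kvm
            hostPath: { path: /dev/kvm }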
There are things that public cloud is great for.
Cost efficiency at high performance is not it.
For Gitpod, performance is critical to their product offering, because any latency in a dev environment is terrible UX.
Example: What performance do you get out of your NVMe disks?
Because these days you can build storage that delivers 100-200 GB/s.
For anything stateful, monolithic, or that doesn't require autoscaling, I find LXC more appropriate:
- it can be clusterized (LXD/Incus), like K8S but unlike Compose
- it exposes some tooling to the data plane, especially a load balancer, like K8S
- it offers system instances with a complete distribution and an init system, like a VM but unlike a Docker container
- it can orchestrate both VMs (including Windows VMs) and LXC containers at the same time in the same cluster
- LXC containers have the same performance as Docker containers unlike a VM
- it uses a declarative syntax
- it can be used as a foundation layer for anything stateful or stateless, including the Kubernetes cluster
LXD/Incus sits somewhere between Docker Swarm and a vCenter cluster, which makes it one of the most versatile platforms. Nomad is also a nice contender; it cannot orchestrate LXC containers but can autoscale a variety of workloads, including Java apps and qemu VMs.
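The day-to-day workflow is pleasantly boring (image aliases vary by remote; instance names are made up):

    incus launch images:debian/12 dev1             # system container with a full init
    incus config set dev1 limits.cpu=4 limits.memory=8GiB
    incus exec dev1 -- bash
    incus launch images:ubuntu/24.04 vm1 --vm      # same CLI, but a qemu VM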
I too am rallying quickly to the Incus way of doing things. Also of note, there's an effort to build a utility to write Compose manifests for Incus workloads that I'm following very closely. https://github.com/bketelsen/incus-compose
I feel like anyone who was building a CI solution to sell to others and chose kubernetes didn't really understand the problem.
You're running hot pods for crypto miners and against people who really want to see the rest of the code that box has ever seen. You should be isolating with something purpose built like firecracker, and do your own dispatch & shred for security.
Firecracker is more comparable to container runtimes than to orchestrators such as K8s. You still need an orchestrator to schedule, manage and garbage-collect all your uVMs on top of your infrastructure exactly like you would do with containers via k8s. In other words, you will probably have to either use k8s or build your own k8s to run "supervisor" containers/processes that launch uVMs which in turn launch the customer dev containers.
For sure, but that's the point - containers aren't really good for an adversarial CI solution. You can run that shit in house on kubernetes on a VM in a simulated VR if you want. But if you have adversarial builds, you have a) builds that may well need close to root, and b) customers who may well want to break your shit. Containers are not the right solution for that, VM's get you mostly there, and the right answer is burning bare metal instances with fire after every change-of-tenant - but nobody does that (anymore), because VM's are close enough and it's faster to zero out a virtual disk than a real one.
So if you started with kubernetes and fought the whole process of why it's not a great solution to the problem, I have to assume you didn't understand the problem. I :heart: kubernetes, its complexity pays my bills - but it's barely a good CI solution when you trust everyone involved, it's definitely not a good one where you're trying to be general-purpose to everyone with a makefile.
I would argue that dev containers are more complicated than CI, even though they share many of the challenges (e.g. devcontainers might need to load 10s or 100s of GBs to start, and are write-heavy). I would also argue that userns/rootless containers provide "enough" isolation of CPU/memory/networking, as well as of access to the host's syscalls, if you're careful enough. However, when it comes to storage (e.g. the max disk size a container can use and write to, max open files, completely hiding the host's fs from the container's, etc...), it's unfortunately still extremely limited and fs-dependent for some features, even though modern solutions (e.g. vDPA and ublk) can be used to fix that and virtualize storage for containers.
I do agree with the points in article that k8s is not a good fit for development environments.
In my opinion, k8s is great for stable and consistent deployment/orchestration of applications. Dev environments by default are in a constant state of flux.
I don't understand the need for "cloud development environments" though. Isn't the point of containerized apps to avoid the need for synchronizing dev envs amongst teams?
Or maybe this product is supposed to decrease onboarding friction?
It's to ensure a consistent environment for all developers, with the resources required. E.g. they mention GPUs, for developers working with GPU-intensive workloads. You can ship all developers gaming laptops with 64GB RAM and proper GPUs and have them fight the environment to get the same libraries you have in prod (even with containers that's not trivial), or you can ship them MacBook Airs and similar and have them run consistent (the same) dev environments remotely (you can self-host Gitpod; it's not only a cloud service, it's more the API/environment to get consistent remote dev environments).
Yeah, exactly. Containers locally are a basic foundation. But usually those containers or services need to talk to one another; they need some form of auth and credentials, and they need some networking setup. There's a lot of configuration in all of that. The more devs swap projects, or the more complex the thing you're working on, the more the challenge grows. Automating dependencies, secret access, ensuring projects have the right memory, CPU, GPU, etc. Also security - moving source code off your laptops and devices and standardizing your setups helps if you need to do a lot of audit and compliance work, as you can automate it.
In my experience, the case where this becomes really valuable is when your team needs access to either different kinds of hardware or really expensive hardware that changes relatively quickly (i.e. GPUs). At a previous small startup I set up https://devpod.sh/ (similar to Gitpod) for our MLE/data team. It was a big pro to leverage our existing k8s setup with little configuration needed to get these developer envs up and running as needed, and we could piggyback off our existing cost-tracking tooling to measure usage. But I do feel like we already had infra conducive to running dev envs on k8s before making this decision: we had cost-tracking tooling, we had a dedicated k8s cluster for tooling, we had already been supporting GPU-based workloads in k8s, and our platform team that managed all the k8s infra were also the SMEs for anything dev-env related. In a world where we started fresh and absolutely needed ephemeral dev envs, I think the native devcontainer functionality in VSCode or something like GitHub Codespaces would have been our go-to, but even then I'd push for a docker-compose-based workflow before touching any of these other tools.
The rest of our eng team just did dev on their laptops though. I do think there was a level of batteries-included-ness that came with the ephemeral dev envs, which our less technical data scientists appreciated but the rest of our developers did not. Just my 2c.
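For reference, the native route is just a .devcontainer/devcontainer.json; the image and feature below are illustrative choices, not a recommendation:

    {
      "name": "ml-dev",
      "image": "mcr.microsoft.com/devcontainers/python:3.12",
      "features": {
        "ghcr.io/devcontainers/features/docker-in-docker:2": {}
      },
      "forwardPorts": [8888],
      "postCreateCommand": "pip install -r requirements.txt"
    }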
Sarcastically: CDEs are one way to move cost from CAPEX (buy your developer a MacBook Pro) to OPEX (a monthly subscription that you only need to pay as long as the dev has not been laid off).
It's also much cheaper to hire contractors and give them a CDE that can be terminated at a moment's notice.
The original k8s paper mentioned that the only use case was a combination of a low-latency and a high-latency workflow, with resource allocation based on that. The generic idea is that you can easily move low-latency work between nodes, and there are no serious repercussions when a high-latency job fails.
Based on this information, it is hard to justify even considering k8s for the problem that Gitpod has.
I've worked on something similar to Gitpod in a slightly different context; it's part of a much bigger personal project related to secure remote access that I've spent a few years building and hope to open source in a few months.

While I agree with many of the points in the article, I just don't understand how using micro-VMs by itself replaces k8s, unless they actually start building their own k8s that orchestrates their micro-VMs (as opposed to containers, in the case of k8s), ending up with basically the same thing - when k8s itself can be used to orchestrate the outer containers that run the micro-VMs that run the dev containers.

Yes, k8s has many challenges when it comes to nesting containers, cgroups, creating rootless containers inside the outer k8s containers, and other stuff such as multi-region scaling. But the biggest challenge I've faced so far isn't related to networkPolicies or cgroups - it's by far storage. Both (lazily) pulling big OCI images, which are extremely unready for dev containers whose sizes are typically in the GBs or tens of GBs, and storage virtualization over the underlying k8s node storage. There are serious attempts to accelerate image pulling (e.g. Nydus), but such solutions would still be needed whether you use micro-VMs or rootless/userns containers to load and run your dev containers.
Phew, it is absolutely true. Building dev environments on k8s becomes wasteful. Adding to the complexity: if you are building a product that is self-hosted on customers' infrastructure, debugging and support also become non-homogeneous and difficult.
What we have seen work, especially when you are building a developer-centric product, is exposing these native issues around network, memory, compute, and storage to engineers - they are more willing to work around them. Abstracting those issues away shifts the responsibility onto the product.
Having said that, I still think k8s is an upgrade when you have a large team.
Our first implementation of brev.dev was built on top of Kubernetes. We were also building a remote dev environment tool at the time. Treating dev environments like cattle turned out to be the wrong assumption: turning Kubernetes into a pet manager was a huge endeavor with a long tail of issues. We rewrote our platform against VMs and were immediately able to provide a better experience. Lots of tradeoffs, but it makes sense for dev envs.
I tried doing a dev environment on Kubernetes, but the fact that you're dealing with a set of containers that can change whenever the base layer changes meant instability in certain cases, which threw me off.
I ended up with a mix of Nix and its VM build system, which is based on qemu. The issue is that it's too tied to NixOS, and all services run in the same place, which forces you to manage ports and other things.
How I wish it could work: a flake that defines certain services, where each service may or may not run in a separate µVM sharing an isolated Linux network layer. Your flake could define your versions and your commands to interact with and manage the lifecycle of those µVMs. As the Nix store can be cached/shared, it can provide fast and reproducible builds after the first build.
Have you tried https://github.com/astro/microvm.nix ?
You can use the same NixOS module for both declarative VMs and imperatively configured and spawned VMs.
We've been using Nix flakes and direnv (https://direnv.net/) for developer environments and NixOS with https://github.com/serokell/deploy-rs for prod/deploys - takes serious digging and time to set up, but excellent experience with it so far.
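The glue between the two is a one-line .envrc at the repo root (assuming nix-direnv is installed), after which the flake's dev shell loads automatically on cd:

    # .envrc
    use flake
    # then, once per checkout:
    #   direnv allow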
Serious time to set up _and_ maintain as the project changes. At least, that was my experience. I really _want_ to have Nix-powered development environments, but I do _not_ want to spend the rest of my career maintaining them because developers refuse to "seriously dig" to understand how it works and why it decided to randomly break when they added a new dependency.
I think this approach works best in small teams where everyone agrees to drink the Nix juice. Otherwise, it's caused nothing but strife in my company.
I’ve been using Nix for the past year and it really feels like the holy grail for stable development environments. Like you said—it takes serious time to set up, but it seems like that’s an unavoidable reality of easily sharable dev envs.
The real reason for this shift is that Kubernetes moved to containerd, which they cannot handle; Docker was much easier. Differential workloads are not the right thing to blame.
Also, there is a long tail of issues to be fixed if you do it with Kubernetes.
Kubernetes does not just give you scaling; it gives you many things: run on any architecture, be close to your deployment, etc.
> A simpler version of this setup is to use a single SSD attached to the node. This approach provides lower IOPS and bandwidth, and still binds the data to individual nodes.
Are you sure SSD is that slow? NVMe devices are so fast that I hardly believe there's any need for RAID 0.
In AWS, IIRC, NVMe drives max out at 2 GB/s - I'm not sure why that's the case. I know there were issues with the PCIe controller being the bottleneck in the past, but I suspect there's more to it than that.
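For reference, the RAID 0 setup the article describes is only a couple of commands (device names vary by instance type; the mount point is illustrative):

    mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/nvme1n1 /dev/nvme2n1
    mkfs.ext4 /dev/md0
    mount /dev/md0 /workspaces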
I was intrigued because the development environment problem is similar to the data scientist one - data gravity, GPU sharing, etc. - but I'm confused about the solution.
Oddly, I left with a funny alternate takeaway: One by one, their clever inhouse tweaks & scheduling preferences were recognized by the community and turned into standard k8s knobs
So I'm back to the original question... What is fundamentally left? It sounds like one part is maintaining a clean container path to simplify a local deploy, which a lot of k8s teams do (ex: most of our enterprise customers prefer our docker compose & AMIs over k8s). But more importantly, something fundamental architecturally about how envs run that k8s cannot do, but they do not identify?
OP here. The Kubernetes community has been fantastic at evolving the platform, and we've greatly enjoyed being in the middle of it. Indeed, many of the things we had to build next to Kubernetes have now become part of k8s itself.
Still, some of the core challenges remain:
- the flexibility Kubernetes affords makes it hard to build and distribute a product with such specific requirements across the broad swath of differently set up Kubernetes installations. Managed Kubernetes services help, but come with their own restrictions (e.g. Kernel versions on GKE).
- state handling and storage remain unsolved. PVCs are not reliable enough, subject to a lot of variance (see point above), and depending on the backing storage have vastly different behaviour. Local disks (which we use to this day) make workspace startup and backup expensive from a resource perspective and hard to predict timing-wise.
- user namespaces have come a long way in Kubernetes, but by themselves are not enough. /proc is still masked, FUSE is still not usable.
- startup times, specifically container pulls and backup restoration, are hard to optimize because they depend on a lot of factors outside of our control (image homogeneity, cluster configuration)
Fundamentally, Kubernetes simply isn't the right choice here. It's possible to make it work, but at some point the ROI of running on Kubernetes simply isn't there.
AFAICT, a lot of that comes down to storage abstractions, which I'll be curious to see the answer on! Pinned localstorage <> cloud native is frustrating.
I sense another big chunk is the fast-secure-start problem that firecracker (noted in the blog post) solves but k8s is not currently equipped for. Our team has been puzzling over that one for a while, and part of our guess is incentives. It's been 5+ years since firecracker came out, so it's likewise been frustrating to see.
> We’ll be posting a lot more about Gitpod Flex architecture in the coming weeks or months. I’d love to invite you on November the 6th to a virtual event where I’ll be giving a demo of Gitpod Flex and I’ll deep-dive into the architecture and security model at length.
> Autoscaler plugins: In June 2022, we switched to using cluster-autoscaler plugins when they were introduced.
Does anyone have any links for cluster-autoscaler plugins? Searching draws a blank, even in the cluster-autoscaler repo itself. Did this concept get ditched/removed?
I was wondering if there's a productivity angle too. Take Ceph vs Rook, for example. If a Ceph cluster needs all the resources on its machines and the cluster manages its own resources, then moving to Rook does not add any features. All the 50K additional lines of code in Rook are there to set up CSIs and statefulsets and whatnot, just to get Ceph working on Kubernetes.
The article is an excellent cautionary tale. Debugging an app in a container is one thing; debugging an app running inside a Kubernetes node is a rabbit hole that demands more hours and expertise.
I can completely relate to anyone abandoning K8s. I'm working with dstack, an open-source alternative to K8s for AI infra [1]. We talk to many people who are frustrated with K8s, especially for GPU and AI workloads.
Kubernetes is awesome but I understand what the article is getting at. K8s was designed for a mostly homogeneous architecture when your platform requirements end with "deploy this service to my cluster" and you don't really care about the specifics of how it's scheduled.
A heterogeneous architecture with multi-tenancy poses some unique challenges because, as mentioned in the article, you get highly inconsistent usage patterns across different services. Also, arbitrary code execution (with sandboxing) can present a significant challenge. For security, you ideally need full isolation between services that belong to different users; this isolation wasn't a primary design goal of Kubernetes.
That said, you can probably still use K8s, but in a different way. For smaller customers, you could co-locate on the same cluster, but for larger customers which have high scalability requirements, you could have a separate K8s cluster for each one. Surely for such customers, it's worth the extra effort.
So in conclusion, I don't think the problems which were identified necessarily warrant abandoning K8s entirely, but maybe just a rethinking of how K8s is used. K8s still provides a lot of value in treating a whole cluster of computers as a single machine, especially if all your architecture is already set up for it. In addition to scheduling/orchestration, K8s offers a lot of very nice-to-have features like performance monitoring, dashboards, aggregated logs, ingress, health checks, ...
The problem with "development environments", like other interactive workloads, is that there is a human at the other end that desires a good interactive experience with every keypress. It's a radically different problem space than what k8s was designed for.
From a resource provider's perspective, the only way to squeeze a margin out of that space would be to reverse engineer 100% of human developer behavior so that you could ~perfectly predict "slack" in the system that could be reallocated to other users. Otherwise it's just a worse DX, as TFA gives examples of. Not a business I'm envious to be in... Just give everyone a dedicated VM or desktop, and make sure there's a batch system for big workloads.
I read this article and I still don't understand what's wrong with Kubernetes for this task. Everything you would do with virtual machines could be done with Kubernetes with very similar results.
I guess the team just wants to rewrite everything; it happens. A manager should prevent that.
I also recently left Kubernetes. It was a huge waste of time and money. I've replaced it with just a series of services on Google Cloud Run, and I use Google's Cloud Run Tasks service for longer-running tasks.
The infrastructure is now incredibly understandable, simple, and cost effective.
Kubernetes cost us over $1 million unnecessarily, in both DevOps time and actual Google Cloud spend, and even worse it cost us time to market. Stay off of Kubernetes as long as you can in your company, unless you are basically forced onto it. You should view it as an unnecessary evil that comes with massive downsides in terms of complexity and cost.
As far as I can tell, there actually is no AWS equivalent to GCP Cloud Run. The closest equivalents I know of are ECS on Fargate, which is more like managed Kubernetes except without Kubernetes compatibility or modern features, or AppRunner, which is closer in concept but also sorely lacking in comparable features.
3.) Could you please indicate the split of that $1 million between DevOps time and unnecessary Google Cloud costs? And were there some outliers (like "oh, our intern didn't add this specific variable, misconfigured the cloud, and wasted $10k on GCloud, oops!"), or was it that bandwidth costs that much more on GCloud? (I don't think the latter is the case, though.)
It is just a bunch of docker containers. Some run in tasks and some run as auto-scaling services. Would probably take a week to switch to AWS as there are equivalent managed services there.
But this is really a spurious concern. I myself used to care about it years ago. But in practice, people rarely switch between cloud providers: the incremental benefits are minor, the providers are nearly equivalent, and there is not much to be gained by moving from one to the other unless politics are involved (e.g. someone high up wants a specific provider).
How does the orchestration work? How do you share storage? How do the docker containers know how to find each other? How does security work?
I feel like Kubernetes' downfall, for me, is the number of "enterprise" features it got convinced into supporting, and enterprise features doing what they do best: turning the simplest of operations into a disaster.
I use managed DBs and Cloud Storage for shared storage. I think that provisioning your own SSDs/HDs to the cloud is indicative of an anti-pattern in your architecture.
> How do the docker containers know how to find each other?
You can configure Cloud Run services to be internal only and not to accept outside connections. Otherwise one can just use JWT or whatever is normal on your routes in your web server.
Yup domain mappings for now. There is some label support in Cloud Run but I haven’t explored it yet. You can also get the automatic domain name for a service via the cloud run tools.
Yeah I definitely want to also avoid a load balancer or gateway or end points as well for cost purposes.
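To make the earlier JWT/internal-only point concrete, here's a minimal sketch (stdlib-only Python, with a hypothetical service URL) of one Cloud Run service calling another: the caller asks the GCP metadata server for a Google-signed ID token and sends it as a Bearer token, which Cloud Run's IAM layer verifies before the request ever reaches your code.

    import urllib.request

    # Hypothetical URL of the receiving, internal-only Cloud Run service.
    AUDIENCE = "https://my-api-abc123-uc.a.run.app"

    # The metadata server mints a Google-signed ID token for the service
    # account that the calling service runs as.
    TOKEN_URL = (
        "http://metadata.google.internal/computeMetadata/v1/"
        "instance/service-accounts/default/identity?audience=" + AUDIENCE
    )

    def fetch_id_token() -> str:
        req = urllib.request.Request(TOKEN_URL, headers={"Metadata-Flavor": "Google"})
        with urllib.request.urlopen(req) as resp:
            return resp.read().decode()

    def call_service(path: str) -> bytes:
        req = urllib.request.Request(
            AUDIENCE + path,
            headers={"Authorization": "Bearer " + fetch_id_token()},
        )
        with urllib.request.urlopen(req) as resp:
            return resp.read()

    print(call_service("/healthz"))

No gateway or load balancer involved: the token check happens at Cloud Run's front door, so an unauthenticated caller never costs you CPU time.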
One of Cloud Run's main advantages is that it's literally just telling it how to run containers. You could run those same containers in OpenFaaS, Lambda, etc relatively easily.
I'd investigate getting a build out to Node.js (looks like you already have this) and then just doing a simple SCP of the build to a VPS. From there, just use a systemd script to handle startup/restart on errors. For logging, something like the Winston package does the trick.
If you want some guidance, shoot me an email (in profile). You can run most stuff for peanuts.
> I'd investigate getting a build out to Node.js (looks like you already have this) and then just doing a simple SCP of the build to a VPS. From there, just use a systemd script to handle startup/restart on errors. For logging, something like the Winston package does the trick.
> If you want some guidance, shoot me an email (in profile). You can run most stuff for peanuts.
I appreciate the offer! But it is not as robust and it is more expensive and misses a lot of benefits.
Back in the 1990s I did FTP my website to a VPS after I graduated from Geocities.
Google Cloud charges based on CPU used. Thus when my servers have no traffic, they cost less than $1/month. If they have traffic, they are still cost effective. https://web3dsurvey.com has about 500,000 hits per month and it costs me $4/month to run both the Remix web server and the Fastify API server. Details here: https://x.com/benhouston3d/status/1840811854911668641
Also it will autoscale under load. Thus when one of my posts was briefly the top story on Hacker News last month, Google Cloud Run added more instances to my server to handle the load. (I do not run my personal site behind a CDN - that costs too much; I prefer to pay $1/month for hosting.)
Also, deploying Docker containers that build on GitHub Actions CI in a few minutes is a great automated experience.
I do also use Google services like Cloud Storage, Firestore, BigQuery etc. And it is easier to just run it on GCP infrastructure for speed.
I also have to version various tools that get installed in the Docker image, like Blender, Chromium, etc. This is the perfect use case for Docker.
I feel this is pretty close to optimal. Fast, cheap, scalable, automated and robust.
There was a recent HN post which showed that they didn't even use Docker - there was some other mechanism, and it was so, so simple. I really enjoyed that article.
Or just https://github.com/mightymoud/sidekick or coolify or dokku or dockify - there are a million such things. Oh, and I just remembered Kamal deploy from DHH, and Docker Swarm IIRC (though people seem to have forgotten Docker Swarm!)
Google employee here. Not the case. Cloud Run doesn't run on Kubernetes. It supports the Knative interface which is an OSS project for Kubernetes-based serverless. But Cloud Run is a fully managed service that sits directly atop Borg (https://cloud.google.com/run/docs/securing/security).
The key is I am not managing Kubernetes and I am not paying for it - it is a fool's errand, and incredibly rarely needed. Who cares what is underneath the simple Cloud Run developer UX? What matters for me is cost, simplicity, speed and understandability. You get that with Cloud Run, and you don't with Kubernetes.
Personally - just let the developer own the machine they use for development.
If you really need consistency for the environment - Let them own the machine, and then give them a stable base VM image, and pay for decent virtualization tooling that they run... on their own machine.
I have seen several attempts to move dev environments to a remote host. They invariably suck.
Yes - that means you need to pay for decent hardware for your devs, it's usually cheaper than remote resources (for a lot of reasons).
Yes - that means you need to support running your stack locally. This is a good constraint (and a place where containers are your friend for consistency).
Yes - that means you need data generation tooling to populate a local env. This can be automated relatively well, and it's something you need with a remote env anyways.
---
The only real downside is data control (ie - the company has less control over how a developer manages assets like source code). I'm my experience, the vast majority of companies should worry less about this - your value as a company isn't your source code in 99.5% of cases, it's the team that executes that source code in production.
If you're in the 0.5% of other cases... you know it and you should be in an air-gapped closed room anyways (and I've worked in those too...)
And the reason they suck is the feedback loop is just too high as compared to running it locally. You have to jump through hoops to debug/troubleshoot your code or any issues that you come across between your code and output of your code. And it's almost impossible to work on things when you have spotty internet. I haven't worked on extremely sensitive data but for PII data from prod to dev, scrubbing is a good practice to follow. This will vary based on the project/team you're on of course.
Aka 'if a developer knew beforehand everything they needed, it wouldn't be development'
We tried this approach at a former company with ~600 engineers at the time.
Trying to boot the full service on a single machine required every single developer in the company installing ~50ish microservices on their machine, for things to work correctly. Became totally intractable.
I guess one can grumble about bad architecture all day but this had to be solved. we had to move to remote development environments which restored everyone’s sanity.
Both FAANG companies I’ve worked at had remote dev environments that were built in house.
Most teams/products I have been involved in, the stack always grows to the point that a dev can no longer test it on their own machine, regardless of how big the machine is. And having a different development machine than production leads to completely predictable and unavoidable problems. Devs need to create the software tooling to make remote dev less painful. I mean, they're devs... making software is kind of their whole thing.
I have used remote dev machines just fine, but my workflow vastly differs from many of my coworkers: terminal-only spacemacs + tmux + mosh. I have a lot of CLI and TUI tools, and I do not use VScode at all. The main GUI app I run is a browser, and that runs locally.
I have worked on developing VMs for other developers that rely on a local IDE such. The main sticking point is syncing and schlepping source code (something my setup avoids because the source code and editor is on the remote machine). I have tried a number of approaches, and I sympathize with the article author. So, in response to "Devs need to create the software tooling to make remote dev less painful. I mean, they're devs... making software is kind of their whole thing." <-- syncing and schlepping source code is by no means a solved problem.
I can also say that, my spacemacs config is very vanilla. Like my phone, I don't want to be messing with it when I want to code. Writing tooling for my editor environment is a sideshow for the work I am trying to finish.
I am hardly a dev, but occasionally I have had to do some scripting or web stuff, and I have really loved VSCode and its remote SSH support - it basically feels like coding locally. Does that not work for your devs?
Me as well, especially in the days when there was only a UNIX dev server for everyone.
It was never an issue to use X Windows on them, with Hummingbird on my Windows thin client.
I guess a new generation has to learn the ways of timesharing development.
We have a project which spawns around 80 Docker containers and runs pretty OK on a 5 year old Dell laptop with 16GB RAM. The fans run crazy and the laptop is always very hot but I haven't noticed considerable lags, even with IntelliJ running. Most services are written in Go though and are pretty lightweight.
"Most teams/products I have been involved in, the stack always grows to the point that a dev can no longer test it on their own machine"
Isn't this problem solved by CICD? When the developer is ready to test, they make a commit, and the pipeline deploys the code to a dev/test environment. That's how my teams have been doing it.
This turns a 1 hour task into a 1 day task. Fast feedback cycles are critical to software development.
I don't quite understand how people get into the situation where their work can't fit on their workstation. I've worked on huge projects at huge tech companies, and I could run everything on my workstation. I've worked at startups where the CI situation was passing 5% of the time and required 3 hours to run, that you can now run on your workstation in seconds. What you do is fix the stuff that doesn't fit.
The most insidious source of slowness I've encountered is tests that use test databases set to fsync = on. This severely limits parallelism and speed in a way that's difficult to diagnose; you have plenty of CPU and memory available, but the tests just aren't going very fast. (I don't remember how I stumbled upon this insight. I think I must have straced Postgres and been like "ohhhhhhhhh, of course".)
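If you want to check your own test suite for this, here is a rough sketch of booting a throwaway Postgres with durability turned off (assumes initdb/pg_ctl on PATH; port selection and cleanup are simplified):

    import atexit
    import subprocess
    import tempfile

    # A disposable cluster for tests. fsync=off trades away crash safety --
    # which a throwaway test database does not need -- for much faster commits.
    datadir = tempfile.mkdtemp(prefix="pgtest-")
    subprocess.run(["initdb", "-D", datadir, "-U", "postgres"], check=True)

    server_opts = (
        "-c fsync=off -c synchronous_commit=off "
        "-c full_page_writes=off -c port=54329"
    )
    subprocess.run(
        ["pg_ctl", "-D", datadir, "-o", server_opts,
         "-l", datadir + "/log", "-w", "start"],
        check=True,
    )
    # An "immediate" (crash-style) stop is fine; nobody needs this data again.
    atexit.register(
        lambda: subprocess.run(["pg_ctl", "-D", datadir, "-m", "immediate", "stop"])
    )

Comparing your suite's wall-clock time against a cluster like this is a quick way to find out whether fsync is your hidden bottleneck.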
It's likely you haven't come across these use cases in your professional career, but I assure you it's very common. My entire career has only seen projects where you need dozens to hundreds of CPUs in order to have a short feedback loop to verify the system works. I saw this in everything from simple algorithms in automotive, to Advanced Driver Assistance Systems and machine learning applications.
When you are working on a software project that has 1,000 active developers checking in code daily and require a stable system build you need lots of compute.
> the stack always grows to the point that a dev can no longer test it on their own machine
So the solution here is to not have that kind of "stack".
I mean, if it's all so big and complex that it can't be run on a laptop, then you've almost certainly got a lot of problems regardless. What typically happens is tons of interconnected services without clear abstractions or interfaces, and no one really understands this spaghetti mess, and people just keep piling crap on top of it.
This leads to all sorts of problems. Everywhere I've seen this happen they had real problems running stuff in production too, because it was a complex spaghetti mess. The abstracted "easy" dev env (in whatever form that came) is then also incredibly complex, finicky, and brittle. Never mind running tests, which is typically even worse. It's not uncommon for it all to be broken for every other new person who joins, because changes somewhere broke the setup steps which are only run for new people. Everyone else is afraid to do anything with their machine "because it now works".
There are some exceptions where you really need a big beefy machine for a dev env and tests, maybe, but they're few and far between.
> So the solution here is to not have that kind of "stack".
Reminds me of my favorite debugging technique. It's super fast: don't write any bugs!
Really? I can't imagine not running the code locally. Honestly, my company has a microservices architecture, and I will just comment out the docker-compose pieces that I am not using. If I am developing/testing a particular component, then I will enable it.
How tightly coupled are these systems?
In my last role as a director of engineering at a startup, I found that a project `flake.nix` file (coupled with simply asking people to use https://determinate.systems/posts/determinate-nix-installer/ to install Nix) led to the fastest "new-hire-to-able-to-contribute" time of anything I've seen.
Unfortunately, after a few hires (hand-picked by me), this is what happened:
1) People didn't want to learn Nix, neither did they want to ask me how to make something work with Nix, neither did they tell me they didn't want to learn Nix. In essence, I told them to set the project up with it, which they'd do (and which would be successful, at least initially), but forgot that I also had to sell them on it. In one case, a developer spent all weekend (of HIS time) uninstalling Nix and making things work using the "usual crap" (as I would call it), all because of an issue I could have fixed in probably 5 minutes if he had just reached out to me (which he did not, to my chagrin). The first time I heard them comment their true feelings on it was when I pushed back regarding this because I would have gladly helped... I've mentioned this on various Slacks to get feedback and people have basically said "you either insist on it and say it's the only supported developer-environment-defining framework, or you will lose control over it" /shrug
2) Developers really like to have control over their own machines (but I failed to assume they'd also want this control over the project dependencies, since, after all, I was the one who decided to control mine with the flake.nix in the first place!)
3) At a startup, execution is everything and time is possibly too short (especially if you have kids) to learn new things that aren't simple, even if better... that unfortunately may include Nix.
4) Nix would also be perfect for deployments... except that there is no (to my knowledge) general-purpose, broadly-accepted way to deploy via Nix, except to convert it to a Docker image and deploy that, which (almost) defeats most of the purpose of Nix.
I still believe in Nix but actually trying to use it to "perfectly control" a team's project dependencies (which I will insist it does do, pretty much, better than anything else) has been a mixed bag. And I will still insist that for every 5 minutes spent wrestling with Nix trying to get it to do what you need it to do, you are saving at least an order of magnitude more time spent debugging non-deterministic dependency issues that (as it turns out) were only "accidentally" working in the first place.
After my personal 2-year experiment with NixOS, I'd avoid anything Nix like the plague, and would be looking for a new job if anyone instituted a Nix-only policy.
It's not the learning new things that's a problem, but rather the fact that every little issue turns into a 2-day marathon that's eventually solved with a 1-line fix. And that's because the feedback loop and general UX is just awful - I really started to feel like I needed a sacrificial chicken.
Docker may be a dumpster fire, but at least it's generally easy to see what you did wrong and fix it.
People just straight-up don’t want to learn. There are always exceptions, of course, but IME the majority of people in tech are incurious. They want to do their job, and get paid. Reading man pages is sadly not in that list.
> They want to do their job, and get paid
Where "job" is defined in a narrowest way possible to assume minimum responsibility. Still want to get 200k+ salaries though...
This may sound extreme (it really isn't), but as Director of Engineering, the parent's job was to suss those folks out as early as possible and part ways (the kind where they go work for someone else). Some folks are completely irrational about their setups, and no amount of appeasement in the form of "whys" and training is usually sufficient.
This has always made me sad, but I think you're right in a lot of cases. What I've always tried to do is to focus on basic productivity; make sure everyone has everything they need to do their work, and that most people do it in the same way, so you can make progress on the learning journey together. Whenever people ask me for help and want to set up a meeting (not just "please answer this on Slack and I'll leave you alone"), I record the meeting, try to touch on all the related areas of their problem, and then review the recording for things that would be interesting to write about. If any of the digressions are interesting, I go into Notion, create a new page, and write up a couple of paragraphs. Then I give my team "ever wonder what dynamic linking is and how to debug it?" and they can read it and know as much as I know.
I really, really struggle to deal with the fact that people don't know as much as I do (I wrote my first program when I was 4 and I'm 39 now), but I have accepted that it's not a weakness on their part, it's a weakness on my part. I wouldn't lower my standards (as a manager once suggested), but I do feel like it's my obligation to lead them on a journey of learning. That is to say, people don't learn without teaching, so be a teacher.
I think this, or something of equal complexity, is probably the right choice. I have spent a lot of time helping people with their dev environments, and the same problems keep coming up; "no, you need this version of kubectl", "no, you need this version of jq", "no, the Makefile expects THIS version of The Silver Searcher". A mass of shell scripts and random utilities was a consistent drag on the entire team and everyone that interacted with the team.
I ended up going with Bazel, not because of this particular problem alone (though it was part of it; people we hired spent WEEKS trying to get a happy edit/test/debug cycle going), but because proper dependency-based test caching was sorely needed. Using Bazel and Buildbuddy brought CI down from about 17 minutes per run to 3-4 minutes for a typical change, which meant that even if people didn't want to get a local setup going, they could at least be slightly productive. I also made sure that every dependency / tool useful for developing the product was versioned in the repository, so if something needs `psql` you can `bazel run //tools/postgres/psql` and have it just work. (Hate that Postgres can't be statically linked, though.)
It was a lot of work for me, and people do gripe about some things ("I liked `go test ./...`, I can't adjust to `bazel test ...`"), but all in all, it does work well. I would do it again. Day 1 at the company; git clone our thing, install bazelisk, and your environment setup is done. All the tests pass. You can run the app locally with a simple `bazel run`. I'm pretty happy with the outcome.
Nix is something I looked into for our container images, but they just end up being too big. I never figured out why; I think a lot of things are dynamically linked and they include their own /usr/lib tree with the entire transitive dependency chain for that particular app, even if other things you have installed have some overlap with that dependency chain. I prefer the approach of statically linking everything and only including what you need. I compromised by basing things on Debian and rules_distroless, which at least lets you build a container image with the exact same sha256 on two different machines. (We previously just did "FROM scratch; COPY <statically linked binary> /app; ENTRYPOINT /app", but then started needing things like pg_dump in our image. If you can just have a single statically-linked binary be your entire app, great. Sometimes you can't, and then you need some sort of reasonable solution. Also everything ends up growing a dependency on ca-certificates...)
From my perspective, installing Nix seems pretty invasive. I can understand if someone doesn't want to mess with their system "unnecessarily", especially if the tool and its workings are foreign. And I can't really remember the last time I had issues with non-deterministic dependencies either. Dependency versions are locked. Maybe I'm missing something?
Typically when you start a new dev job the company will provide you with a pre-provisioned laptop that has their security stuff setup and maybe dev tools already installed, eg source code, compilers, VMs, Nix, and a supported editor, so it's not exactly a personal machine that they're messing with.
In a worse world, worse is better.
I think if you take about 80% of your comment and replace "Nix" with "Haskell/Lisp" and a few other techs, you'd basically have the same thing. Especially point #1.
Too true. I think there's a lot of people who don't want control; freedom is responsibility, as the saying goes, and responsibility can be stressful, even if it's liberating also.
Worse is better, sadly.
Try Devbox, you can basically ignore nix entirely and reap all the benefits.
Heh, yeah. You gotta put in writing that only userlands defined in Nix will be eligible to enter any environment beyond "dev". And (also put in writing) that their performance in the role will be partly evaluated on their ability to reach out for help with Nix when they need it.
OP here. There definitely is a place for running things on your local machine. Exactly as you say: one can get a great deal of consistency using VMs.
One of the benefits of moving away from Kubernetes, to a runner-based architecture, is that we can now seamlessly support cloud-based and local environments (https://www.gitpod.io/blog/introducing-gitpod-desktop).
What's really nice about this is that with this kind of integration there's very little difference in setting up a dev env in the cloud or locally. The behaviour and qualities of those environments can differ vastly though (network bandwidth, latency, GPU, RAM, CPUs, ARM/x86).
> The behaviour and qualities of those environments can differ vastly though (network bandwidth, latency, GPU, RAM, CPUs, ARM/x86).
For example, when you're running on your local machine you've actually got the amount of RAM and CPU advertised :)
"Hm, why does my Go service on a pod with 2.2 cpu's think it has 6k? Oh, it thinks it has the whole cluster. Nice; that is why scheduling has been an issue"
Hi Christian. We just deployed Gitpod EKS at our company in NY. Can we get some details on the replacement architecture? I’m sure it’s great but the devil is always in the details.
Hello. Currently debugging my kubernetes-based dev pod and not getting anything else done. What fun!
I’m not sure we should leap from:
> I have seen several attempts to move dev environments to a remote host. They invariably suck.
To “therefore they will always suck and have no benefits and nobody should ever use them ever”. Apologies for the hyperbole but I’m making a point that comments like these tend to shut down interesting explorations of the state of the art of remote computing and what the pros/cons are.
Edit: In a world where users demand that companies implement excellent security then we must allow those same companies to limit physical access to their machines as much as possible.
But they don't suck because of lack of effort - they suck because there are real physical constraints.
Ex - even on a VERY good connection, RTT on the network is going to exceed your frame latency for a computer sitting in front of you (before we even get into the latency of the actual frame rendering of that remote computer). There's just not a solution for "make the light go faster".
Then we get into the issues the author actually laid out quite compellingly - Shared resources are unpredictable. Is my code running slowly right now because I just introduced an issue, or is it because I'm sharing an env and my neighbor just ate 99% of the CPU/IO, or my network provider has picked a different route and my latency just went up 500ms?
And that's before we even touch the "My machine is down/unreachable, I don't know why and I have no visibility into resolving the issue, when was my last commit again?" style problems...
> Edit: In a world where users demand that companies implement excellent security then we must allow those same companies to limit physical access to their machines as much as possible.
And this... is just bogus. We're not talking about machines running production data. We're talking about a developer environment. Sure - limit access to prod machines all you like; while you're at it, don't give me any production user data either - I sure as hell don't want it for local dev. What I do want is a fast system that I control, so that I can actually tweak it as needed to develop and debug the system. It is almost impossible to give a developer "the least access needed" to do development locally, because if you knew what that access was, you wouldn't still be developing.
> But they don't suck because of lack of effort - they suck because there are real physical constraints.
They do suck due to lack of effort or investment. FAANG companies have remote dev experiences that are decent - or even great - because they invest obscene amounts into dev tooling.
There are physical constraints on the flip side too: especially for gigantic codebases or datasets that don't fit on dev laptops, or that need lower latencies to other services in the DC.
Added bonus: smaller attack surface area for adversaries who want to gain access to your code.
It isn't just the tooling though.
At least with Google, they also have a data center near where most developers work, so that they have much lower latency.
They can't make the light go faster, but they can make it so it doesn't go as far. Smaller companies usually don't have a lot of flexibility with that though.
> Personally - just let the developer own the machine they use for development.
Overall I agree with you that this is how it should be, but as DevOps working with so many development teams, I can tell you that too many developers know a language or two but beyond that barely know how to use a computer. Most developers (yes, even most of the ones in Silicon Valley or the larger Bay Area) with MacBooks will smile and nod when you tell them that Docker Desktop runs a virtual machine running a copy of Linux to run OCI images, and then not too much later reveal themselves to have been clueless.
Commenters on this site are generally expected to be in a different category. Just wanted to share that, as a seasoned DevOps pro, I can tell you it's pretty rough out there.
This is an unpopular take, but entirely true. Skill with a programming language, other than maybe C, does not in any way translate to general skill with system administration, or even knowing how to correctly operate a computer. I once had to explain to a dev that their Mac was out of disk space because a) they had never removed dangling containers or old image versions, and b) they had never emptied the Trash.
I think nowadays source code is rarely a more valuable asset than the data being processed. Also, I would prefer to give my devs a second machine to run workloads, and eventually pull in data or mock it, so they get moving more easily.
> The only real downside is data control (ie - the company has less control over how a developer manages assets like source code). I'm my experience, the vast majority of companies should worry less about this [...]
I once had to burn a ton of political capital (including some on credit), because someone who didn't understand software thought that cutting-edge tech startup software developers, even including systems programmers working close to metal, could work effectively using only virtual remote desktops... with a terrible VM configuration... from servers literally halfway around the world... through a very dodgy firewall and VPN... of 10Mb/s total bandwidth... for the entire office of dozens of developers.
(And no other Internet access from the VMs. Administrators would copy whatever files from the Internet that are needed for work. And there was a bureaucratic form for a human process, if you wanted to request any code/data to go in or out. And the laptops/workstations used only as thin-clients for the remote VMs would have to be Windows and run this ridiculous obscure 'endpoint security' software that had changed hands from its ancient developer, and hadn't even updated the marketing materials (e.g., a top bulletpoint was keeping your employees from wasting time on a Web site that famously was wiped out over a decade earlier), and presumably was littered with introduced vulnerabilities and instabilities.)
Note that this was not something like DoD, nor HIPAA, nor finance. Just cutting-edge tech on which (ironically) we wanted first-mover advantage.
This escalated to the other top-titled software engineer and me together giving a presentation to the C-suite, on why not only would this kill productivity (especially in a startup that needed to do creative work fast!), but the bad actors someone was paranoid about could easily circumvent it anyway to exfiltrate data (using methods obvious to the skilled software people they hired, some undetectable by any security product or even human monitoring they imagined), and all the good rule-following people would quit in incredulous frustration.
Unfortunately, it might not have been even the CEO's call, but a crazy investor.
Sounds like you are not using a lot of hardware - RFID, POS, top-spec video cards, etc.
If your app fits on one machine, I agree with you: you absolutely should not use cloud dev environments in my opinion (and I've worked on large dev infra teams, that shipped cloud dev environments). The performance and latency of a Macbook Pro (or Framework 13, or whatever) is going to destroy cloud perf for development purposes.
If it doesn't fit on one machine, though, you don't have another option: Meta, for example, will never have a local dev env for Instagram or Blue. Then you need to make some hard choices.
Personally, my ideal cloud dev env is:
1. Local checkout of the code you're working on. You can use whatever IDE or text editor you prefer. For large monorepos, you'll need some special tooling to make sure it's easy to only check out slices of the repo.
2. Sync the code to the remote execution environment automatically, with hot-reloading.
3. Auto-port-forward from your local machine to the remote.
4. Optionally be able to run dependent services on your personal remote to debug/test their interactions with each other, and optionally be able to connect to a well-maintained shared environment for dependencies you aren't working on. If you have a shared environment, it can't be viewed as less-important than production: if it's broken, it's a SEV and the team that broke it needs to drop everything and fix it immediately. (Otherwise the shared env will be broken all the time, and your shipping speed will either drop, or you'll constantly be shipping bugs to prod due to lack of dev care.)
At Meta we didn't have (1): everyone had to use VSCode, with special in-house plugins that synced to the remote environment. It was okay but honestly a little soul-sucking; I think customizing your tooling is part of a lot of people's craft and helps maintain their flow state. Thankfully we had the rest, so it was tolerable if not enjoyable. At Airbnb we didn't have the political will to enforce (4), so the dev env was always broken. I think (4) is actually the most critical part: it doesn't matter how good the rest of it is, if the org doesn't care about it working.
But yeah — if you don't need it, that's a lot of work and politics. Use local environments as long as you possibly can.
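To illustrate item 2: a stdlib-only sketch of the sync half, which polls the local checkout and mirrors it to the remote whenever it changes. (A real tool would use inotify/watchman, respect .gitignore, and debounce events; the host and paths here are hypothetical.)

    import hashlib
    import pathlib
    import subprocess
    import time

    SRC = pathlib.Path("~/src/myapp").expanduser()  # hypothetical local checkout
    DEST = "devbox:/home/me/myapp"                  # hypothetical remote target

    def fingerprint(root: pathlib.Path) -> str:
        # Cheap change detection: hash every file's path, mtime and size.
        h = hashlib.sha256()
        for p in sorted(root.rglob("*")):
            if p.is_file():
                st = p.stat()
                h.update(f"{p}:{st.st_mtime_ns}:{st.st_size}".encode())
        return h.hexdigest()

    last = ""
    while True:
        current = fingerprint(SRC)
        if current != last:
            # --delete keeps the remote an exact mirror of the local tree.
            subprocess.run(
                ["rsync", "-az", "--delete", f"{SRC}/", DEST], check=True
            )
            last = current
        time.sleep(1)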
> Personally - just let the developer own the machine they use for development.
I wonder if Microsoft's approach for Dev Box is the right one.
Could you elaborate on what that approach is?
> Personally - just let the developer own the machine they use for development.
It'll work if the company can offer something similar to EC2. Unfortunately, most companies are not capable of doing so if they are not on the cloud.
I strongly recommend just switching the Dev environment over to Linux and taking advantage of tools like "distrobox" and "toolbx".
https://github.com/89luca89/distrobox
https://containertoolbx.org/
It is sorta like Vagrant, but instead of using virtualbox virtual machines you use podman containers. This way you get to use OCI images for your "dev environment" that integrates directly into your desktop.
https://podman.io/
There are some challenges related to usermode networking for non-root-managed controllers, and desktop integration has some additional complications. But besides that it has almost no overhead, and you can have unfettered access to things like GPUs.
Also it is usually pretty easy to convert your normal docker or kubernetes containers over to something you can run on your desktop.
Also, it is possible to use things like Kubernetes pod definitions to deploy sets of containers with podman and manage them with systemd and such. So you can have "clouds of containers" that your dev container needs access to locally.
If there is a corporate need for Windows-specific applications, then running Windows VMs or doing remote applications over RDP is a possible workaround.
If everything you are targeting as a deployment is going to be Linux-everything then it doesn't make a lot of sense to jump through a bunch of hoops and cause a bunch of headaches just to avoid having it as workstation OS.
If you're doing this, there are many cases where you might as well just spin up a decent Linux server and give your developers accounts on that? With some pretty basic setup everyone can just run their own stuff within their own user account.
You'll run into occasional issues (e.g. if everyone is trying to run default node.js on default port) but with some basic guardrails it feels like it should be OK?
I'm remembering back to when my old company ran a lot of PHP projects. Each user just had their own development environment and their own Apache vhost. They wrote their code and tested it in their own vhost. Then we'd merge to a single separate vhost for further testing.
I am trying to remember anything about what was painful about it but it all basically Just Worked. Everyone had remote access via VPN; the worst case scenario for them was they'd have to work from home with a bit of extra latency.
This.
Distrobox and podman are such a charm to use, and so easily integrated into dev environments and production environments.
The intentional daemon-free concept is so much easier to set up in practice, as there's no fiddly group management necessary anymore.
Just a 5 line systemd service file and that's it. Easy as pie.
That's fine for some. However, it's not always that simple. I wrote an entire site on my iPad in my spare time with Gitpod. Maybe you are at a small company with a small team, so if things get critical you are likely to get a call. Do you say f' it, do you carry your laptop, or do you carry your iPad like you already are - knowing you can still at least do triage if needed, because you have a perfectly configured Gitpod to use?
laughs in "Here's a VDI with 2vCPUs and 32GB of RAM but the cluster is overloaded, also you get to budget which IDEs you have installed because you have only a couple hundred GB of storage for everything including what we install on the base image that you will never use"
Sometimes I don't even use virtual envs when developing locally in Python. I just install everything that I need with pip --user and be done with it. Never had any conflicts with system packages whatsoever. If I somehow break my --user environment, I simply delete it and start again. Never had any major version mismatch in dependencies between my machine and what was running in production. At least not anything that would impact the actual task that I was working on.
I'm not recommending this as a best practice. I just believe that we, as developers, end up creating some myths to ourselves of what works and what doesn't. It's good to re-evaluate these beliefs now and then.
When doing this re-evaluation, please consider that others might be quietly working very hard to discover and recreate locally whatever secret sauce you and production share.
The only time I've had version issues running Python code is when someone prior was referencing a deprecated library API, or using an obscure package that shouldn't see the light of day in a long-lived project.
If you stick to the tried and true libs and change your function kwargs or method names when getting warnings, then I’ve had pretty rock steady reproducibility using even an un-versioned “python -m pip install -r requirements.txt” experience
I could also be a slob or just not working at the bleeding edge of python lib deployment tho so take it with a grain of salt.
I'm not going to second-guess what works for you, but Python makes it so easy to work with an ephemeral environment.
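For instance, the whole throwaway-environment lifecycle fits in a few lines of stdlib (a sketch; "requests" is just a stand-in dependency):

    import os
    import subprocess
    import tempfile
    import venv

    # Create a disposable environment, use it, and forget about it.
    env_dir = tempfile.mkdtemp(prefix="venv-")
    venv.create(env_dir, with_pip=True)

    py = os.path.join(env_dir, "bin", "python")  # Scripts\python.exe on Windows
    subprocess.run([py, "-m", "pip", "install", "requests"], check=True)
    subprocess.run([py, "-c", "import requests; print(requests.__version__)"],
                   check=True)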
Yeah, I know. But then you have to make sure that your IDE is using the correct environment, that the notebook is using the correct environment, that the debugger is using the correct environment.
It's trivial to setup a venv, but sometimes it's just not worth it for me.
This is one of the main reasons I tell people not to use VSCode. The people most likely to use it are juniors and people new to python specifically, and they're the most likely to fall victim to 'but my "IDE" says it's running 3.8 with everything installed, but when I run it from my terminal it's a different python 3.8'
I watched it last week, with 4 (I hope junior) devs in a "pair programming" session that forced me to figure out how VSCode does virtual envs. I still had to tell them like 3 times, "stop opening a damn new terminal; it's obviously not set up with our Python version, run the command inside the one that has the virtual env activated".
Weird, in my experience VSCode makes it very clear by making you explicitly choose a .venv when running or debugging.
When it comes to opening a new terminal, you would have the exact same problem by... running commands in a terminal; I can't see how VSCode-related that is.
> This is not a story of whether or not to use Kubernetes for production workloads that’s a whole separate conversation. As is the topic of how to build a comprehensive soup-to-nuts developer experience for shipping applications on Kubernetes.
> This is the story of how (not) to build development environments in the cloud.
I'd like to request that the comment thread not turn into a bunch of generic k8s complaints. This is a legitimately interesting article about complicated engineering trade-offs faced by an organization with a very unique workload. Let's talk about that instead of talking about the title!
Agreed. It's actually a very interesting use case and I can easily see that K8s wouldn't be the answer. My dev env is very definitely my "pet", thank you very much!
It'd be nice to editorialize the title a bit with "... (for dev envs)" for clarity.
Super useful negative example, and the lengths they pursued to make it fit! And no knock on the initial choice or impressive engineering, as many of the k8s problems they hit likely weren't understood gaps at the time they chose k8s.
Which makes sense, given k8s roots in (a) not being a security isolation tool & (b) targeting up-front configurability over runtime flexibility.
Neither of which mesh well with the co-hosted dev environment use case.
Can someone clarify whether they mean internal development environments, or a service that they sell that's related to development environments?
Because I don't understand most of the article if it's the former. How are things like performance a concern for internal development environments? And why are so many things stateful - ideally there should be some kind of configuration/secret management solution so that deployments are consistent.
If it's the latter, then this is incredibly niche and maybe interesting, but unlikely to be applicable to anyone else.
4th paragraph in if you read the article…
> This is not a story of whether or not to use Kubernetes for production workloads that’s a whole separate conversation. As is the topic of how to build a comprehensive soup-to-nuts developer experience for shipping applications on Kubernetes.
> This is the story of how (not) to build development environments in the cloud.
I'm not sure that this really answers their question.
The article does a great job of explaining the challenges they ran into with Kubernetes, and some of the things they tried... but I feel like it drops the ball at the end by not telling us at least a little about what they chose instead. The article mentions they call their new solution "Gitpod Flex", but there is nothing about what Gitpod Flex is. They said they tried microVMs and decided against them, and of course Kubernetes, the focus of the article. So is Gitpod Flex based on full VMs? Docker? Some other container runtime?
Perhaps a followup article will go into detail about their replacement.
Yeah, that's fair. The blog was getting quite long, so we need to do some deeper dives in follow-ups.
Gitpod Flex is runner-based. The runner interface is intentionally generic so that we can support different clouds, on-prem or just Linux in future.
The first implemented runner is built around AWS primitives like EC2, EBS and ECS. But because of the more generic interface Gitpod now supports local / desktop environments on MacOS. And again, future OS support will come.
There’s a bit more information in the docs, but we will do some follow ups!
- https://www.gitpod.io/docs/flex/runners/aws/setup-aws-runner... - https://www.gitpod.io/docs/flex/gitpod-desktop
(I work at Gitpod)
Echoing the parent you're replying to: you built up all of the context and missed the payoff.
I thought it was fair.
>> We’ll be posting a lot more about Gitpod Flex architecture in the coming weeks or months.
Cramming more detail into this post would have exceeded the average user read time ceiling.
Awesome, looking forward to hearing more. I only recently began testing out Theia and OpenVSCodeServer, I really appreciate Gitpod's contributions to open source!
Still no idea what you did technically... Maybe a second post?
Did you use consul?
Sounds more to me like they need a new CTO.
And that they're desperate to tell customers that they've fixed their problems.
Kubernetes is absolutely the wrong tool for this use case, and I argue that this should be obvious to someone in a CTO-level position, or their immediate advisors.
Kubernetes excels as a microservices platform, running reasonably trustworthy workloads. The key features of Kubernetes are rollout (highly available upgrades), elasticity (horizontal scaleout), bin packing (resource limits), CSI (dynamically mounted block storage), and so on. All this relates to a highly dynamic environment.
This is not at all what Gitpod needs. They need high performance disks, ballooning memory, live migrations, and isolated workloads.
Kubernetes does not provide you sufficient security boundaries for untrusted workloads. You need virtualization for that, and ideally physically separate machines.
Another major mistake they made was trying to build this on public cloud infrastructure. Of course the performance will be ridiculous.
However, one major reason for using Kubernetes is sharing the GPU. That is, to my knowledge, not possible with virtualization. But again, do you want to risk sharing your data, on a shared GPU?
I agree on the cloud thing. I don't agree that "high performance disks, ballooning memory, live migrations, and isolated workloads" preclude using k8s - you can still run it as the base layer. You get central configuration storage, machine management and some other niceties for free, and you can push your VM-specific features into your application pod. In fact, that's how Google Cloud is designed (except they use Borg, not k8s, but it's the same idea).
True! I love the idea of using K8s to orchestrate the running of VMs. With graceful shutdown and distributed storage, it makes it even more trivial to semi-live migrate VMs.
Are you aware of the limits? It must run as root and privileged?
In this scenario k8s is orchestrating the hypervisor, not the VMs themselves. The hypervisor then orchestrates VMs + network (eg OVS) + other supporting functions (log shipping, etc) on each individual "worker" node. The VM scheduling/migration component needs to be completely decoupled from the k8s apiserver (but it can itself still run as a normal k8s deployment), because scaling the kube API with an unbounded number of users is challenging. And yes, the hypervisor will need to run privileged, but you can limit it to worker nodes only.
Why would you say that performance is bad on public cloud infrastructure?
There are things that public cloud is great for. Cost efficiency at high performance is not it. For Gitpod, performance is critical to their product offering, because any latency in a dev environment is terrible UX.
Example: What performance do you get out of your NVMe disks? Because these days you can build storage that delivers 100-200 GB/s.
https://www.graidtech.com/wp-content/uploads/2023/04/Results...
I bet few public cloud customers are seeing that kind of performance.
Kubernetes works great for stateless workloads.
For anything stateful, monolithic, or that doesn't require autoscaling, I find LXC more appropriate:
- it can be clusterized (LXD/Incus), like K8S but unlike Compose
- it exposes some tooling to the data plane, especially a load balancer, like K8S
- it offers system instances with a complete distribution and an init system, like a VM but unlike a Docker container
- it can orchestrate both VMs (including Windows VMs) and LXC containers at the same time in the same cluster
- LXC containers have the same performance as Docker containers unlike a VM
- it uses a declarative syntax
- it can be used as a foundation layer for anything stateful or stateless, including the Kubernetes cluster
LXD/Incus sits somewhere between Docker Swarm and a vCenter cluster, which makes it one of the most versatile platforms. Nomad is also a nice contender; it cannot orchestrate LXC containers, but it can autoscale a variety of workloads, including Java apps and QEMU VMs.
I too am rallying quickly to the Incus way of doing things. Also of note, there's an effort to build a utility to write Compose manifests for Incus workloads that I'm following very closely. https://github.com/bketelsen/incus-compose
I feel like anyone who was building a CI solution to sell to others and chose kubernetes didn't really understand the problem.
You're running hot pods for crypto miners, and you're up against people who really want to see the rest of the code that box has ever seen. You should be isolating with something purpose-built like Firecracker, and doing your own dispatch & shred for security.
Firecracker is more comparable to container runtimes than to orchestrators such as K8s. You still need an orchestrator to schedule, manage and garbage-collect all your uVMs on top of your infrastructure exactly like you would do with containers via k8s. In other words, you will probably have to either use k8s or build your own k8s to run "supervisor" containers/processes that launch uVMs which in turn launch the customer dev containers.
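To make that concrete, here is roughly the bottom of such a stack: driving a single Firecracker microVM over its Unix-socket REST API (a sketch with placeholder paths; everything above this call sequence - scheduling, pooling, networking, garbage collection - is exactly the orchestrator you still have to bring).

    import json
    import socket
    from http.client import HTTPConnection

    class UnixHTTPConnection(HTTPConnection):
        """HTTP over Firecracker's Unix domain API socket."""

        def __init__(self, sock_path: str):
            super().__init__("localhost")
            self.sock_path = sock_path

        def connect(self):
            self.sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
            self.sock.connect(self.sock_path)

    def api_put(conn: HTTPConnection, route: str, body: dict) -> None:
        conn.request("PUT", route, json.dumps(body),
                     {"Content-Type": "application/json"})
        resp = conn.getresponse()
        data = resp.read()  # drain the response so the connection can be reused
        assert resp.status in (200, 204), data

    # Assumes a `firecracker --api-sock /tmp/fc.sock` process is already running;
    # kernel and rootfs paths are placeholders.
    conn = UnixHTTPConnection("/tmp/fc.sock")
    api_put(conn, "/boot-source", {
        "kernel_image_path": "/img/vmlinux",
        "boot_args": "console=ttyS0 reboot=k panic=1",
    })
    api_put(conn, "/drives/rootfs", {
        "drive_id": "rootfs",
        "path_on_host": "/img/rootfs.ext4",
        "is_root_device": True,
        "is_read_only": False,
    })
    api_put(conn, "/actions", {"action_type": "InstanceStart"})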
For sure, but that's the point - containers aren't really good for an adversarial CI solution. You can run that shit in house on Kubernetes on a VM in a simulated VR if you want. But if you have adversarial builds, you have a) builds that may well need close to root, and b) customers who may well want to break your shit. Containers are not the right solution for that; VMs get you mostly there, and the right answer is burning bare metal instances with fire after every change of tenant - but nobody does that (anymore), because VMs are close enough and it's faster to zero out a virtual disk than a real one.
So if you started with kubernetes and fought the whole process of why it's not a great solution to the problem, I have to assume you didn't understand the problem. I :heart: kubernetes, its complexity pays my bills - but it's barely a good CI solution when you trust everyone involved, it's definitely not a good one where you're trying to be general-purpose to everyone with a makefile.
I would argue that dev containers are more complicated than CI, even though they share many of the challenges (e.g. dev containers might need to load 10s or 100s of GBs to start, and are write-heavy). I would also argue that userns/rootless containers provide "enough" isolation when it comes to CPU/memory/networking, as well as access to the host's syscalls if you're careful enough; however, when it comes to storage (e.g. the max disk size a container can use and write to, max opened files, completely hiding the host's fs from the container's, etc...), it's unfortunately still extremely limited, and fs-dependent for some features, even though modern solutions (e.g. vDPA and ublk) can be used to fix that and virtualize storage for containers.
I do agree with the points in article that k8s is not a good fit for development environments.
In my opinion, k8s is great for stable and consistent deployment/orchestration of applications. Dev environments by default are in a constant state of flux.
I don't understand the need for "cloud development environments" though. Isn't the point of containerized apps to avoid the need for synchronizing dev envs amongst teams?
Or maybe this product is supposed to decrease onboarding friction?
It's to ensure a consistent environment for all developers, with the resources required. E.g. they mention GPUs, for developers working with GPU-intensive workloads. You can ship all developers gaming laptops with 64GB RAM and proper GPUs, and have them fight the environment to get the same libraries you have in prod (even with containers that's not trivial), or you can ship them MacBook Airs and similar, and have them run consistent (the same) dev environments remotely. (You can self-host Gitpod; it's not only a cloud service, it's more the API/environment to get consistent remote dev environments.)
Yeah, exactly. Containers locally are a basic foundation. But usually those containers or services need to talk to one another; they need some form of auth and credentials; they need some networking setup. There's a lot of configuration in all of that. The more devs swap projects, or the more complex the thing you're working on, the more the challenge grows. Automating dependencies, secret access, ensuring projects have the right memory, cpu, gpu etc. Also security - moving source code off your laptop and devices and standardizing your setups helps if you need to do a lot of audit and compliance work, as you can automate it.
In my experience, the case where this becomes really valuable is if your team needs access to either different kinds of hardware or really expensive hardware that changes relatively quickly (i.e. GPUs). At a previous small startup, I set up https://devpod.sh/ (similar to Gitpod) for our MLE/data team. It was a big pro to leverage our existing k8s setup, with little configuration needed to get these developer envs up and running as needed, and we could piggyback off our existing cost-tracking tooling to measure usage. But I do feel like we already had infra conducive to running dev envs on k8s before making this decision - we had cost-tracking tooling, we had a dedicated k8s cluster for tooling, we had already been supporting GPU-based workloads in k8s, and our platform team that managed all the k8s infra were also the SMEs for anything dev-env related. In a world where we started fresh and absolutely needed ephemeral dev envs, I think the native devcontainer functionality in VSCode or something like GitHub Codespaces would have been our go-to, but even then I'd push for a docker-compose based workflow prior to touching any of these other tools.
The rest of our eng team just did dev on their laptops though. I do think there was a level of batteries-included-ness that came with the ephemeral dev envs which our less technical data scientists appreciated, but the rest of our developers did not. Just my 2c
Sarcastically, a CDE is one way to move cost from CAPEX (get your developer a MacBook Pro) to OPEX (a monthly subscription that you only need to pay as long as the dev has not been laid off).
It's also much cheaper to hire contractors and give them a CDE that can be terminated on a moment's notice.
We started having a few developers hit constant VSCode timeouts. We switched to GitHub devcontainers, which have been great.
The original k8s paper mentioned that the only use case was a combination of a low-latency and a high-latency workload, with resource allocation based on that. The generic idea is that you can easily move low-latency work between nodes, and there are no serious repercussions when a high-latency job fails.
Based on this, it is hard to justify even considering k8s for the problem that Gitpod has.
I've worked on something similar to Gitpod in a slightly different context, as part of a much bigger personal project related to secure remote access that I've spent a few years building now and hope to open source in a few months. While I agree with many of the points in the article, I just don't understand how using microVMs by itself replaces K8s, unless they actually start building their own K8s that orchestrates their microVMs (as opposed to containers, in the case of k8s) - ending up with basically the same thing, when k8s itself can be used to orchestrate the outer containers that run the microVMs used to run the dev containers. Yes, k8s has many challenges when it comes to nesting containers, cgroups, creating rootless containers inside the outer k8s containers, and other stuff such as multi-region scaling. But the biggest challenge I've faced so far isn't related to networkPolicies or cgroups; it is by far storage - both when it comes to (lazily) pulling big OCI images, which are extremely unready to be used for dev containers whose sizes are typically in the GBs or 10s of GBs, and when it comes to storage virtualization over the underlying k8s node storage. There are serious attempts to accelerate image pulling (e.g. Nydus), but such solutions would still probably be needed whether you use microVMs or rootless/userns containers in order to load and run your dev containers.
Phew, it is absolutely true: building dev environments on k8s becomes wasteful. To add to this complexity, if you are building a product that is self-hosted on the customer's infrastructure, debugging and support also become non-homogeneous and difficult.
What we have seen work, especially when you are building a developer-centric product, is to expose these native issues around network, memory, compute and storage to engineers - they are more willing to work around them. Abstracting those issues away shifts the responsibility onto the product.
Having said that, I still think k8s is an upgrade when you have a large team.
Our first implementation of brev.dev was built on top of Kubernetes. We were also building a remote dev environment tool at the time. Treating dev environments like cattle seemed to be the wrong assumption: turning Kubernetes into a pet manager was a huge endeavor with a long tail of issues. We rewrote our platform against VMs and were immediately able to provide a better experience. Lots of tradeoffs, but it makes sense for dev envs.
I tried doing a dev environment on Kubernetes, but the fact that you have to deal with a set of containers that could change whenever the base layer changed meant instability in certain cases, which threw me off.
I ended up with a mix of Nix and its VM build system, which is based on QEMU. The issue is that it's too tied to NixOS, and all services run in the same place, which forces you to manage ports and other things.
How I wish it could work: a flake that defines certain services, where each service could or could not run in a different µVM, sharing an isolated Linux network layer. Your flake could define your versions and your commands to interact with and manage the lifecycle of those µVMs. As the Nix store can be cached/shared, it can provide fast and reproducible builds after the first build.
Have you tried https://github.com/astro/microvm.nix ? You can use the same NixOS module for both declarative VMs and imperatively configured and spawned VMs.
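A rough sketch of what a microvm.nix flake looks like - option names here are from memory, so double-check against the repo's docs before relying on any of this:

  {
    inputs = {
      nixpkgs.url = "github:NixOS/nixpkgs/nixos-24.05";
      microvm.url = "github:astro/microvm.nix";
    };
    outputs = { self, nixpkgs, microvm }: {
      # A declarative microVM, defined as an ordinary NixOS configuration
      nixosConfigurations.dev-vm = nixpkgs.lib.nixosSystem {
        system = "x86_64-linux";
        modules = [
          microvm.nixosModules.microvm
          {
            microvm = {
              hypervisor = "qemu";   # cloud-hypervisor, firecracker, ... also supported
              vcpu = 2;
              mem = 4096;            # MiB
            };
            services.postgresql.enable = true;  # example service inside the VM
          }
        ];
      };
    };
  }

IIRC the configuration exposes a runner (something like `config.microvm.declaredRunner`) that you can `nix run`; each service in the parent's wishlist would then become one such NixOS configuration on a shared network.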
> the fact you have to be dealing with a set of containers that could change if the base layer changed meant instability
Can you expand on this? Are you talking about containers you create?
We've been using Nix flakes and direnv (https://direnv.net/) for developer environments and NixOS with https://github.com/serokell/deploy-rs for prod/deploys - takes serious digging and time to set up, but excellent experience with it so far.
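For anyone curious, the per-project setup is small: a `.envrc` containing `use flake`, plus a flake exposing a devShell. A minimal sketch (the package choices and env var are just examples):

  {
    inputs.nixpkgs.url = "github:NixOS/nixpkgs/nixos-24.05";
    outputs = { self, nixpkgs }:
      let
        pkgs = nixpkgs.legacyPackages.x86_64-linux;
      in {
        # direnv activates this automatically via `use flake` in .envrc
        devShells.x86_64-linux.default = pkgs.mkShell {
          packages = [ pkgs.nodejs_20 pkgs.postgresql_16 ];
          shellHook = ''
            export DATABASE_URL=postgres://localhost/dev  # example env wiring
          '';
        };
      };
  }

After that, `cd`-ing into the project drops you into the pinned toolchain with no global installs.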
Serious time to set up _and_ maintain as the project changes. At least, that was my experience. I really _want_ to have Nix-powered development environments, but I do _not_ want to spend the rest of my career maintaining them because developers refuse to "seriously dig" to understand how it works and why it decided to randomly break when they added a new dependency.
I think this approach works best in small teams where everyone agrees to drink the Nix juice. Otherwise, it's caused nothing but strife in my company.
This may be the one area where some form of autocracy has merit :-)
I’ve been using Nix for the past year and it really feels like the holy grail for stable development environments. Like you said—it takes serious time to set up, but it seems like that’s an unavoidable reality of easily sharable dev envs.
The real reason for this shift is that Kubernetes moved to containerd, which they cannot handle; Docker was much easier. Differential workloads are not the right thing to blame.
Also, there is a long tail of issues to fix if you do it with Kubernetes.
Kubernetes does not just give you scaling; it gives you many things: run on any architecture, stay close to your deployment, etc.
https://github.com/Mirantis/cri-dockerd
Most of the managed Kubernetes providers (GKE, EKS) do not support this new shim, and it is possibly hard to run even on bare metal.
> development environments
Kubernetes has never ever struck me as a good idea for a development environment. I'm surprised it took the author this long to figure out.
K8s can be a lifesaver for production, staging, testing, ... depending on your requirements and infrastructure.
> SSD RAID 0
> A simpler version of this setup is to use a single SSD attached to the node. This approach provides lower IOPS and bandwidth, and still binds the data to individual nodes.
Are you sure a single SSD is that slow? NVMe devices are so fast that I can hardly believe there's any need for RAID 0.
In AWS, IIRC, NVMe drives max out at 2 GB/s - I'm not sure why that's the case. I know there were issues with the PCIe controller being the bottleneck in the past, but I suspect there's more to it than that.
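If you want to sanity-check a specific instance, fio gives a quick read on what the local NVMe actually delivers - a sketch, with placeholder paths and sizes:

  # 4k random reads against a scratch file on the NVMe mount
  # (don't point --filename at a disk holding data you care about)
  fio --name=randread --filename=/mnt/nvme/fio-test --size=4G \
      --rw=randread --bs=4k --iodepth=32 --numjobs=4 \
      --direct=1 --runtime=30 --time_based --group_reporting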
I was intrigued because the development environment problem is similar to the data scientist one - data gravity, GPU sharing, etc. - but I'm confused about the solution.
Oddly, I left with a funny alternate takeaway: one by one, their clever in-house tweaks & scheduling preferences were recognized by the community and turned into standard k8s knobs.
So I'm back to the original question: what is fundamentally left? It sounds like one part is maintaining a clean container path to simplify a local deploy, which a lot of k8s teams do (ex: most of our enterprise customers prefer our docker compose & AMIs over k8s). But more importantly, is there something architecturally fundamental about how envs run that k8s cannot do - something they do not identify?
OP here. The Kubernetes community has been fantastic at evolving the platform, and we've greatly enjoyed being in the middle of it. Indeed, many of the things we had to build next to Kubernetes have now become part of k8s itself.
Still, some of the core challenges remain:
- The flexibility Kubernetes affords makes it hard to build and distribute a product with such specific requirements across the broad swath of differently set up Kubernetes installations. Managed Kubernetes services help, but come with their own restrictions (e.g. Kernel versions on GKE).
- State handling and storage remain unsolved. PVCs are not reliable enough, subject to a lot of variance (see point above), and depending on the backing storage have vastly different behaviour. Local disks (which we use to this day) make workspace startup and backup expensive from a resource perspective and hard to predict timing-wise.
- User namespaces have come a long way in Kubernetes, but by themselves are not enough: /proc is still masked, FUSE is still not usable.
- Startup times, specifically container pulls and backup restoration, are hard to optimize because they depend on a lot of factors outside of our control (image homogeneity, cluster configuration).
Fundamentally, Kubernetes simply isn't the right choice here. It's possible to make it work, but at some point the ROI of running on Kubernetes simply isn't there.
Thanks!
AFAICT, a lot of that comes down to storage abstractions, which I'll be curious to see the answer on! Pinned local storage <> cloud native is frustrating.
I sense another big chunk is the fast-secure-start problem that Firecracker (noted in the blog post) solves but k8s is not currently equipped for. Our team has been puzzling over that one for a while, and part of our guess is incentives. It's been 5+ years since Firecracker came out, so it's likewise been frustrating to see.
> We’ll be posting a lot more about Gitpod Flex architecture in the coming weeks or months. I’d love to invite you on November the 6th to a virtual event where I’ll be giving a demo of Gitpod Flex and I’ll deep-dive into the architecture and security model at length.
Bottom of the post.
> Autoscaler plugins: In June 2022, we switched to using cluster-autoscaler plugins when they were introduced.
Does anyone have any links for cluster-autoscaler plugins? Searching is drawing a blank, even in the cluster-autoscaler repo itself. Did this concept get ditched/removed?
I was wondering if there's a productivity angle too. Take Ceph vs Rook, for example. If a Ceph cluster needs all the resources on its machines, and the cluster manages its own resources too, then moving to Rook does not give you any additional features. All of the 50K additional lines of code in Rook are there to set up CSIs and StatefulSets and whatnot, just to get Ceph working on Kubernetes.
The article is an excellent cautionary tale. Debugging an app in a container is one thing. Debugging an app running inside a Kubernetes node is a rabbit hole that demands many more hours and much more expertise.
Have folks seen success with https://earthly.dev/ as a tool in their dev cycle?
On a side note: has anybody experience with MicroK8s? I'd love to learn stories about it. I'm interested in both dev and production experiences.
I can completely relate to anyone abandoning K8s. I'm working with dstack, an open-source alternative to K8s for AI infra [1]. We talk to many people who are frustrated with K8s, especially for GPU and AI workloads.
[1] https://github.com/dstackai/dstack
I really like dstack, keep up the great work
You just simplified the Kubernetes management system.
Kubernetes is awesome, but I understand what the article is getting at. K8s was designed for a mostly homogeneous architecture, where your platform requirements end with "deploy this service to my cluster" and you don't really care about the specifics of how it's scheduled.
A heterogeneous architecture with multi-tenancy poses some unique challenges because, as mentioned in the article, you get highly inconsistent usage patterns across different services. Also, arbitrary code execution (with sandboxing) can present a significant challenge. For security, you ideally need full isolation between services which belong to different users; this isolation wasn't a primary design goal of Kubernetes.
That said, you can probably still use K8s, but in a different way. For smaller customers, you could co-locate on the same cluster, but for larger customers which have high scalability requirements, you could have a separate K8s cluster for each one. Surely for such customers, it's worth the extra effort.
So in conclusion, I don't think the problems which were identified necessarily warrant abandoning K8s entirely, but maybe just a rethinking of how K8s is used. K8s still provides a lot of value in treating a whole cluster of computers as a single machine, especially if all your architecture is already set up for it. In addition to scheduling/orchestration, K8s offers a lot of very nice-to-have features like performance monitoring, dashboards, aggregated logs, ingress, health checks, ...
Leaving this comment here so I'll always come back to read this as someone who was considering kubernetes for a platform like gitpod
Remember that you can favorite posts.
The problem with "development environments", like other interactive workloads, is that there is a human at the other end that desires a good interactive experience with every keypress. It's a radically different problem space than what k8s was designed for.
From a resource provider's perspective, the only way to squeeze a margin out of that space would be to reverse engineer 100% of human developer behavior so that you can ~perfectly predict "slack" in the system that could be reallocated to other users. Otherwise it's just a worse DX, as TFA gives examples of. Not a business I'm envious to be in... Just give everyone a dedicated VM or desktop, and make sure there's a batch system for big workloads.
I read this article and I still don't understand what's wrong with Kubernetes for this task. Everything you would do with virtual machines could be done with Kubernetes with very similar results.
I guess the team just wants to rewrite everything; it happens. Management should prevent that.
I also recently left Kubernetes. It was a huge waste of time and money. I've replaced it with just a series of services on Google Cloud Run and then using Google's Cloud Run Tasks services for longer running tasks.
The infrastructure is now incredibly understandable, simple, and cost-effective.
Kubernetes cost us over $1 million in unnecessary DevOps time and Google Cloud spend, and even worse, it cost us time to market. Stay off Kubernetes as long as you can in your company, unless you are basically forced onto it. You should view it as an unnecessary evil that comes with massive downsides in terms of complexity and cost.
As far as I can tell, there actually is no AWS equivalent to GCP Cloud Run. The closest equivalents I know of are ECS on Fargate, which is more like managed Kubernetes except without Kubernetes compatibility or modern features, and App Runner, which is closer in concept but also sorely lacking in comparable features.
Wow, very interesting - I think we could discuss this for hours.
1) What do you think of providers like Hetzner / Linode / DigitalOcean (assuming stable workloads)?
2) What do you think of https://sst.dev/ or https://encore.dev/? (They support a rather easier migration.)
3) Could you indicate the split of that $1 million between DevOps time and unnecessary Google Cloud costs? Were there outliers (like "oops, our intern didn't add this specific variable, misconfigured the cloud, and wasted $10k on GCloud"), or was it bandwidth costing that much more on GCloud (I don't think the latter is the case, though)?
Looking forward to chatting with you!
Aren't you afraid of now being stuck with GCP?
It is just a bunch of docker containers. Some run in tasks and some run as auto-scaling services. Would probably take a week to switch to AWS as there are equivalent managed services there.
But this is really a spurious concern. I myself used to care about it years ago. In practice, though, people rarely switch between cloud providers: the incremental benefits are minor, the providers are nearly equivalent, and there is not much to be gained by moving from one to the other unless politics are involved (e.g. someone high up wants a specific provider).
How does the orchestration work? How do you share storage? How do the docker containers know how to find each other? How does security work?
I feel like Kubernetes' downfall, for me, is the number of "enterprise" features it supported (or got convinced into supporting), and enterprise features do what they do best: turn the simplest of operations into a disaster.
> How does the orchestration work?
GitHub Actions CI. Take this, add a few more dependencies and a matrix strategy, and you are good to go: https://github.com/bhouston/template-typescript-monorepo/blo... For dev environments, you can add suffixes to the service names based on branches.
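(Not the actual workflow from that template - just a hedged sketch of the build-and-deploy pattern, with hypothetical project and service names:)

  # .github/workflows/deploy.yml - illustrative only; names are placeholders
  name: deploy
  on:
    push:
      branches: [main]
  env:
    PROJECT_ID: my-gcp-project        # hypothetical project
  jobs:
    deploy:
      runs-on: ubuntu-latest
      strategy:
        matrix:
          service: [web, api]         # one Cloud Run service per entry
      steps:
        - uses: actions/checkout@v4
        - uses: google-github-actions/auth@v2
          with:
            credentials_json: ${{ secrets.GCP_SA_KEY }}
        - uses: google-github-actions/setup-gcloud@v2
        - name: Build and push image
          run: |
            gcloud builds submit services/${{ matrix.service }} \
              --tag gcr.io/$PROJECT_ID/${{ matrix.service }}
        - name: Deploy to Cloud Run
          run: |
            gcloud run deploy ${{ matrix.service }} \
              --image gcr.io/$PROJECT_ID/${{ matrix.service }} \
              --region us-central1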
> How do you share storage?
I use managed DBs and Cloud Storage for shared storage. I think that provisioning your own SSDs/HDDs in the cloud is indicative of an anti-pattern in your architecture.
> How do the docker containers know how to find each other?
I try to avoid too much direct communication between services, and instead go through pub/sub or similar. But you can set up each service with a domain name and access them that way. With https://web3dsurvey.com, I have an API at https://api.web3dsurvey.com and then a review environment (connected to the main branch) at https://preview.web3dsurvey.com / https://api.preview.web3dsurvey.com.
> How does security work?
You can configure Cloud Run services to be internal-only so they don't accept outside connections. Otherwise, one can just use JWT or whatever is normal on the routes in your web server.
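Concretely, that's just a couple of flags at deploy time - a sketch with a hypothetical service name:

  # Only reachable from inside the VPC/project, and callers need IAM invoker rights
  gcloud run deploy internal-api \
    --image gcr.io/my-project/internal-api \
    --region us-central1 \
    --ingress internal \
    --no-allow-unauthenticated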
> But you can set up each service with a domain name and access them that way.
Are you using Cloud Run domain mappings for this, or something else?
I have been converging on a similar stack, but trying to avoid using a load balancer in an effort to keep fixed costs low.
Yup domain mappings for now. There is some label support in Cloud Run but I haven’t explored it yet. You can also get the automatic domain name for a service via the cloud run tools.
Yeah, I definitely want to avoid a load balancer or gateway or endpoints as well, for cost purposes.
One of Cloud Run's main advantages is that it's literally just telling it how to run containers. You could run those same containers in OpenFaaS, Lambda, etc relatively easily.
What stack are you deploying?
Stuff like this, just at larger scale:
https://github.com/bhouston/template-typescript-monorepo
This is my living template of best practices.
I'd investigate getting a standalone Node.js build out (looks like you already have this) and then just doing a simple SCP of the build to a VPS. From there, just use a systemd unit to handle startup/restarts on errors. For logging, something like the Winston package does the trick.
If you want some guidance, shoot me an email (in profile). You can run most stuff for peanuts.
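For reference, the systemd side of this suggestion is tiny. A minimal sketch, with hypothetical paths and names:

  # /etc/systemd/system/myapp.service
  [Unit]
  Description=My Node.js app
  After=network.target

  [Service]
  ExecStart=/usr/bin/node /srv/myapp/dist/index.js
  WorkingDirectory=/srv/myapp
  Restart=on-failure
  RestartSec=5
  User=myapp
  Environment=NODE_ENV=production

  [Install]
  WantedBy=multi-user.target

Then `sudo systemctl daemon-reload && sudo systemctl enable --now myapp` gets you restart-on-crash and start-on-boot.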
> I'd investigate getting a standalone Node.js build out (looks like you already have this) and then just doing a simple SCP of the build to a VPS. From there, just use a systemd unit to handle startup/restarts on errors. For logging, something like the Winston package does the trick. If you want some guidance, shoot me an email (in profile). You can run most stuff for peanuts.
I appreciate the offer! But it is not as robust, it is more expensive, and it misses a lot of benefits.
Back in the 1990s I did FTP my website to a VPS after I graduated from Geocities.
Google Cloud charges based on CPU used. Thus, when my servers have no traffic, they cost less than $1/month; when they have traffic, they are still cost-effective. https://web3dsurvey.com gets about 500,000 hits per month and costs me $4/month to run both the Remix web server and the Fastify API server. Details here: https://x.com/benhouston3d/status/1840811854911668641
It will also autoscale under load. Thus, when one of my posts was briefly the top story on Hacker News last month, Google Cloud Run added more instances to handle the load (because I do not run my personal site behind a CDN - it costs too much; I prefer to pay ~$1/month for hosting).
Also deploying Docker containers that build on Github Actions CI in a few minutes is a great automated experience.
I do also use Google services like Cloud Storage, Firestore, BigQuery etc. And it is easier to just run it on GCP infrastructure for speed.
I also have to version various tools that get installed in the Docker image, like Blender, Chromium, etc. This is the perfect use case for Docker.
I feel this is pretty close to optimal. Fast, cheap, scalable, automated and robust.
There was a recent HN post showing a setup that didn't even use Docker - some other, much simpler mechanism. I really enjoyed that article.
Yeah, I have the same thoughts. Also, if possible, Bun can reduce memory usage in very basic scenarios: https://www.youtube.com/watch?v=yJmyYosyDDM
Or just https://github.com/mightymoud/sidekick, or coolify or dokku or dockify - there are a million such things. Oh, and I just remembered Kamal deploy from DHH, and Docker Swarm IIRC (though people seem to have forgotten Docker Swarm!).
I like this idea very much !
You know that Cloud Run is effectively a Kubernetes PaaS, right?
Google employee here. Not the case. Cloud Run doesn't run on Kubernetes. It supports the Knative interface which is an OSS project for Kubernetes-based serverless. But Cloud Run is a fully managed service that sits directly atop Borg (https://cloud.google.com/run/docs/securing/security).
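(To make "supports the Knative interface" concrete: Cloud Run accepts Knative-Serving-shaped manifests, e.g. via `gcloud run services replace`. A minimal sketch, with a placeholder image:)

  apiVersion: serving.knative.dev/v1
  kind: Service
  metadata:
    name: hello
  spec:
    template:
      spec:
        containers:
          - image: gcr.io/my-project/hello   # hypothetical image
            ports:
              - containerPort: 8080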
I guess the point is that for the OP, Kubernetes is now someone else's problem.
> You know that Cloud Run is a Kubernetes PaaS, right?
Yup. Isn't it Knative Serving or a home grown Google alternative to it? https://knative.dev/docs/serving/
The key is that I am not managing Kubernetes and I am not paying for it - that is a fool's errand, and one that is incredibly rarely needed. Who cares what is underneath the simple Cloud Run developer UX? What matters to me is cost, simplicity, speed, and understandability. You get that with Cloud Run, and you don't with Kubernetes.