Doesn’t surprise me at all that people who know what they’re doing are building their own images with nix for ML. Tens of millions of dollars have been wasted in the past 2 years by teams who are too stupid to upgrade from buggy software bundled into their “golden” docker container, or too slow to upgrade their drivers/kernels/toolkits. It’s such a shame. It’s not that hard.
Edit: see also the horrors that exist when you mix nvidia software versions: https://developer.nvidia.com/blog/cuda-c-compiler-updates-im...
I use Nix and like it, but in terms of DX docker is still miles ahead. I liken it to Python vs Rust. Use the right tool.
Can you be explicit about how the dollars are being wasted? Maybe it's obvious to you, but how does an old kernel waste money?
The modern ML cards are much more raw than people realize. This isn’t a highly mature ecosystem with stable software, there are horrible bugs. It’s gotten better, but there are still big problems, and the biggest problem is that so many people are too stupid to use new releases with the fixes. They stick to the versions they already have because of nothing other than superstition.
Go look at the llama 3 whitepaper and look at how frequently their training jobs died and needed to be restarted. Quoting:
> During a 54-day snapshot period of pre-training, we experienced a total of 466 job interruptions. Of these, 47 were planned interruptions due to automated maintenance operations such as firmware upgrades or operator-initiated operations like configuration or dataset updates. The remaining 419 were unexpected interruptions, which are classified in Table 5. Approximately 78% of the unexpected interruptions are attributed to confirmed hardware issues, such as GPU or host component failures, or suspected hardware-related issues like silent data corruption and unplanned individual host maintenance events.
The firmware and driver quality is not what people think it is. There’s also a lot of low-level software like NCCL and the toolkit that exacerbates issues in specific drivers and firmware versions. Grep for “workaround” in the NCCL source code and you’ll see some of them. It’s impossible to validate and test all permutations. It’s also worth mentioning that the drivers interact with a lot of other kernel subsystems. I’d point to HMM, the heterogeneous memory manager, which is hugely important for nvidia-uvm; it was only introduced in v6.1 and sees a lot of activity.
Or go look at the number of patches being applied to mlx5. Not all of those patches get backported into stable trees. God help you if your network stack uses an out-of-tree driver.
It always cracks me up when people use the word "stupid" to insult others' intelligence. What a pathetically low-effort word to use.
When you’re responsible for supporting people who refuse to receive patches like this one [1], and those same people have the power to page your phone at 11pm on the weekend… you quickly learn how to call a spade a spade.
[1]: https://patchwork.ozlabs.org/project/ubuntu-kernel/patch/202...
That wasn't what he used the word for. I understood his point perfectly: there are AI teams that are not knowledgeable or skilled enough to modify and enhance the docker images or toolkits that train/run the models. It takes medium-to-advanced skills to get drivers working properly. He used the shorthand "too stupid to" instead of what I wrote above.
New corollary: sometimes new tech gets built because you don't know how to correctly use existing tech.
Are you referring to this Nix effort or to Docker? Because that largely applies to most usages of Docker.
Great, can't wait for the systemd crew to come out with: Docker Was Too Slow, So We Replaced It: Systemd in Production [asciinema]
No joke, it's already there, systemd-nspawn can run OCI containers.
Honestly I've been loving systemd-nspawn using mkosi to build containers, distroless ones too at that where sensible. Works a treat for building vms too.
Scales wonderfully; fine-grained permissions and configuration work exactly how you'd hope coming from systemd services. I appreciate that it leverages various Linux-isms like btrfs snapshots for faster read-only or ephemeral containers.
People still by and large have this weird assumption that you can only do OS containers with nspawn; never been too sure where that idea came from.
Some people, when confronted with a problem, think "I know, I'll use Nix." Now they have two problems.
Seems like anti-intellectualism is spreading on HN too.
Yup, and there's a high correlation between people rewriting everything in Rust and converting everything else to Nix. It's like a complexity fetish.
"your entire static website is running on GitHub pages? Sounds like legacy tech debt. I need to replace it with kubernetes pronto"
The thing with some engineers is that if there's no user problem to solve, they'll happily solve some hypothetical problem instead.
Having said that, my weekend project was "upgrading" my RSS reader to run HA on Kubernetes.
Alternative: just produce relocatable builds that don’t require all of this unnecessary extra infrastructure
Please elaborate. How does one "just" do that?
Deploying computer programs isn't that hard. What you actually need to run is pretty straightforward. Depend on glibc, copy all your other shared-lib dependencies alongside the binary, and point RPATH at them. Pretend `/lib` is locked at initial install. Remove `/usr/lib` from the path and include everything.
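For concreteness, a minimal sketch of that approach in Python (the script and the `dist/` layout are hypothetical; it assumes `ldd` and `patchelf` are installed, and deliberately leaves glibc to the host):

```python
"""Sketch: bundle a binary's shared-library dependencies and point RPATH at them."""
import shutil
import subprocess
import sys
from pathlib import Path

# Libraries deliberately left to the host (glibc and friends).
HOST_LIBS = ("libc.so", "libm.so", "libdl.so", "libpthread.so", "librt.so", "ld-linux")

def bundle(binary: str, dest: str = "dist") -> None:
    out = Path(dest)
    libdir = out / "lib"
    libdir.mkdir(parents=True, exist_ok=True)
    shutil.copy2(binary, out / Path(binary).name)

    # `ldd` lines look like: "libfoo.so.1 => /usr/lib/libfoo.so.1 (0x...)"
    ldd = subprocess.run(["ldd", binary], check=True, capture_output=True, text=True)
    for line in ldd.stdout.splitlines():
        parts = line.split()
        if "=>" in parts and len(parts) >= 3 and parts[2].startswith("/"):
            if not any(host in parts[2] for host in HOST_LIBS):
                shutil.copy2(parts[2], libdir / Path(parts[2]).name)

    # Make the copied binary look for its libs next to itself first.
    subprocess.run(
        ["patchelf", "--set-rpath", "$ORIGIN/lib", str(out / Path(binary).name)],
        check=True,
    )

if __name__ == "__main__":
    bundle(sys.argv[1])
```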
Docker was made because Linux sucks at running computer programs. Which is a very silly thing to be bad at. But here we are.
What has happened in more recent years is that CMake sucks ass, so people have been abusing Docker and now Nix as a build system. Blech.
The speaker does get it right at the end. A Bazel/Buck2 type solution is correct. An actual build system. They're abusing Nix and adding more layers to provide better caching. Sure, I guess.
If you step back and look at the final output of what needs to be produced and deployed it's not all that complicated. Especially if you get rid of the super complicated package dependency graph resolution step and simply vendor the versions of the packages you want. Which everyone should do, and a company like Anthropic should definitely do.
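As one concrete flavor of that vendoring, using Python packages purely as the example (the `requirements.lock` and `vendor/` names are illustrative): pin exact versions, download them once into the repo, and install offline from that directory so no resolver runs at deploy time.

```python
"""Sketch: vendor pinned packages instead of resolving dependencies at deploy time."""
import subprocess

# One-time (or per-upgrade) step: fetch the exact pinned versions into the repo.
subprocess.run(["pip", "download", "-r", "requirements.lock", "-d", "vendor/"], check=True)

# Build/deploy step: install only from the vendored directory, never the network.
subprocess.run(
    ["pip", "install", "--no-index", "--find-links", "vendor/", "-r", "requirements.lock"],
    check=True,
)
```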
Docker is overkill if all you really need is app packaging.
Docker containers may not be portable anyway when the CUDA version used in the container has to match the kernel driver and GPU firmware, etc.
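For concreteness, a rough sketch of the compatibility check that implies, assuming `nvidia-smi` is present on the host; the minimum-driver numbers below are quoted from memory and should be verified against NVIDIA's compatibility docs:

```python
"""Sketch: is the host driver new enough for the CUDA runtime baked into an image?"""
import subprocess

# Assumed minimum Linux driver versions per CUDA runtime; verify against NVIDIA docs.
MIN_DRIVER = {"11.8": (450, 80), "12.2": (525, 60), "12.4": (525, 60)}

def host_driver_version() -> tuple:
    # e.g. "535.161.08" -> (535, 161, 8)
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        check=True, capture_output=True, text=True,
    ).stdout
    return tuple(int(x) for x in out.strip().splitlines()[0].split("."))

def image_is_compatible(image_cuda: str) -> bool:
    minimum = MIN_DRIVER.get(image_cuda)
    if minimum is None:
        return False  # unknown CUDA version: don't guess
    return host_driver_version() >= minimum

if __name__ == "__main__":
    print(image_is_compatible("12.2"))
```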