CUDA Ontology

(jamesakl.com)

130 points | by gugagore 4 days ago

18 comments

  • w-m 3 hours ago

    This is a good resource. But for the computer vision and machine learning practitioner, most of the fun starts where this article ends.

    nvcc from the CUDA toolkit has a compatibility range with the underlying host compilers like gcc. If you install a newer CUDA toolkit on an older machine, you'll likely need to upgrade your compiler toolchain as well and fix the paths.
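
    That compatibility range can be pictured as a per-toolkit bound on the host compiler version. A toy sketch (the version bounds below are made up for illustration; the authoritative table is in NVIDIA's CUDA installation guide for each release):

```python
# Toy model of nvcc's supported host-compiler range per CUDA toolkit release.
# All bounds here are hypothetical -- consult NVIDIA's installation guide
# for the real support matrix of your toolkit version.
SUPPORTED_GCC = {
    (11, 8): (6, 11),   # hypothetical: CUDA 11.8 tolerates gcc 6..11
    (12, 4): (6, 13),   # hypothetical: CUDA 12.4 tolerates gcc 6..13
}

def host_compiler_ok(cuda_version, gcc_major):
    """Return True if this (illustrative) table says gcc works with nvcc."""
    lo, hi = SUPPORTED_GCC.get(cuda_version, (0, -1))
    return lo <= gcc_major <= hi

# A newer toolkit widens the range, so an upgraded gcc is fine with it...
assert host_compiler_ok((12, 4), 13)
# ...but the same gcc is too new for the older toolkit's nvcc.
assert not host_compiler_ok((11, 8), 13)
```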

    While orchestration in many (research) projects happens from Python, some depend on building CUDA extensions. An innocent-looking Python project may not ship the compiled kernels and may require a CUDA toolkit to work correctly. Some package management solutions can install CUDA toolkits (conda/mamba, pixi); the pure-Python ones (pip, uv) cannot. This leaves you to match the correct CUDA toolkit to your Python environment yourself.

    conda specifically provides different channels (defaults/nvidia/pytorch/conda-forge) and, since conda 4.6, defaults to strict channel priority, meaning that if a package name exists in a higher-priority channel, lower-priority channels aren't considered for it. The strict default can make your requirements unsatisfiable, even though a suitable version of each required package exists somewhere across the channels. uv is neat and fast and awesome, but leaves you alone in dealing with the CUDA toolkit.
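
    The strict-priority failure mode is easy to model. A toy sketch (channel contents are hypothetical; this only mimics the documented "higher channel shadows lower channels by package name" rule, not conda's real solver):

```python
# Hypothetical channel contents, highest priority first.
channels = [
    ("defaults", {"cudatoolkit": ["11.8"]}),
    ("nvidia",   {"cudatoolkit": ["11.8", "12.1"]}),
]

def resolve_strict(name, required):
    # Strict priority: the first channel that carries the name at all
    # wins; lower channels are never consulted for that name.
    for _, pkgs in channels:
        if name in pkgs:
            return required if required in pkgs[name] else None
    return None

def resolve_flexible(name, required):
    # Flexible priority: any channel may satisfy the requirement.
    for _, pkgs in channels:
        if required in pkgs.get(name, []):
            return required
    return None

# Strict: "cudatoolkit" exists in defaults, so the nvidia channel's 12.1
# is never considered -> unsatisfiable despite being available.
assert resolve_strict("cudatoolkit", "12.1") is None
assert resolve_flexible("cudatoolkit", "12.1") == "12.1"
```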

    Also, code that compiles with older CUDA toolkit versions may not compile with newer ones. Newer hardware may require a CUDA toolkit newer than what the project maintainer intended. PyTorch ships with a specific CUDA runtime version; if your project has additional code that also uses CUDA extensions, it needs to match the CUDA runtime version of your installed PyTorch to work. Bringing up a project from a couple of years ago on the latest hardware may thus blow up on you on multiple fronts.
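
    A rough version-matching check for that last pitfall, sketched in Python (the helper and the major-version-only rule are simplifications; in practice PyTorch exposes its build's runtime as the string torch.version.cuda, and whether minor mismatches load cleanly varies):

```python
# Hypothetical helper: compare the CUDA runtime PyTorch was built against
# (e.g. "12.1") with the runtime a CUDA extension was compiled for.

def cuda_major(version):
    """Extract the major version from a dotted CUDA version string."""
    return int(version.split(".")[0])

def runtimes_match(torch_cuda, extension_cuda):
    # Simplified rule: extensions built against a different *major* CUDA
    # runtime than the installed PyTorch wheel typically fail to load.
    return cuda_major(torch_cuda) == cuda_major(extension_cuda)

assert runtimes_match("12.1", "12.4")       # same major: usually loadable
assert not runtimes_match("11.8", "12.1")   # major mismatch: expect trouble
```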

    • alecco a few seconds ago

      > nvcc from the CUDA toolkit has a compatibility range with the underlying host compilers like gcc. If you install a newer CUDA toolkit on an older machine, likely you'll need to upgrade your compiler toolchain as well, and fix the paths.

      Conversely, nvcc often stops working with major upgrades of gcc/clang. Fun times, indeed.

      This is why a lot of people just use NVIDIA's containers even for local solo dev. It's a hassle to set up initially (docker/podman hell) but all the tools are there and they work fine.

    • eapriv an hour ago

      Sounds like most of these problems come from using Python.

      • mellosouls an hour ago

        You're implying these problems would go away (and not just be replaced by new ones) with another language.

    • anotherpaul 3 hours ago

      Yes, this is the actual lived reality. Thank you for outlining it so well.

  • visarga 4 hours ago

    Wondering why a $4T company can't afford a smart installation assistant that can auto-detect problems and apply fixes as needed. I've wasted too many days chasing driver and torch versions. It's probably the worst part of working in ML. Combine this with Python's horrible package management and you've got a perfect combo - like the cough and the stitch.

    • fragmede 27 minutes ago

      Just have claude code fix it

    • numbers_guy 4 hours ago

      They provide containers to cater to those needs: https://catalog.ngc.nvidia.com/search

      • threeducks 3 hours ago

        After being frustrated once again by the CUDA installation experience, I thought I should give those containers a try. Unfortunately, my computer no longer booted after I followed the installation instructions for the NVIDIA container toolkit as outlined on the NVIDIA website. Reinstalling everything and following the instructions from some random blog post made it work, but I then found that the container with the CUDA version I needed had been deprecated.

        There were other problems, such as the research cluster of my university not having Docker, but that is a different issue.

      • YetAnotherNick 2 hours ago

        Containers don't include drivers, which is the primary reason for issues.

        • torginus 2 hours ago

          Containers, AFAIR, rely on the driver version matching between the host system and the container itself.

          We were on AWS when we used this, so setting up seemed easy enough: AWS gave you the driver, and a matching Docker image was easy enough to find.

  • bbx an hour ago

    For reference: CUDA means "Compute Unified Device Architecture".

  • pjmlp 3 hours ago

    Great overview, with lots of effort put into it.

    However, it misses the polyglot part (Fortran, Python GPU JIT, all the backends that target PTX), the library ecosystem (writing CUDA kernels should be the exception, not the rule), and the graphical debugging tools and IDE integration.

  • montyanderson 21 minutes ago

    this is fantastic

  • ArcHound 4 hours ago

    That is a great reference; it explains a lot of the small inaccuracies between various tutorials that surface when you're trying to debug these issues. Saved and printed, thanks a lot!

  • virajk_31 3 hours ago

    thanks for the kernel nomenclatures

  • zvr 3 hours ago

    Great explanation!

    It should probably also mention that everything CUDA is owned by NVIDIA, and "CUDA" itself is a registered trademark. The official way to refer to it is to spell it out as "NVIDIA® CUDA®" on first use, and subsequently refer to it as just CUDA.

    • threeducks 3 hours ago

      Why should the author use the registered trademark symbol?