So to make sure I understand, this would mean:
1. Programs built against MLX -> Can take advantage of CUDA-enabled chips
but not:
2. CUDA programs -> Can now run on Apple Silicon.
Because #2 would be a copyright violation (specifically with respect to Nvidia's famous moat).
Is this correct?
It's 1.
It means that a developer can use their relatively low-powered Apple device (with UMA) to develop for deployment on Nvidia's relatively high-powered systems.
That's nice to have for a range of reasons.
If Apple cannot do their own implementation of CUDA due to copyright, the second-best option is this: getting developers to build for MLX (which is on their laptops) and still get NVIDIA hardware support.
Apple should do a similar thing for AMD.
"relatively high powered"? there's nothing faster out there.
#2 is not a copyright violation. You can reimplement APIs.
CUDA is not an API, it is a set of libraries written by NVIDIA. You'd have to reimplement those libraries, and for people to care at all you'd have to reimplement the optimizations in those libraries. That does get into various IP issues.
The famous Android Java fight is probably the most important case of that discussion.
I don't think #2 is really true - AMD's HIP has been doing this exact thing since giving up on OpenCL back in ~'17/'18.
I want #3: being able to connect an NVIDIA GPU to Apple Silicon and run CUDA. Take advantage of Apple Silicon + unified memory + GPU + CUDA with PyTorch, JAX, or TensorFlow.
Haven’t really explored MLX so can’t speak about it.
No, it's because doing 2 would be substantially harder.
#2 would also further cement CUDA as the de facto API to target, and nobody would write MLX-targeted code instead.
This way, you’re more incentivized to write MLX and have it run everywhere. It’s a situation of everyone wins, especially Apple because they can optimize it further for their platforms.
#2 would be Google v. Oracle wouldn’t it?
If you're going "wait, no Apple platform has first-party CUDA support!", note that this set of patches also adds support for "Linux [platforms] with CUDA 12 and SM 7.0 (Volta) and up".
https://ml-explore.github.io/mlx/build/html/install.html
It's coming from zcbenz, who created Electron, among other things: https://zcbenz.com/ Nice.
How does this work when one of the key features of MLX is using a unified memory architecture? (see bullets on repo readme: https://github.com/ml-explore/mlx )
I would think that bringing that to all UMA APUs (of any vendor) would be interesting, but discrete GPUs would definitely need a different approach?
edit: reading the PR comments, it appears that CUDA supports a UMA API directly, and will transparently copy as needed.
Eh, yes, but in my experience its lack of prefetch leads to significant memory stalls waiting for the copy. It might be suitable if your entire dataset fits in VRAM after doing a “manual prefetch”, but it killed performance for my application (ML training) so hard that we were actually given time to move to streaming loads.
Random aside: A lot of the people working on MLX don't seem to be officially affiliated with Apple at least in a superficial review. See for example: https://x.com/prince_canuma
Idly wondering: is Apple bankrolling this but wanting to keep it on the DL? There were also rumours the team was looking to move at one point?
> Being able to write/test code locally on a Mac and then deploy to super computers would make a good developer experience.
Does this mean you can use MLX on Linux now?
Edit:
Just tested it and it's working, but only the Python 3.12 version is available on PyPI right now: https://pypi.org/project/mlx-cuda/#files
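For anyone who wants to try the same, here's a minimal smoke test of the kind that should work on a Linux box with a supported NVIDIA GPU; this is a sketch assuming the mlx-cuda wheel exposes the usual mlx.core Python API, not the exact commands used above:

    # pip install mlx-cuda   (Linux, CUDA 12, SM 7.0+; only a Python 3.12 wheel on PyPI for now)
    import mlx.core as mx

    print(mx.default_device())      # should report the GPU when the CUDA backend is active

    a = mx.random.normal((1024, 1024))
    b = mx.random.normal((1024, 1024))
    c = mx.matmul(a, b)
    mx.eval(c)                      # MLX is lazy, so force the matmul to actually execute
    print(c.shape, c.dtype)

The same script should run unchanged on Apple Silicon with the Metal backend, which is the "write once, run everywhere" appeal.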
> This PR is an ongoing effort to add a CUDA backend to MLX
Looks like it allows MLX code to compile and run on x86 + GeForce hardware, not the other way around.
I wonder how much this is a result of Strix Halo. I had a fairly standard stipend for a work computer that I didn't end up using for a while, so I recently cashed it in on the EVO-X2 and fuck me sideways: that thing is easily competitive with the mid-range znver5 EPYC machines I run substitors on. It mops the floor with any mere-mortal EC2 or GCE instance; maybe some r1337.xxxxlarge.metal.metal or something has an edge, but it blows away the z1d.metal and the c6.2xlarge or whatever type stuff (fast cores, good NIC, table stakes). And those things are 3-10K a month with heavy provisioned IOPS. This thing has real NVMe and it cost $1,800.
I haven't done much local inference on it, but various YouTubers are starting to call the DGX Spark overkill / overpriced next to Strix Halo. The catch of course is ROCm isn't there yet (they're seeming serious now though, matter of time).
Flawless CUDA on Apple gear would make it really tempting in a way that isn't true with Strix so cheap and good.
For the uninitiated, Strix Halo is the same as the AMD Ryzen AI Max+ 395, which will be in the Framework Desktop and is starting to show up in some mini PCs as well.
The memory bandwidth on that thing is 200GB/s. That's great compared to most other consumer-level x86 platforms, but quite far off of an Nvidia GPU (a 5090 has 1792GB/s, dunno about the pro level cards) or even Apple's best (M3 Ultra has 800GB/s).
It certainly seems like a great value. But for memory bandwidth intensive applications like LLMs, it is just barely entering the realm of "good enough".
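To put rough numbers on "good enough", here's a back-of-the-envelope sketch; the ~35 GB model size is a hypothetical (roughly a 70B-parameter model at 4-bit quantization), and it assumes single-stream decoding that is purely memory-bandwidth bound, ignoring compute and overlap:

    # Upper bound on decode speed: each generated token streams the full weight
    # set from memory at least once, so tokens/s <= bandwidth / model size.
    model_bytes = 35e9  # hypothetical ~70B model at ~4-bit quantization
    for name, bw in [("Strix Halo", 200e9), ("M3 Ultra", 800e9), ("RTX 5090", 1792e9)]:
        print(f"{name}: <= {bw / model_bytes:.0f} tokens/s")

That works out to roughly 6, 23, and 51 tokens/s respectively, which is why 200GB/s feels barely good enough for larger models.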
It’s pretty explicitly targeting cloud cluster training in the PR description.
> The catch of course is ROCm isn't there yet (they're seeming serious now though, matter of time).
Competitive AMD GPU neural compute has been any day now for at least 10 years.
How is it vs. the M4 Mac mini?
I’ve been very impressed with MLX models; I can open up local models to everyone in the house, something I wouldn’t dare with my Nvidia computer for the risk of burning down the house.
I’ve been hoping Apple Silicon becomes a serious contender for Nvidia chips; I wonder if the CUDA support is just Embrace, extend, and extinguish (EEE).
If Apple supported Nvidia cards, it would be the #1 solution for developers.
If Apple doubled the specs of their Ultra M processor every year, in numbers of cores, RAM cells, internal and external bandwidth, until both the Ultra processor and its RAM plane took up full wafers, .... but still fit in a Mac Studio case, with a new white reverse-power heat extraction USB-C+ cable designed to be terminated at a port on a small wireless heat exchanger dish, which instantly beamed all the waste heat into space, at such high efficiency that the Studio internals could operate at -100 Celsius, and all those cores overclocked, oh, they overclocked, ...
Yes we can dream!
It would be great if Apple continued pushing M processors to the next level, in part to go vertical into the cloud.
Or if they start supporting nVidia.
The latter seems less Apple-y. But they must be considering the value of a cloud-level, Apple-friendly AI computing solution, so something is likely (?) to happen.
Apple is planning to build data centers with M-series chips, both for app development and testing and to host external services!
This is exciting. So this is using CUDA's unified memory? I wonder how well that works. Is the behavior of unified memory in CUDA actually the same as on Apple silicon? On Apple silicon, as I understand it, the memory is shared between GPU and CPU anyway. But for CUDA, this is not the case. So when you have some tensor on the CPU, how will it end up on the GPU? This needs a copy somehow. Or is this all hidden by CUDA?
In the absence of hardware unified memory, CUDA will automatically copy data between CPU/GPU when there are page faults.
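A minimal way to see this from Python is to route allocations through cudaMallocManaged; the sketch below uses CuPy purely as an illustration (it is not something MLX or this PR uses), assuming CuPy's documented managed-memory allocator:

    import cupy as cp

    # Allocate through cudaMallocManaged (CUDA managed/unified memory): the driver
    # migrates pages between host and device on demand instead of requiring an
    # explicit cudaMemcpy for every transfer.
    cp.cuda.set_allocator(cp.cuda.malloc_managed)

    a = cp.arange(1 << 20, dtype=cp.float32)
    b = cp.sqrt(a)        # GPU kernel; touched pages are migrated to the device
    print(float(b[3]))    # ~1.732; pulling the value back to the host is transparent

As another commenter notes about memory stalls, demand paging alone can be slow; CUDA also offers explicit prefetching (cudaMemPrefetchAsync) to hide that, though whether MLX's CUDA backend uses it is a separate question.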
This is my guess, but does higher end hardware they sell, like the server rack stuff for AI, perhaps have the unified memory?
I know standard GPUs don’t.
The patch suggested one of the reasons for it was to make it easy to develop on a Mac and run on a super computer. So the hardware with the unified memory might be in that class.
Is this for Macs with NVIDIA cards in them, or Apple Metal/Apple Silicon speaking CUDA? I can't really tell.
Edit: looks like it's "write once, use everywhere". Write MLX, run it on Linux CUDA, and Apple Silicon/Metal.
Seems you already found the answer.
I’ll note Apple hasn’t shipped an Nvidia card in a very, very long time. Even on the Mac Pros before Apple Silicon, they only ever sold AMD cards.
My understanding from rumors is that they had a falling out over the problems with the dual GPU MacBook Pros and the quality of drivers.
I have no idea if sticking one in on the PCI bus would let you use it for AI stuff, though.
> "write once, use everywhere"
So my MLX workloads can soon be offloaded to the cloud!?
This is the only strategy humble me can see working for CUDA in MLX
Neither; it is for Linux computers with NVIDIA cards.
Why is this a big deal, can anyone explain if they are familiar with the space?
> NVIDIA hardware is widely used for academic and massive computations. Being able to write/test code locally on a Mac and then deploy to super computers would make a good developer experience.
That one stands out to me as a mac user.
I thought you either use MLX for Apple silicon or you compile it for CUDA.
Now do linux support / drivers for Mac hardware!
I think we're seeing the twilight of those efforts. Asahi Linux was an absolute powerhouse of reverse-engineering prowess, and it took years to get decent Vulkan coverage and half of the modern lineup's GPUs supported. Meanwhile, AMD and even Intel are shipping Vulkan 1.3 drivers day one on new hardware. It's a cool enthusiast effort to extend the longevity of the hardware, but it bears repeating: nobody is disrupting Nvidia's bottom line here. Apple doesn't sell hardware competitive with Nvidia's datacenter hardware, and even if they did, it's not supported by the community. It's doubtful that Apple would make any attempt to help them.
There seems to be a pervasive assumption that Apple is still making a VolksComputer in 2025, blithely supporting a freer status quo for computing. They laid out their priorities completely with Apple Silicon: you're either on Apple's side or falling behind. Just the way they want it.
Seriously. Those Apple guys became delusional, especially after Jobs passed away. These guys just sat on their successes and did nothing for a decade plus. M1 was nice, but that was all Jobs' doing and planning. I don’t like this Apple. They forgot how to innovate.
But I guess we have a VR device nobody wants.
I wonder if Jensen is scared. If this opens up the door to other implementations this could be a real threat to Nvidia. CUDA on AMD, CUDA on Intel, etc. Might we see actual competition?
I think this is the other way around. It won't be CUDA on anything except Nvidia.
However, this might make MLX a much stronger competitor to PyTorch.
Why? Everyone keeps trying to copy CUDA while failing to understand why many of us love it.
This instance is the other way around, but that's what this is – CUDA on AMD (or other platforms): https://docs.scale-lang.com/stable/
> CUDA backend
backend
Awesome
That means the next Apple computer is going to use Nvidia GPU(s).
There’s no evidence of that. The post clearly identifies a far more probable reason: letting things be developed on Macs and then deployed on Nvidia supercomputers.
But it's not an Apple-submitted PR.
Edit: I had the details of the Google v Oracle case wrong. SCOTUS found that re-implementing an API does not infringe copyright. I was remembering the first and second appellate rulings.
Also apparently this is not a re-implementation of CUDA.
You misunderstood; this is not re-implementing the CUDA API.
MLX is a PyTorch-like framework.
This is exactly the kind of thing I wouldn’t opine on until like, an actual lawyer weighs in after thoroughly researching it. There are just too many shades of meaning in this kind of case law for laymen to draw actionable conclusions directly from the opinions.
Though I imagine that if Apple is doing this themselves, they likely know what they’re doing, whatever it is.
This is a CUDA backend for MLX, not an MLX backend for CUDA!