I wonder if Apple ever followed up with this: https://github.com/apple/ml-ane-transformers
They claim their ANE-optimized models achieve "up to 10 times faster and 14 times lower peak memory consumption compared to baseline implementations."
AFAIK, neither MLX nor llama.cpp support ANE. Though llama.cpp started exploring this idea [0].
What's weird is that MLX is made by Apple and yet, they can't support ANE given its closed-source API! [1]
[0]: https://github.com/ggml-org/llama.cpp/issues/10453
[1]: https://github.com/ml-explore/mlx/issues/18#issuecomment-184...
Whisper.cpp has a CoreML option which, according to the docs, gives a 3x speed-up over CPU-only: https://github.com/ggml-org/whisper.cpp?tab=readme-ov-file#c...
Some outdated information about bare-metal use of the ANE is available in this whisper.cpp pull request: https://github.com/ggml-org/whisper.cpp/pull/1021 Even more outdated information is at: https://github.com/eiln/ane/tree/33a61249d773f8f50c02ab0b9fe... In short, the early (M1/M2) versions of the ANE are unlikely to be useful for modern LLM inference due to their seemingly exclusive focus on statically scheduled FP16 and INT8 MADDs.
More extensive information at https://github.com/tinygrad/tinygrad/tree/master/extra/accel... (from the Tinygrad folks, note that this is also similarly outdated) seems to basically confirm the above.
(The jury is still out for M3/M4 which currently have no Asahi support - thus, no current prospects for driving the ANE bare-metal. Note however that the M3/Pro/Max ANE reported performance numbers are quite close to the M2 version, so there may not be a real improvement there either. M3 Ultra and especially the M4 series may be a different story.)
I wouldn't say they aren't useful for inference (there are pretty clear performance improvements even in the Asahi effort you linked); it's just that you have to convert the model ahead of time to be compatible with the ANE, which is explained in the whisper.cpp README docs I linked above.
I would say though that this likely excludes them from being useful for training purposes.
Note that I was only commenting on modern quantized LLMs, which basically avoid formats like FP16 or INT8, preferring lower precision wherever feasible. When in-memory model values must be padded to FP16/INT8, this slashes your effective use of memory bandwidth, which is what determines token generation speed. So the only feasible benefit is really in the prompt pre-processing phase, and even then only in lower power use compared to the GPU, not in higher speed.
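To make the bandwidth point concrete, here is a rough back-of-the-envelope sketch (my own illustration with made-up example numbers; it ignores activations and the KV cache):

```python
# Single-batch decode is roughly bandwidth-bound: each generated token streams
# (approximately) all of the weights through memory once, so
#   tokens/sec ~= memory_bandwidth / bytes_read_per_token
def decode_tok_per_sec(params_billion, bits_per_weight, bandwidth_gb_s):
    bytes_per_token = params_billion * 1e9 * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token

# Hypothetical 8B model on a ~64 GB/s memory path:
print(decode_tok_per_sec(8, 4, 64))   # ~16 tok/s if the weights stay 4-bit
print(decode_tok_per_sec(8, 16, 64))  # ~4 tok/s if they get padded to FP16
```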
That's really interesting! I didn't know about the padding behavior here. Which models would this include? I know raw Gemma 3 is bf16 - are you just talking about the quantized versions of these, or are models being released purely as quantized versions these days? I know Google just released a QAT (Quantization Aware Training) version of Gemma 3 27B, but that base model was already released.
Models may be released as unquantized (and even then they are gradually shifting towards lower precisions over time), but most people are going to be running them in a quantized version simply because that gives you the best bang for your buck (you can fit more interesting models on the same hardware). Of course this is strictly about local LLM inference, though one may reasonably assume that the big players are also doing something similar.
My understanding is that model throughput is fundamentally limited at some point by the fact that the ANE is less wide than the GPU.
At that point, the ANE loses because you have to split the model into chunks and only one fits at a time.
What do you mean by less wide? The main bottleneck for transformers is memory bandwidth. ANE has a much lower ceiling than CPU/GPU (yes, despite unified memory).
Chunking is actually beneficial as long as all the chunks can fit into the ANE’s cache. It speeds up compilation for large network graphs and cached loads are negligible cost. On M1 the cache limit is 3-4GB, but it is higher on M2+.
I was referring to both the lower memory bandwidth and lower FLOPs. The GPU can just do… more at once? For now. Or is that changing?
I had also assumed that loading a chunk from the cache was not free because I’ve seen cache eviction on my M1, but it’s good to know that it’s no longer as big of a limitation.
also, I’m a big fan of your work! I played around with your ModernBERT CoreML port a bit ago
For single batch inference of anything remotely LLM you'll hit the memory bound way before FLOPs, so I haven't actually looked at FLOPs much. For raw performance GPU is certainly better. ANE is more energy efficient, but you need larger batches to really benefit.
Maybe cache is the wrong word. This is a limit to how much can be mmap'd for the ANE at once. It's not too hard to hit on M1 if your model is in the GB range. Chunking the model into smaller pieces makes it more likely to "fit", but if it doesn't fit you have to unmap/remap in each forward pass which will be noticeable.
Awesome to hear about ModernBERT! Big fan of your work as well :)
> coreml option which gives 3x speed up over cpu
Which is still painfully slow. CoreML is not a real ML platform.
.. who is running LLMs on CPU instead of GPU or TPU/NPU
Actually that's a really good question, I hadn't considered that the comparison here is just CPU vs using Metal (CPU+GPU).
To answer the question though - I think this would be used for cases where you are building an app that wants to utilize a small AI model while at the same time having the GPU free to do graphics related things, which I'm guessing is why Apple stuck these into their hardware in the first place.
Here is an interesting comparison between the two from a whisper.cpp thread - ignoring startup times - the CPU+ANE seems about on par with CPU+GPU: https://github.com/ggml-org/whisper.cpp/pull/566#issuecommen...
It essentially never makes sense to run on the CPU and you will only ever see enthusiasts doing it.
Yes, hammering the GPU too hard can affect the display server, but no, switching to the CPU is not a good alternative
Not switching to the CPU - switching to the ANE (Neural Cores) - if you read the research papers Apple has released - the example I gave is pretty much how it's being used - small image classification models running on the ANE, alongside a graphics app that needs the GPU to be free.
Oh, yes, I misread! It’s great for that
Depends on the size of the model and how much VRAM you have (and how long you're willing to wait).
Not all of us own GPUs worth using. Now, among people using macs... Maybe if you had a hardware failure?
One update from Apple Research since that one was "Deploying Attention-Based Vision Transformers to Apple Neural Engine" (though the relationship isn't clear - it doesn't build on ane_transformers, but maybe it's a sister project for vision?)
blog: https://machinelearning.apple.com/research/vision-transforme...
github: https://github.com/apple/ml-vision-transformers-ane
Not a public follow-up but the iOS 17 speech-to-text model has a clever approach to KV caching that works within the ANE’s constraints (fixed size inputs).
I wrote about it here[0] but the gist is you can have a fixed size cache and slide it in chunks with each inference. Not as efficient as a cache that grows by one each time of course.
[0]: https://stephenpanaro.com/blog/inside-apples-2023-transforme...
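To illustrate the idea (my own rough reconstruction in numpy, not the actual implementation from the post): since every ANE input shape is fixed, the cache is a fixed-length buffer that shifts by a whole chunk when it fills, instead of growing token by token.

```python
import numpy as np

CACHE_LEN, CHUNK, HEAD_DIM = 512, 64, 64   # all shapes fixed up front

k_cache = np.zeros((CACHE_LEN, HEAD_DIM), dtype=np.float16)
filled = 0

def append_chunk(new_k):
    """new_k: (CHUNK, HEAD_DIM) keys produced by one fixed-shape forward pass."""
    global filled
    if filled + CHUNK > CACHE_LEN:
        # Slide the whole cache left by one chunk, dropping the oldest entries.
        k_cache[:-CHUNK] = k_cache[CHUNK:]
        filled -= CHUNK
    k_cache[filled:filled + CHUNK] = new_k
    filled += CHUNK

append_chunk(np.random.randn(CHUNK, HEAD_DIM).astype(np.float16))
```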
Onnxruntime supports CoreML, though if my experience with converting an embedding model to CoreML using Apple's CoreML conversion tool is similar to the ORT maintainers', I can see why it would be unmaintained.
It took multiple tries to get the model to convert to the mlpackage format at all, and then a lot of experimenting to get it to run on the ANE instead of the GPU, only to discover that constant reshaping was killing any performance benefit (either you have a fixed multiplication size or don't bother). Even at a fixed size and using the attention mask, its operations were slower than saturating the GPU with large batches.
I discovered an issue where using the newer iOS 18 standard would cause the model conversion to break, and filed an issue on their GitHub, including an example repository for easy reproduction. I got a response quickly, but almost a year later the bug is still unfixed.
Even when George Hotz attempted to hack together a way to use it without Apple's really bad and unmaintained CoreML library, he gave up because it was impossible without breaking some pretty core OS features (certificate signing, IIRC).
Apple is just not serious about making the ANE/CoreML hardware usable at all. Even their internal MLX team can't crack that nut.
ONNX is horrible for anything that has variable input shapes, which is why nobody uses it for LLMs. It is fundamentally poorly designed for anything that doesn't take a fixed-size image.
ANE itself is also limited to fixed computation "shapes" so I'm not sure how much that would matter practically.
Based on the graphs, "up to 10 times faster" compares before/after flash attention.
This more than anything feels emblematic to me: Apple executives are brain dead when it comes to software, and AI is seemingly a step too far (in the software direction). While they could at some level grok classical software, neural networks are terra incognita that Apple executives cannot possibly follow into. It's just too strange and mysterious a world for them to effectively decide or execute on anything. I worked (20 years ago) in the industrial R&D lab of a mostly-hardware manufacturer for 3 years, and it looked to me like the worlds of hardware and software, the mindsets, diverged pretty quickly on all the important considerations of "what should happen next".
Huh? Apple ships heaps of deep learning-based products/features, and they’ve done so for years. But I’ll agree with you, they’re currently behind in generative AI.
This sorta reminds me of the lie that was pushed when the Snapdragon X laptops were released last year. Qualcomm implied the NPU would be used for LLMs, and I bought into the BS without looking into it. I still use a Snapdragon laptop as my daily driver (it's fine), but for running models locally it's still a joke. Despite Qualcomm's claims about running 13B-parameter models, software like LM Studio only runs on the CPU, with NPU support merely "planned for future updates" (per XDA). The NPU isn't even faster than the CPU for LLMs; it's just more power-efficient for small models, not the big ones people actually want to run. Their GPUs aren't much better for this purpose either. The only hope for LLMs is the Vulkan support on the Snapdragon X, which is still half-baked.
AFAIK Windows 11 does use the NPU to run Phi Silica language models and this is available to any app through some API. The models are quite small as you said though.
AnythingLLM uses NPU
Could you provide a pointer to docs for this? It wasn't obvious from an initial read of their docs.
You can find some details about it in the 1.7.2 changelog here: https://docs.anythingllm.com/changelog/v1.7.2
If you don’t mind me asking, what OS do you use on it?
I use Windows 11. Podman/WSL2 works way better than I thought it would. And when Git Bash was finally ported (officially), that filled other gaps I was missing in my workflow. Windows ARM Python is still lacking all sorts of stuff, but overall I'm pretty productive on it.
I pre-ordered the Snapdragon X Dev Kit from Qualcomm, but they ended up delivering only a few units, only to cancel the whole program. The whole thing turned out to be a hot-mess-express saga. THAT computer was going to be my Debian rig.
The key benefit is significantly lower power usage. I benchmarked llama3.2-1B on my machines: M1 Max (47 t/s, ~1.8 watts), M4 Pro (62 t/s, ~2.8 watts). The GPU is twice as fast (even faster on the Max), but draws much more power (~20 watts) than the ANE.
Also, the ANE models are limited to 512 tokens of context, so it's unlikely you'd use these in production yet.
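Plugging those figures in (parent's numbers, my arithmetic; GPU throughput approximated as 2x the ANE, per the comment above):

```python
# Energy per generated token, from the numbers quoted above (M4 Pro).
ane_tps, ane_watts = 62, 2.8        # ANE
gpu_tps, gpu_watts = 2 * 62, 20.0   # GPU, assuming "twice as fast"

print(1000 * ane_watts / ane_tps)   # ~45 mJ per token on the ANE
print(1000 * gpu_watts / gpu_tps)   # ~161 mJ per token on the GPU (~3.5x more)
```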
I always felt that the Neural Engine was wasted silicon; they could add more GPU cores in that die space and redirect the neural processing API to the GPU as needed. But I'm no expert, so if anyone here has a different opinion I'd love to learn from it.
I'm not an ML guy, but when I needed to train a NN I thought that my Mac's ANE would help. In fact, despite it being way easier to set up tensorflow + metal + M1 on a Mac than to set up tensorflow + cuda + nvidia on Linux, the neural engine cores are not used. Not even for classification, which is their main purpose. I wouldn't say they are wasted silicon, but they are way less useful than you'd expect.
Does Apple care about third party use of the ANE? There are many iOS/iPadOS features that use it.
Not really. Apple software uses the neural engine all over the place, but rarely do others. Maybe this will change [1]
There was a guy using it for live video transformations and it almost caused the phones to “melt”. [2]
[1] https://machinelearning.apple.com/research/neural-engine-tra...
[2] https://x.com/mattmireles/status/1916874296460456089
Eyeballing third-party annotated die shots [1], it's about the size of two GPU cores but achieves 15.8 TFLOPS, which is more than the reported 14.7 TFLOPS of the 32-core GPU in the binned M4 Max.
[1] https://vengineer.hatenablog.com/entry/2024/10/13/080000
Not really. That's 15.8 TFLOPS of fp16 compared to 14.7 TFLOPS of fp32 (which is actually useful outside AI). It would be interesting to see if you can configure the ANE to recover fp32 precision at lower throughput [1].
[1] https://arxiv.org/abs/2203.03341
Apple GPUs run fp16 at the same rate as fp32 except on phones, so it is comparable for ML. No one runs inference from fp32 weights.
But the point was about area efficiency.
I guess it's a hard choice, as it's 5x more energy-efficient than the GPU because it uses a systolic array.
For laptops, 2x GPU cores would make more sense; for phones/tablets, energy efficiency is everything.
You're completely right, if you already have a GPU in a system adding tensor cores to it gives you better performance per area.
GPU + dedicated AI HW is virtually always the wrong approach compared to GPU+ tensor cores
At least one link/benchmark I saw said the ANE can be 7x faster than the GPU (Metal / MPS):
https://discuss.pytorch.org/t/apple-neural-engine-ane-instea...
It seems intuitive that if they design hardware very specifically for these applications (beyond just fast matmuls on a GPU), they could squeeze out more performance.
Performance doesn't matter. Nothing is ever about performance.
It's about performance/power ratios.
I was trying to figure the same thing out a couple months ago, and didn't find much information.
It looked like even ANEMLL provides limited low-level access for directing processing specifically toward the Apple Neural Engine, because Core ML still acts as the orchestrator. Instead, flags during conversion of a PyTorch or TensorFlow model can specify ANE-optimized operations, quantization, and parameters hinting at compute targets or optimization strategies. For example, passing `compute_units=ct.ComputeUnit.CPU_AND_NE` to coremltools during conversion (or setting `MLModelConfiguration.computeUnits = .cpuAndNeuralEngine` when loading the model) disfavors the GPU cores.
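For reference, a minimal conversion sketch (toy placeholder model and shapes of my own choosing, not ANEMLL's actual pipeline; the compute-unit setting is a preference, not a guarantee of ANE placement):

```python
import torch
import coremltools as ct

# Placeholder model: any traceable PyTorch module with fixed input shapes.
model = torch.nn.Sequential(torch.nn.Linear(256, 256), torch.nn.GELU()).eval()
example = torch.randn(1, 256)
traced = torch.jit.trace(model, example)

mlmodel = ct.convert(
    traced,
    inputs=[ct.TensorType(name="x", shape=example.shape)],  # fixed shapes help ANE placement
    compute_precision=ct.precision.FLOAT16,                 # ANE computes in float16
    compute_units=ct.ComputeUnit.CPU_AND_NE,                # prefer CPU + Neural Engine
    minimum_deployment_target=ct.target.iOS17,
)
mlmodel.save("model.mlpackage")
```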
Anyway, I didn't actually experiment with this, but at the time I thought maybe there could be a strategy of creating a speculative execution framework, with a small ANE-compatible model to act as the draft model paired with a larger target model running on GPU cores. The idea being that the ANE's low latency and high efficiency could accelerate results.
However, I would be interested to hear the perspective of people who actually know something about the subject.
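For what it's worth, the draft-and-verify loop itself is simple. A rough sketch of greedy speculative decoding with placeholder callables (`draft_next` standing in for a small ANE-hosted model, `target_greedy` for the large GPU model; both hypothetical):

```python
def speculative_decode(prompt, draft_next, target_greedy, n_draft=4, max_tokens=256):
    """draft_next(tokens) -> next token id from the small draft model.
    target_greedy(tokens) -> the large model's greedy next-token prediction at
    every position of `tokens`, computed in one batched forward pass."""
    tokens = list(prompt)
    while len(tokens) < max_tokens:
        # 1. Cheaply draft a few tokens with the small model.
        draft = []
        for _ in range(n_draft):
            draft.append(draft_next(tokens + draft))
        # 2. Verify all of them with a single pass of the large model.
        verified = target_greedy(tokens + draft)
        accepted = 0
        for i, tok in enumerate(draft):
            if verified[len(tokens) + i - 1] == tok:
                accepted += 1
            else:
                break
        tokens += draft[:accepted]
        # 3. Always emit one token from the large model so progress is guaranteed.
        tokens.append(verified[len(tokens) - 1])
    return tokens
```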
If you did that, you'd stumble into the Apple GPU's lack of tensor acceleration hardware. For an Nvidia-like experience you'd have to re-architect the GPU to subsume the NPU's role, and if that were easy then everyone would have done it by now.
M1/M2 shared a GPU design, same with M3/M4. So maybe M5 will have a new design that includes tensor cores in the GPU.
The README lacks the most important thing: how many more tokens/sec at the same quantization, compared to llama.cpp / MLX? It is only worth switching default platforms if there is a major improvement.
In my testing, tokens per second is half the speed of the GPU; however, power usage is 10x less: 2 watts on the ANE vs 20 watts on the GPU on my M4 Pro.
I ran R1-8B for both anemll[0] and mlx[1][2] models on an M4 Max.
Prompt: "Tell me a long story about the origins of 42 being the answer."
anemll: 9.3 tok/sec, ~500MB of memory used.
mlx 8bit: 31.33 tok/sec, ~8.5GB of memory used.
mlx bf16: 27.17 tok/sec, ~15.7GB of memory used.
Memory results are from Activity Monitor across any potentially involved processes, but I feel like I might be missing something here...
[0] https://huggingface.co/anemll/anemll-DeepSeekR1-8B-ctx1024_0...
[1] https://huggingface.co/mlx-community/DeepSeek-R1-Distill-Lla...
[2] https://huggingface.co/mlx-community/DeepSeek-R1-Distill-Lla...
Thank you. Strange. If the memory numbers are accurate, it is probably so slow because layers are loaded from disk before each layer's inference, or something like that; otherwise it could not run inference on such a model in 500MB. But if that's what it is doing, then even reaching 33% of the speed would likely already be surprisingly fast.
What hardware are you on? Most models are memory-bandwidth limited. The ANE was limited to 64GB/s prior to the M3 Max and M4 Pro. If you are on an M1, the GPU will be significantly faster for 3-8B models due to memory bandwidth rather than ANE capabilities.
M4 Max with 128GB of memory.
Yeah, I looked all over for a comparison and couldn't find anything in the repo, on their social media, etc. I saw some other comments here that said it's supposed to be "15.8 fp16 ops compared to 14.7 fp32 ops" but that isn't really enough to go on. Maybe when I have the time I'll install their TestFlight app and do some comparisons myself.
I'm trying to figure out what the secret sauce for this is. It depends on https://github.com/apple/coremltools - is that the key trick or are there other important techniques going on here?
coremltools is the only way to run on ANE, so less of a trick and more of a requirement.
The tricks are more around optimizing for the hardware capabilities/constraints. For instance:
- conv2d is faster than linear (see Apple's post [0]) so you rewrite the model for that (example from the repo [1]; a rough sketch follows after the links below)
- inputs/outputs are static shapes, so KV cache requires some creativity (I wrote about that here [2])
- compute is float16 (not bfloat16) so occasionally you have to avoid activation overflows
[0]: https://machinelearning.apple.com/research/neural-engine-tra...
[1]: https://github.com/Anemll/Anemll/blob/4bfa0b08183a437e759798...
[2]: https://stephenpanaro.com/blog/kv-cache-for-neural-engine
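Re: the conv2d point, a minimal sketch of the layout trick (my own simplified illustration of the pattern, not code from the repo): a 1x1 Conv2d in (batch, channels, 1, seq) layout computes the same thing as a Linear in (batch, seq, channels) layout.

```python
import torch
import torch.nn as nn

B, S, D = 1, 128, 512
x_bsd = torch.randn(B, S, D)                    # usual transformer layout

linear = nn.Linear(D, D)
conv = nn.Conv2d(D, D, kernel_size=1)
with torch.no_grad():                           # reuse the same weights for comparison
    conv.weight.copy_(linear.weight[:, :, None, None])
    conv.bias.copy_(linear.bias)

x_bc1s = x_bsd.transpose(1, 2).unsqueeze(2)     # ANE-friendly (B, D, 1, S) layout
y_conv = conv(x_bc1s)
y_ref = linear(x_bsd).transpose(1, 2).unsqueeze(2)

print(torch.allclose(y_conv, y_ref, atol=1e-5)) # same math, different layout
```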
Apple is a competitive choice simply because their unified memory allows you to get enough ram that would take multiple Gpus to have enough space to run larger models.
Yes, but their refusal to open up the ANE to third-party models negates that. You can get (or will be able to very soon) a Strix Halo Ryzen AI Max+ 395 able to access 96GB of unified RAM (on a 128GB system) for well under half what you'd pay for an equivalent M4 system from Apple.
Hobbyists don't use ANE on Apple Silicon and they won't use XDNA on Strix Halo either. In both cases the GPU is faster and can access most of system memory.
$2,000 vs. $3,500 isn't well under half either.
Doesn't the M3 Ultra have 3-4x the RAM bandwidth though?
At 3-4x the price.
But Core ML utilizes the ANE, right? Is there some bottleneck in Core ML that requires lower-level access?
Memory bandwidth is the main bottleneck. It got better with M3/M4. ANE is really fast in FLOPS but low in memory bandwidth.
Is there a performance benefit for inference speed on M-series MacBooks, or is the primary task here simply to get inference working on other platforms (like iOS)? If there is a performance benefit, it would be great to see tokens/s of this vs. Ollama.
See my other comment for results.
mlx is much faster, but anemll appeared to use only 500MB of memory compared to the 8GB mlx used.
Man, Apple's tight grip on the ANE is kinda nuts - would love to see the day they let folks get real hands-on. Do you ever think companies hold stuff back just to keep control, or is there actually some big technical reason for it?
People keep saying this but I'm not seeing the big difference with other NPU varieties. Either way we're still talking about very experimental stuff that also tends to be hardwired towards some pre-determined use case. So I'm not surprised that people are running into problems while trying to make these more broadly useful.
True; everybody's NPU hardware is afflicted by awkward hardware and software constraints that don't come close to keeping pace with the rapidly-shifting interests of ML researchers.
To some degree, that's an unavoidable consequence of how long it takes to design and ship specialized hardware with a supporting software stack. By contrast, ML research is moving way faster because they hardly ever ship anything product-like; it's a good day when the installation instructions for some ML thing only include three steps that amount to "download more Python packages".
And the lack of cross-vendor standardization for APIs and model formats is also at least partly a consequence of various NPUs evolving from very different starting points and original use cases. For example, Intel's NPUs are derived from Movidius, so they were originally designed for computer vision, and it's not at all a surprise that making them do LLMs might be an uphill battle. AMD's NPU comes from Xilinx IP, so their software mess is entirely expected. Apple and Qualcomm NPUs presumably are still designed primarily to serve smartphone use cases, which didn't include LLMs until after today's chips were designed.
It'll be very interesting to see how this space matures over the next several years, and whether the niche of specialized low-power NPUs survives in PCs or if NVIDIA's approach of only using the GPU wins out. A lot of that depends on whether anybody comes up with a true killer app for local on-device AI.
> It'll be very interesting to see how this space matures over the next several years, and whether the niche of specialized low-power NPUs survives in PCs or if NVIDIA's approach of only using the GPU wins out.
GPUs are gaining their own kinds of specialized blocks, such as matrix/tensor compute units, or BVH acceleration for ray-tracing (which may or may not turn out to be useful for other stuff). So I'm not sure that there's any real distinction from that POV - a specialized low-power unit in an iGPU is going to be practically indistinguishable from an NPU, except that it will probably be easier to target from existing GPU APIs.
> a specialized low-power unit in an iGPU is going to be practically indistinguishable from an NPU, except that it will probably be easier to target from existing GPU APIs.
Possibly, depending on how low the power actually is. We can't really tell from NVIDIA's tensor cores, because waking up an NVIDIA discrete GPU at all has a higher power cost than running an NPU. Intel's iGPUs have matrix units, but I'm not sure if they can match their NPU on power or performance.
Actually, it's a good thing that it's Xilinx IP. The software is nasty to get working, but it is really reliable, because it's used in thousand- to ten-thousand-dollar boards. The cost of writing software for it is way too high, though.
I am curious whether anyone knows if the neural cores in Apple Silicon machines are at all useful for training. I've been using the MLX framework but haven't seen them mentioned anywhere, so I'm wondering if they are only useful for inference? I know whisper.cpp takes advantage of them in the inference context.
Edit: I changed llama.cpp to whisper.cpp - I didn’t realize that llama.cpp doesn’t have a coreml option like whisper.cpp does.
Well, the TensorFlow port to Metal was written by Apple, and it doesn't use the ANE. If even they have chosen to use only the GPU, the ANE probably wouldn't help in training. I've also heard that the ANE is way less powerful than Apple Silicon's GPU, but I don't have numbers.
Maybe a quick side shift - what the heck are Apple's neural cores good for? Used for? Use cases?
In my understanding, they are intended to be utilized by various apps (and Apple intelligence) for performing different machine learning tasks (inference), in a manner that is unobtrusive to the user and the rest of the system. While a GPU would theoretically be more performant, it could potentially lead to excessive energy consumption, temperature rise, fan noise, and other factors that may not be desirable when performing basic tasks like OCR when viewing an image within an application.
Currently, it is used for example through the Vision framework, e.g. for OCR tasks (for instance, when previewing an image in macOS, it performs OCR in the background using the ANE). Additionally, they are utilized by certain Apple Intelligence features that are executed locally (e.g. when I asked Writing Tools to rewrite this comment, I saw a spike in ANE usage).
They can also be used for diffusion image models (through Core ML; diffusers has a nice frontend for that), but my understanding is that they are primarily for "light" ML tasks within an application rather than for running larger models (that's also possible, but they are probably going to run slower than on the GPU).
They are definitely good for local inference as evidenced by the pretty amazing performance increase on Apple silicon when used with whisper.cpp, maybe other frameworks that utilize coreml? I think they’re sorta purpose built for doing matrix math.
They're used pretty extensively by stuff like the Photos app for various ML tasks. Not all AI is LLMs.
Yet Siri is still dumber than a doorknob…
What about Microsoft copilot AI notebooks? Would they run anything quick enough to be useful?
Even a 1GB model is prohibitively big for phones if you want mass adoption.
The 1B model works on iPhones[0].
See my other comments. anemll appears to use less memory.
[0] https://huggingface.co/anemll/anemll-llama-3.2-1B-iOSv2.0
We'll just have to await next week's oai-agi1-0.4b-a0.1b-iq1_xs.gguf
Getting anything to work on Apple proprietary junk is such a chore.
btw, don't bother trying to buy a bunch of Mac boxes to run LLMs in parallel because it won't be any faster than a single box.
Is everyone just waiting for the DGX Spark? Are they really going to ban local inference?
What do you mean, ban? The bandwidth between Macs isn't enough to do inference effectively.
I'm pretty sure you can network Macs together via the latest Thunderbolt standards and get pretty decent performance overall. Sure, it will be a bottleneck to some extent but it's still useful for many purposes.
Yes, you can do that and shard a very large model across the devices, but it's way too slow, so you get no performance gains beyond being able to run a much larger model at all.
That's a performance gain.