So SpinQuant learns a rotation for activations and weights that, to my understanding, "smear" the outlier weights out so you don't get extreme values in any one weight.
Random anecdote warning - In the old days, before vector search became AI and everyone and their dog offered a vector database, I had a task that required nearest neighbour search in a decent amount of high-dimensional vectors.
I tried quantizing them to bit vectors in an index and scanning through it to get an initial set of candidates.
Performance was actually quite decent - reading through RAM linearly is fast! But the selectivity wasn't great.
Somewhere along the way I found this paper[1] that iteratively finds a rotation to apply before quantization to reduce the quantization error. Very similar goal to SpinQuant, but focused on bit quantization only.
As it turns out the 'random rotation' baseline they benchmark against worked great for my use case, so I never tried implementing the fancier algorithm. But it's a pretty rare day at work that "apply a random rotation matrix to a 128-dimensional vector" is the solution to my problem.
I find the geometrical intuition of rotating a vector in high dimensional space to minimize its largest values (vector basis projections) beautiful.
I'm no expert and I'm sure this has been tried by many people already - but would it be possible to reduce the computational effort instead by using SVD decomposition, spreading the singular values and then reapplying the original singular values and recomposing the matrix using the quantized versions of the SVD matrices?
Tangentially related to the idea of "apply a random rotation matrix" is one where you apply a random matrix to a set of points to preserve distances between them but transform them into a lower dimensional space. This is captured by the JL Lemma [1].
It's pretty interesting that the new SpinQuant method did not manage to be better than good old nf4bit QLORA training (Tim Dettmers really cooked with that one).
Really appreciate that Meta published both results+model quants and didn't just make some bs claim about a new sota quant like most other bigger companies would've done.
I think meta and facebook before it have always valued a very high standard of engineering, and have also been generally pretty good about open sourcing a lot of that work in a way that allows a lot of people to work with their tools. This doesn’t seem all that out of character.
May I ask if anyone has successfully used 1B and 3B models in production and if yes, in what use cases? I seem to be failing even in seemingly simpler tasks such as word translation or zero-shot classification. For example, they seem to not care about instructions to only write a response and no explanation, thus making it impossible to use them in a pipeline :/
3B models are perfectly capable, I've had great luck with Phi 3.5.
> For example, they seem to not care about instructions to only write a response and no explanation
You need to use tools to force the model to adhere to a schema. Or you can learn to parse out the part of the response you want, both work.
You'll also need to make good use of robust examples in your initial prompt, and give lots of examples of how you want the output to look. (Yes this quickly burns up the limited context length!)
Finally, embrace the fact that these models are tuned for chat, so the more conversational you make the back and forth the less you are stretching the models abilities.
I wonder if CUE can help the situation in similar fashion to the DSL methods that you've described in your blog post [1]. After all CUE fundamentals are based on feature structure from the deterministic approach of NLP unlike LLM that's stochastic NLP [2],[3]. Perhaps deterministic and non-deterministic approaches is the potent combination that can effectively help reduce much of the footprint to get to the same results and being energy efficient in the process.
[1] Cue – A language for defining, generating, and validating data:
On my LinkedIn post about this topic someone actually replied with a superior method of steering LLM output compared to anything else I've ever heard of, so I've decided that until I find time to implement their method, I'm not going to worry about things.
tl;dr you put into the prompt all the JSON up until what you want the LLM to say, and you set the stop token to the end token of the current JSON item (so ',' or '}' ']', whatever) and you then your code fills out the rest of the JSON syntax up until another LLM generated value is needed.
I hope that makes sense.
It is super cool, and I am pretty sure there is a way to make a generator that takes in an arbitrary JSON schema and builds a state machine to do the above.
The performance should be super fast on locally hosted models that are using context caching.
Eh I should write this up as a blog post, hope someone else implements it, and if not, just do it myself.
For context, I was playing with a script to bulk download podcasts, transcribe with whisper, pass the transcription to llama.cpp to ID ads, then slice the ads out with ffmpeg. I started with the generic json_array example grammar, then iteratively tweaked it.
For me, it was almost random if I would get a little spiel at the beginning of my response - even on the unquantized 8b instruct. Since ollama doesn’t support grammars, I was trying to get it to work where I had a prompt that summarized an article and extracted and classified certain information that I requested. Then I had another prompt that would digest the summary and spit out a structured JSON output. It was much better than trying to do it in one prompt, but still far too random even with temperature at 0. Sometimes the first prompt misclassified things. Sometimes the second prompt would include a “here’s your structured output”.
Not in production, but I've used a 3B model to test a local LLM application I'm working on. I needed a full end-to-end request/response and it's a lot faster asking a 3B model than an 8B model. I could setup a test harness and replay the responses... but this was a lot simpler.
> For example, they seem to not care about instructions to only write a response and no explanation, thus making it impossible to use them in a pipeline
I was doing some local tidying up of recording transcripts, using a fairly long system prompt, and I saw the same behaviour you mention if the transcript I was passing in was too long -- batching it up to make sure to be under the max length prevented this.
Might not be what's happening in your case, but I mention it because it wasn't immediately obvious to me when I first saw the behaviour.
Hi I'm Mark I work on torchao which was used for the quantization aware training and ARM kernels in this blog. If you have any questions about quantization or performance more generally feel free to let me know!
What was the "vanilla post-training quantization" used for comparison? There are 22 GGUF quantization variants smaller than 16 bits per weight and I can't tell which one is being compared with:
So this should be referring to w8a8 (weights and activations in 8 bit)
So this is gonna be 8 bit weights, 8 bit activations, group size of 256, symmetric quantization. Not sure how to map this to the GGUF variants because they don't mention how they don't do activation quantization
These quantized models show much less degradation compared to a "vanilla post-training-quantization" but there are a bunch of PTQ schemes that people have already applied to Llama models [1]. I didn't see any details about the vanilla PTQ they used as a baseline. Has it been written about elsewhere?
Oh cool! I’ve been playing with quantized llama 3B for the last week. (4-bit spinquant). The code for spinquant has been public for a bit.
It’s pretty adept at most natural language tasks (“summarize this”) and performance on iPhone is usable. It’s even decent at tool once you get the chat template right.
But it struggles with json and html syntax (correctly escaping characters), and isn’t great at planning, which makes it a bad fit for most agenetic uses.
My plan was to let llama communicate with more advanced AI’s, using natural language to offload tool use to them, but very quickly llama goes rogue and starts doing things you didn’t ask it to, like trying to delete data.
Still - the progress Meta has made here is incredible and it seems we’ll have capable on-device agents in the next generation or two.
MLC Chat is a great iPhone app for running models (it's on Android too) and currently ships with Llama 3.2 3B Instruct - not the version Meta released today, its a quantized version of their previous release.
I wouldn't be surprised to see it add the new ones shortly, it's quite actively maintained.
This was just recently open sourced and is pretty nice. Only issue I've had is very minor UI stuff (on Android, sounds like it runs better on iOS from skimming comments)
I'm on Android, however my somewhat elaborate solution was to install Ollama on my home laptop computer and then ssh in when I want to query a model. I figured that'd be better for my phone battery. Since my home computer is behind NAT I run yggdrasil on everything so I can access my AI on the go.
Does anyone know why the most common method to speed up inference time is quantization? I keep hearing about all sorts of new methods but nearly none of them is implemented in practice (except for flash attention).
It's particularly useful in memory bound workflows like batch size = 1 LLM inference where you're bottlenecked by how quickly you can send weights to your GPU. This is why at least in torchao we strongly recommend people try out int4 quantization.
At larger batch sizes you become compute bound so quantization matters less and you have to rely on hardware support to accelerate smaller dtypes like fp8
Because the way LLMs work is more-or-less "for every token, read the entire matrix from memory and do math on it". Math is fast, so if you manage to use only half the bits to store each item in the matrix, you only have to do half as much work. Of course, sometimes those least-significant-bits were relied-upon in the original training.
These undergo additional fine tuning (QLoRA) using some or all of the original dataset, so they're able to get the weights to align to the nf4 dtype better, which increases the accuracy.
TLDR: Quantized versions of Llama 3.2 1B and 3B models with "competitive accuracy" to the original versions (meaning some degraded performance; plots included in the release notes).
I don’t get the comment. For one I’m excited for developments in the field. Not afraid it will “replace me” as technology has replaced me multiple times over. I’m looking towards working with these models more and more.
No, I meant that a lot of us are working very fast on a pre-launch product, implementing some cutting edge ideas using e.g. the incredible speedup in a small fast inference model like quantized 3B in combination with other tools, and I think there's quite a bit of paranoia out there that someone else will beat you to market. And so not a lot of sharing going on in the comments. At least not as much as previously, and not as much technical discussion vs other non-AI threads on HN.
I’m focused on making models play nice with each other rather than building a feature that relies on it. That’s where I see the more relevant work being. Why such news are exciting!
What kind of fundamental discussion are you hoping to see under an article about an iterative improvement to a known model?
"AI will destroy the world"? "AI is great and will save humanity"? If you're seriously missing that, there's really enough platforms (and articles for more fundamental announcements/propositions on this one) where you can have these.
I wouldn't be so haughty and presumptive of your understanding of things is as they are: this doesn't have practical applications.
No one serious is going to build on some horror of Python interpreter running inside your app to run an LLM when llama.cpp is right there, with more quants available. In practice, on mobile, you run out of RAM headroom way more quickly than CPU headroom. You've been able to run llama.cpp 3B models for almost a year now on iOS, whereas here, they're just starting to be able to. (allocating 6 GB is a quick way to get autokill'd on iOS...2.5GB? Doable)
It looks like spinquant is effectively Q8, in widespread blind testing over months, empirically, we found Q5 is assuredly indistinguishable from the base model.
(edit: just saw your comment. oy. best of luck! generally, I don't bother with these sorts of 'lived experience' details, because no one wants to hear they don't get it, and most LLM comments on HN are from ppl who don't have the same luck as to work on it fulltime. so you're either stuck aggressively asserting you're right in practice and they don't know what you're talking about, or, you're stuck being talked down to about things you've seen, even if they don't match a first-pass based on theory) https://news.ycombinator.com/item?id=41939841
I mean, this outcome of LLMs is expected and the frequency of LLM drops are too fast, and definitely too fast to wait for Meta to do an annual conference with a ton of hype, and furthermore these things are just prerequisites for a massive lemming rush of altering these models for the real fun, which occurs in other communities
So SpinQuant learns a rotation for activations and weights that, to my understanding, "smear" the outlier weights out so you don't get extreme values in any one weight.
Random anecdote warning - In the old days, before vector search became AI and everyone and their dog offered a vector database, I had a task that required nearest neighbour search in a decent amount of high-dimensional vectors.
I tried quantizing them to bit vectors in an index and scanning through it to get an initial set of candidates. Performance was actually quite decent - reading through RAM linearly is fast! But the selectivity wasn't great.
Somewhere along the way I found this paper[1] that iteratively finds a rotation to apply before quantization to reduce the quantization error. Very similar goal to SpinQuant, but focused on bit quantization only.
As it turns out the 'random rotation' baseline they benchmark against worked great for my use case, so I never tried implementing the fancier algorithm. But it's a pretty rare day at work that "apply a random rotation matrix to a 128-dimensional vector" is the solution to my problem.
[1] https://ieeexplore.ieee.org/abstract/document/6296665 / https://slazebni.cs.illinois.edu/publications/ITQ.pdf
I find the geometrical intuition of rotating a vector in high dimensional space to minimize its largest values (vector basis projections) beautiful.
I'm no expert and I'm sure this has been tried by many people already - but would it be possible to reduce the computational effort instead by using SVD decomposition, spreading the singular values and then reapplying the original singular values and recomposing the matrix using the quantized versions of the SVD matrices?
Tangentially related to the idea of "apply a random rotation matrix" is one where you apply a random matrix to a set of points to preserve distances between them but transform them into a lower dimensional space. This is captured by the JL Lemma [1].
[1] - https://en.wikipedia.org/wiki/Johnson%E2%80%93Lindenstrauss_...
It's pretty interesting that the new SpinQuant method did not manage to be better than good old nf4bit QLORA training (Tim Dettmers really cooked with that one).
Really appreciate that Meta published both results+model quants and didn't just make some bs claim about a new sota quant like most other bigger companies would've done.
It’s a little bizarre that I feel like I’m actually starting to respect this little bit of Meta…
I think meta and facebook before it have always valued a very high standard of engineering, and have also been generally pretty good about open sourcing a lot of that work in a way that allows a lot of people to work with their tools. This doesn’t seem all that out of character.
May I ask if anyone has successfully used 1B and 3B models in production and if yes, in what use cases? I seem to be failing even in seemingly simpler tasks such as word translation or zero-shot classification. For example, they seem to not care about instructions to only write a response and no explanation, thus making it impossible to use them in a pipeline :/
3B models are perfectly capable, I've had great luck with Phi 3.5.
> For example, they seem to not care about instructions to only write a response and no explanation
You need to use tools to force the model to adhere to a schema. Or you can learn to parse out the part of the response you want, both work.
You'll also need to make good use of robust examples in your initial prompt, and give lots of examples of how you want the output to look. (Yes this quickly burns up the limited context length!)
Finally, embrace the fact that these models are tuned for chat, so the more conversational you make the back and forth the less you are stretching the models abilities.
I wrote a very small blog post at https://meanderingthoughts.hashnode.dev/unlock-the-full-pote... explaining some of this.
I wonder if CUE can help the situation in similar fashion to the DSL methods that you've described in your blog post [1]. After all CUE fundamentals are based on feature structure from the deterministic approach of NLP unlike LLM that's stochastic NLP [2],[3]. Perhaps deterministic and non-deterministic approaches is the potent combination that can effectively help reduce much of the footprint to get to the same results and being energy efficient in the process.
[1] Cue – A language for defining, generating, and validating data:
https://news.ycombinator.com/item?id=20847943
[2] Feature structure:
https://en.m.wikipedia.org/wiki/Feature_structure
[3] The Logic of CUE:
https://cuelang.org/docs/concept/the-logic-of-cue/
On my LinkedIn post about this topic someone actually replied with a superior method of steering LLM output compared to anything else I've ever heard of, so I've decided that until I find time to implement their method, I'm not going to worry about things.
tl;dr you put into the prompt all the JSON up until what you want the LLM to say, and you set the stop token to the end token of the current JSON item (so ',' or '}' ']', whatever) and you then your code fills out the rest of the JSON syntax up until another LLM generated value is needed.
I hope that makes sense.
It is super cool, and I am pretty sure there is a way to make a generator that takes in an arbitrary JSON schema and builds a state machine to do the above.
The performance should be super fast on locally hosted models that are using context caching.
Eh I should write this up as a blog post, hope someone else implements it, and if not, just do it myself.
There are many solutions for constrained/structured generation with LLMs these days, here is a blog post my employer published about this a while back: https://monadical.com/posts/how-to-make-llms-speak-your-lang...
I'm partial to Outlines lately, but they all have various upsides and downsides.
OpenAI even natively added support for this on their platform recently: https://openai.com/index/introducing-structured-outputs-in-t...
I’ve only toyed with them a bit, and had a similar experience - but did find I got better output by forcing them to adhere to a fixed grammar: https://github.com/ggerganov/llama.cpp/tree/master/grammars
For context, I was playing with a script to bulk download podcasts, transcribe with whisper, pass the transcription to llama.cpp to ID ads, then slice the ads out with ffmpeg. I started with the generic json_array example grammar, then iteratively tweaked it.
For me, it was almost random if I would get a little spiel at the beginning of my response - even on the unquantized 8b instruct. Since ollama doesn’t support grammars, I was trying to get it to work where I had a prompt that summarized an article and extracted and classified certain information that I requested. Then I had another prompt that would digest the summary and spit out a structured JSON output. It was much better than trying to do it in one prompt, but still far too random even with temperature at 0. Sometimes the first prompt misclassified things. Sometimes the second prompt would include a “here’s your structured output”.
And Claude did everything perfectly ;)
Yes, I've used the v3.2 3B-Instruct model in a Slack app. Specifically using vLLM, with a template: https://github.com/vllm-project/vllm/blob/main/examples/tool...
Works as expected if you provide a few system prompts with context.
Not in production, but I've used a 3B model to test a local LLM application I'm working on. I needed a full end-to-end request/response and it's a lot faster asking a 3B model than an 8B model. I could setup a test harness and replay the responses... but this was a lot simpler.
If for testing then why not just mock the whole thing for ultimate performance ... ?
Probably faster to use off the shelf model with llama.cpp than to mock it
> For example, they seem to not care about instructions to only write a response and no explanation, thus making it impossible to use them in a pipeline
I was doing some local tidying up of recording transcripts, using a fairly long system prompt, and I saw the same behaviour you mention if the transcript I was passing in was too long -- batching it up to make sure to be under the max length prevented this.
Might not be what's happening in your case, but I mention it because it wasn't immediately obvious to me when I first saw the behaviour.
Qwen2.5 3b is very very good.
Hi I'm Mark I work on torchao which was used for the quantization aware training and ARM kernels in this blog. If you have any questions about quantization or performance more generally feel free to let me know!
What was the "vanilla post-training quantization" used for comparison? There are 22 GGUF quantization variants smaller than 16 bits per weight and I can't tell which one is being compared with:
https://huggingface.co/docs/hub/en/gguf#quantization-types
It might even mean a non-GGUF quantization scheme; I'm just an intermediate user of local models, not an expert user or developer.
So this should be referring to w8a8 (weights and activations in 8 bit)
So this is gonna be 8 bit weights, 8 bit activations, group size of 256, symmetric quantization. Not sure how to map this to the GGUF variants because they don't mention how they don't do activation quantization
These quantized models show much less degradation compared to a "vanilla post-training-quantization" but there are a bunch of PTQ schemes that people have already applied to Llama models [1]. I didn't see any details about the vanilla PTQ they used as a baseline. Has it been written about elsewhere?
[1] https://ollama.com/library/llama3.2/tags
Oh cool! I’ve been playing with quantized llama 3B for the last week. (4-bit spinquant). The code for spinquant has been public for a bit.
It’s pretty adept at most natural language tasks (“summarize this”) and performance on iPhone is usable. It’s even decent at tool once you get the chat template right.
But it struggles with json and html syntax (correctly escaping characters), and isn’t great at planning, which makes it a bad fit for most agenetic uses.
My plan was to let llama communicate with more advanced AI’s, using natural language to offload tool use to them, but very quickly llama goes rogue and starts doing things you didn’t ask it to, like trying to delete data.
Still - the progress Meta has made here is incredible and it seems we’ll have capable on-device agents in the next generation or two.
Any pointers no how to finetune this on my dataset and package and run it in my swift ios app?
Anyone know a nice iOS app to run these locally?
MLC Chat is a great iPhone app for running models (it's on Android too) and currently ships with Llama 3.2 3B Instruct - not the version Meta released today, its a quantized version of their previous release.
I wouldn't be surprised to see it add the new ones shortly, it's quite actively maintained.
https://apps.apple.com/us/app/mlc-chat/id6448482937
I access them by running the models in Ollama (on my own hardware), and then using my app Chaz[1] to access it through my normal Matrix client.
[1] - https://github.com/arcuru/chaz
https://github.com/a-ghorbani/pocketpal-ai
This was just recently open sourced and is pretty nice. Only issue I've had is very minor UI stuff (on Android, sounds like it runs better on iOS from skimming comments)
I'm on Android, however my somewhat elaborate solution was to install Ollama on my home laptop computer and then ssh in when I want to query a model. I figured that'd be better for my phone battery. Since my home computer is behind NAT I run yggdrasil on everything so I can access my AI on the go.
I've been using PocketGPT.
Does anyone know why the most common method to speed up inference time is quantization? I keep hearing about all sorts of new methods but nearly none of them is implemented in practice (except for flash attention).
It's particularly useful in memory bound workflows like batch size = 1 LLM inference where you're bottlenecked by how quickly you can send weights to your GPU. This is why at least in torchao we strongly recommend people try out int4 quantization.
At larger batch sizes you become compute bound so quantization matters less and you have to rely on hardware support to accelerate smaller dtypes like fp8
Because the way LLMs work is more-or-less "for every token, read the entire matrix from memory and do math on it". Math is fast, so if you manage to use only half the bits to store each item in the matrix, you only have to do half as much work. Of course, sometimes those least-significant-bits were relied-upon in the original training.
How do they compare to their original quants on ollama like q4_K_S?
These undergo additional fine tuning (QLoRA) using some or all of the original dataset, so they're able to get the weights to align to the nf4 dtype better, which increases the accuracy.
TLDR: Quantized versions of Llama 3.2 1B and 3B models with "competitive accuracy" to the original versions (meaning some degraded performance; plots included in the release notes).
Quantization schemes include post-training quantization (PTQ), SpinQuant, and QLoRA.
Funny how quiet the HN comments on big AI news with serious practical applications, like this, are getting. ;-)
Two days ago there was a pretty big discussion on this topic:
I don’t get the comment. For one I’m excited for developments in the field. Not afraid it will “replace me” as technology has replaced me multiple times over. I’m looking towards working with these models more and more.
No, I meant that a lot of us are working very fast on a pre-launch product, implementing some cutting edge ideas using e.g. the incredible speedup in a small fast inference model like quantized 3B in combination with other tools, and I think there's quite a bit of paranoia out there that someone else will beat you to market. And so not a lot of sharing going on in the comments. At least not as much as previously, and not as much technical discussion vs other non-AI threads on HN.
Ok, thank you for pointing that out.
I’m focused on making models play nice with each other rather than building a feature that relies on it. That’s where I see the more relevant work being. Why such news are exciting!
This thread attracts a smaller audience than, say, a new version of ChatGPT.
What kind of fundamental discussion are you hoping to see under an article about an iterative improvement to a known model?
"AI will destroy the world"? "AI is great and will save humanity"? If you're seriously missing that, there's really enough platforms (and articles for more fundamental announcements/propositions on this one) where you can have these.
Aren't we all just tired of arguing the same points?
I wouldn't be so haughty and presumptive of your understanding of things is as they are: this doesn't have practical applications.
No one serious is going to build on some horror of Python interpreter running inside your app to run an LLM when llama.cpp is right there, with more quants available. In practice, on mobile, you run out of RAM headroom way more quickly than CPU headroom. You've been able to run llama.cpp 3B models for almost a year now on iOS, whereas here, they're just starting to be able to. (allocating 6 GB is a quick way to get autokill'd on iOS...2.5GB? Doable)
It looks like spinquant is effectively Q8, in widespread blind testing over months, empirically, we found Q5 is assuredly indistinguishable from the base model.
(edit: just saw your comment. oy. best of luck! generally, I don't bother with these sorts of 'lived experience' details, because no one wants to hear they don't get it, and most LLM comments on HN are from ppl who don't have the same luck as to work on it fulltime. so you're either stuck aggressively asserting you're right in practice and they don't know what you're talking about, or, you're stuck being talked down to about things you've seen, even if they don't match a first-pass based on theory) https://news.ycombinator.com/item?id=41939841
A sign of the ongoing commoditization?
I mean, this outcome of LLMs is expected and the frequency of LLM drops are too fast, and definitely too fast to wait for Meta to do an annual conference with a ton of hype, and furthermore these things are just prerequisites for a massive lemming rush of altering these models for the real fun, which occurs in other communities