One of the things I often wonder is what the "minimally viable LLM" will be: one that can work from just enough information that, if it googles the rest, it can provide reasonable answers. I'm surprised something like Encyclopedia Britannica hasn't yet (afaik) tried to capitalize on AI by selling their data to LLM companies and validating outputs for them; I would think it would make a night and day difference in some areas. Wikipedia is nice, but there's so much room for human error and bias there.
Your worry about Wikipedia is that there is "much room for human error and bias", yet earlier you seem to imply that an LLM that has access to the www somehow would have less human error and bias? Personally, I'd see it the other way around.
It's not so much a "minimally viable LLM" but rather an LLM that knows natural language well but knows nothing else. Like me - as an engineer who knows how to troubleshoot in general but doesn't know about a specific device like my furnace (recent example).
And I don't think that LLM could just Google or check Wikipedia.
But I do agree that this architecture makes a lot of sense. I assume it will become the norm to use such edge LLMs.
Correct! I know RAG is a thing, but I wish we could have "DLCs" for LLMs, like how image generation has LoRAs, which are cheaper to train than retraining the entire model and steer the output toward what you want. I would love to pop in the CS "LoRA or DLC" and ask it about functional programming in Elixir, or whatever.
Maybe not crawl the web, but hit a service with pre-hosted, pre-curated content it can digest (and cache) that doesn't change very often. You aren't necessarily using it for the latest news; programming, which is mostly static knowledge, is a good example.
How? They can validate thousands if not millions of queries, but nothing prevents the million-and-first from being a hallucination. People who pay extra for an "Encyclopedia Britannica validated LLM" would then, rightfully so IMHO, complain that "it" suggested they cook with a dangerous mushroom.
Isn’t that sort of what RAG is? You’d need an LLM “smart” enough to turn natural-language user prompts into searches, then some kind of search, then an LLM “smart” enough to summarize the results.
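To make those three stages concrete, here is a toy sketch. Everything in it is an assumption for illustration: `DOCS`, `to_query`, `search` and `answer` are hypothetical names, and the two "LLM" steps are replaced by trivial keyword handling so the thing runs without any model.

```python
# Toy corpus standing in for pre-curated content (hypothetical data).
DOCS = {
    "go-hello": "Hello World in Go uses package main and fmt.Println.",
    "elixir-fp": "Elixir is a functional language with immutable data and pattern matching.",
    "water-cycle": "The water cycle covers evaporation, condensation and precipitation.",
}

def normalize(text):
    """Lowercase words with surrounding punctuation stripped."""
    return {w.strip("?.,!").lower() for w in text.split()}

def to_query(prompt):
    """Stand-in for the first LLM: reduce a prompt to its longer keywords."""
    return {w for w in normalize(prompt) if len(w) > 3}

def search(keywords, docs, top_k=1):
    """Rank document ids by how many query keywords each document contains."""
    ranked = sorted(docs, key=lambda k: len(keywords & normalize(docs[k])), reverse=True)
    return ranked[:top_k]

def answer(prompt, docs):
    """Stand-in for the second LLM: just return the retrieved text."""
    hits = search(to_query(prompt), docs)
    return " ".join(docs[h] for h in hits)

print(answer("How does pattern matching work in Elixir?", DOCS))
```

A real setup would swap the keyword overlap for an embedding or full-text index and the last step for an actual summarizing model; the shape of the pipeline is the only thing this is meant to show.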
Yeah, I think RAG is the idea that will lead us there, though it's a little complicated, because for some subjects, say Computer Science, you need a little more than just "This is Hello World in Go": you might need to understand not just Go syntax on the fly, but also CS nuances that are not covered in one single simple document. The idea is to have a model that runs fully locally on a phone or laptop with minimal resources. On the other hand, I can also see smaller models talking to larger models that are cheaper to run in the cloud. I am wondering if this is the approach Apple might take with Siri, specifically in order to retain user privacy as much as possible.
Wikipedia has proven to be as accurate as encyclopedias for decades now. Also, I'm betting AI companies have illegally trained their models on the Encyclopedia Britannica's data by now.
It's good to see this getting some continued development. I looked into it last year[1] and I thought it showed a lot of promise so I've been very disappointed that I never saw a newer model.
I think this approach is not so interesting because it's just quantization of a full precision model. So it speeds up inference (at a quality penalty) but not training. It would be more interesting to train an actually binary model directly, without any floating point multiplication, like in this paper: https://proceedings.neurips.cc/paper_files/paper/2024/hash/7...
I think they used a dummy model or else they would have linked to it. Just google '1-bit 100b model' and you'll only see references to this project without any download links.
The output from this model is horrible! It's GPT-2 level babble and repeats entire paragraphs verbatim. It also reuses the same fake citation `(Jenkins, 2010)` over and over again. From the start of their video (which scrolls by fast enough that you don't see the slop clearly...)
```
Ecosystem Services and their impact on the Ecosystem
Ecosystem services refer to the services provided by ecosystems to the human society. These services include water, air, energy, nutrients, and soil (Jenkins, 2010). For instance, water is the most important service provided by an ecosystem and it helps in the conservation of water, irrigation and sanitation (Jenkins, 2010). On the other hand, air provides the oxygen needed for life.
The water cycle is a significant ecosystem service because it involves the cycling of water among the different parts of an ecosystem. It also involves the movement of water through the atmosphere, from one place to another. It is also the process of evaporation and condensation of water from the atmosphere. It also involves the movement of water from the air to the soil and water into the oceans.
The water cycle is a significant ecosystem service because it involves the cycling of water among the different parts of an ecosystem. It also involves the movement of water through the atmosphere, from one place to another. It is also the process of evaporation and condensation of water from the atmosphere. It also involves the movement of water from the air to the soil and water into the oceans.
```
I'm curious if 1-bit params can be compared to 4- or 8-bit params. I imagine that 100B is equivalent to something like a 30B model? I guess only evals can say. Still, being able to run a 30B model at good speed on a CPU would be amazing.
At some point you hit information limits. With conventional quantisation you see marked capability fall-off below q5. All else being equal you'd expect an N-parameter 5-bit quant to be roughly comparable to a 3N-parameter ternary, if they are trained to the same level, just in terms of the amount of information they can possibly hold. So yes, 100B ternary would be within the ballpark of a 30B q5 conventional model, with a lot of hand-waving and sufficiently-smart-training
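The hand-waving above can at least be checked as arithmetic. This sketch only counts raw bits (it says nothing about how well either model is trained), and the function name `capacity_bits` is just an illustrative choice:

```python
import math

BITS_PER_TRIT = math.log2(3)  # about 1.585 bits of information per ternary weight

def capacity_bits(params, bits_per_param):
    """Upper bound on the raw information a set of weights can hold, in bits."""
    return params * bits_per_param

ternary_100b = capacity_bits(100e9, BITS_PER_TRIT)
q5_30b = capacity_bits(30e9, 5.0)

print(f"100B ternary: {ternary_100b / 8e9:.1f} GB of raw capacity")
print(f"30B q5:       {q5_30b / 8e9:.1f} GB of raw capacity")
print(f"ratio:        {ternary_100b / q5_30b:.2f}")
```

By this crude measure the two land within roughly 6% of each other, which is where the "100B ternary is in the ballpark of 30B q5" guess comes from.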
I assume that theoretically, 1-bit models could be most efficient because modern models switched from 32 bit to 16 bit to 8 bit per parameter (without quantization).
If they had a big result like, native 1.58-bit quality clearly matches top peers, they would be saying that prominently in the repo.
The engineering/optimization work is nice, but it's not what people have been waiting for, which is an answer to whether the BitNet idea that seemed so promising can really deliver in a competitive way.
The project is an inference framework which should support a 100B-parameter model at 5-7 tok/s on CPU. No one has quantized a 100B-parameter model to 1 trit, but the existence of this framework is an incentive for someone to do so.
Misleading title, but this is pretty exciting. Interesting how this is based on llama.cpp. It's nice to see some momentum since they released the paper in 2023.
There are q2 and q1 quants, if you want an idea of how much performance you'd drop. Not quite the same implementation-wise, but probably equivalent in terms of smarts.
The results would probably be underwhelming. The bitnet paper doesn't give great baselines to compare to, but in their tests a 2B network trained for 1.58bits using their architecture was better than Llama 3 8B quantized to 1.58bits. Though that 2B network was about on par with a 1.5B qwen2.5.
If you have an existing network, making an int4 quant is the better tradeoff. 1.58b quants only become interesting when you train the model specifically for it
On the other hand maybe it works much better than expected because llama3 is just a terrible baseline
I think the README [1] for the new CPU feature is of more interest, showing linear speedups with number of threads. Up to 73 tokens/sec with 8 threads (64 toks/s for their recommended Q6 quant):
It might interest you to know that one or two months ago, I had Claude port BitNet to WebGPU from the reference implementation, so that it runs right in your browser as a local model. After some debugging, the port seemed to work, but the model didn't function as well as the reference implementation, so I'll have to work on it for a while. You can see a debugging session livestreamed here[1]. The released model file was about a gigabyte, so it fits in most people's GPUs. We were also able to successfully fine-tune it right in the browser.
There's a lot that you can do when the model size is that small, yet still powerful.
Our next step is that we want to put up a content distribution network for it where people can also share their diffs for their own fine-tuned model. I'll post the project if we finish all the parts.
That's amazing. I'm developing sub-tools for LLM as a hobby on an RTX3050 (4GB), but I can only run lightweight models like 1B and 2B. Is it possible to use your tool to make the CPU take over some of the VRAM movement?
The title is misleading — there's no trained 100B model, just an inference framework that claims to handle one. But the engineering is worth paying attention to. I run quantized 70B models locally (M2 Max 96GB, llama.cpp + LiteLLM), and memory bandwidth is always the bottleneck. The 1.58-bit approach is interesting because ternary weights turn matmuls into additions — a fundamentally different compute profile on commodity CPUs. If 5-7 tok/s on a single CPU for 100B-class models is reproducible, that's a real milestone for on-device inference. Framework is ready. Now we need someone to actually train the model.
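For anyone wondering what "matmuls into additions" means in practice, here is a minimal sketch, assuming weights restricted to -1, 0 and 1. The function name `ternary_matvec` is hypothetical and this is plain scalar Python, not how bitnet.cpp's kernels are written:

```python
def ternary_matvec(weights, x):
    """Multiply a ternary matrix (entries in {-1, 0, 1}) by a vector x
    using only additions and subtractions, never a multiply."""
    out = []
    for row in weights:
        acc = 0.0
        for w, xi in zip(row, x):
            if w == 1:
                acc += xi
            elif w == -1:
                acc -= xi
            # w == 0 contributes nothing, so the element is skipped
        out.append(acc)
    return out

W = [[1, 0, -1],
     [-1, 1, 1]]
x = [0.5, 2.0, -1.5]
y = ternary_matvec(W, x)
print(y)
```

Each weight only selects "add", "subtract" or "skip", which is why the activations never need to be multiplied by anything.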
> Framework is ready. Now we need someone to actually train the model.
If Microslop aren't gonna train the model themselves to prove their own thesis, why would others? They've had 2 years (I think?) to prove BitNet in at least some way, are you really saying they haven't tried so far?
Personally that makes it slightly worrisome to just take what they say at face value, why wouldn't they train and publish a model themselves if this actually led to worthwhile results?
Because this is Microsoft: experimenting and failing is not encouraged, taking less risky bets and getting promoted is. Also, no customer asked them for a 1-bit model, hence the PM didn't prioritize it.
But it doesn't mean, idea is worthless.
You could have said same about Transformers, Google released it, but didn't move forward, turns out it was a great idea.
> You could have said same about Transformers, Google released it, but didn't move forward,
I don't think you can. Google looked at the research results and continued researching Transformers and related technologies, because they saw the value in it, particularly for translation. What direction to take is part of the original paper. Give it a read, it's relatively approachable for a machine learning paper :)
Sure, it took OpenAI to make it into an "assistant" that answered questions, but it's not like Google was completely sleeping on the Transformer, they just had other research directions to go into first.
> But it doesn't mean, idea is worthless.
I agree, they aren't, hope that wasn't what my message read as :) But, ideas that don't actually pan out in reality are slightly less useful than ideas that do pan out once put to practice. Root commentator seems to try to say "This is a great idea, it's all ready, only missing piece is for someone to do the training and it'll pan out!" which I'm a bit skeptical about, since it's been two years since they introduced the idea.
Google had been working on a big LLM but they wanted to resolve all the safety concerns before releasing it. It was only when OpenAI went "YOLO! Check this out!" that Google then internally said, "Damn the safety concerns, full speed ahead!" and now we find ourselves in this breakneck race in which all safety concerns have been sidelined.
Scaling seemed like the important idea that everyone was chasing. OpenAI used to be a lot more safety minded because it was in their non profit charter, now they’ve gone for-profit and weaponized their tech for the USA military. Pretty wild turnaround. Saying OpenAI was cavalier with safety in the early days is inaccurate. It was a skill issue. Remember Bard? Google was slow.
What OpenAI did was train increasingly large transformer model instances, which was sensible because transformers allowed for a scaling up of training compared to earlier models. The resulting instances (GPT) showed good understanding of natural language syntax and generation of mostly sensible text (which was unprecedented at the time), so they made ChatGPT by adding new stages of supervised fine-tuning and RLHF to their pretrained text-prediction models.
On the one hand, not publishing any new models for an architecture in almost a year seems like forever given how things are moving right now. On the other hand, I don't think that's very conclusive as to whether they've given up on it or just have other, higher-priority research directions to pursue first.
The most benign answer would be that they don’t want to further support an emerging competitor to OpenAI, which they have significant business ties to. I think the more likely answer which you hinted at is that the utility of the model falls apart as scale increases. They see the approach as a dead end so they are throwing the scraps out to the stray dogs.
Not to mention Microsoft's investments in Nvidia and other GPU-adjacent/dependent companies!
A successful ternary model would basically erase all that value overnight. In fact, the entire stock market could crash!
Think about it: This is Microsoft we're talking about! They're a convicted monopolist that has a history of manipulating the market for IT goods and services. I wouldn't put it past them to refuse to invest in training a ternary model or going so far as to buy up ternary startups just to shut them down.
Want to make some easy money: Start a business training a ternary model and make an offer to Microsoft. I bet they'll buy you out for at least a few million even if you don't have a product yet!
So is it finally time for a Beowulf cluster to do something amazing?
Cannot agree more!
I've also always thought that it's an interesting opportunity for custom hardware. Two-bit addition is incredibly cheap in hardware, especially compared to anything involving floating point. You could make huge vector instructions on the cheap, then connect it to the fastest memory you can buy, and you have a capable inference chip.
You'd still need full GPUs for training, but for inference the hardware would be orders of magnitude simpler than what Nvidia is making.
You only need GPUs if you assume the training is gradient descent. GAs or anything else that can handle nonlinearities would be fine, and possibly fast enough to be interesting.
> a fundamentally different compute profile on commodity CPU
In what way? On modern processors, a Fused Multiply-Add (FMA) instruction generally has the exact same execution throughput as a basic addition instruction
The win is in how many weights you process per instruction and how much data you load.
So it's not that individual ops are faster — it's that the packed representation lets each instruction do more useful work, and you're moving far less data from memory to do it.
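A back-of-the-envelope sketch of the "moving far less data" part, assuming decoding is purely memory-bound and every weight is read once per generated token. The 100 GB/s bandwidth figure is a made-up round number, not a measurement of any particular CPU:

```python
def weight_bytes(params, bits_per_weight):
    """Size of the weights in bytes at a given storage width."""
    return params * bits_per_weight / 8

def max_tokens_per_second(params, bits_per_weight, bandwidth_bytes_per_s):
    """Upper bound on decode speed if each token requires streaming all weights once."""
    return bandwidth_bytes_per_s / weight_bytes(params, bits_per_weight)

ASSUMED_BANDWIDTH = 100e9  # hypothetical 100 GB/s of memory bandwidth
PARAMS = 100e9             # a 100B-parameter model

for label, bits in [("fp16", 16), ("int4", 4), ("2-bit packed ternary", 2)]:
    size_gb = weight_bytes(PARAMS, bits) / 1e9
    rate = max_tokens_per_second(PARAMS, bits, ASSUMED_BANDWIDTH)
    print(f"{label:>22}: {size_gb:6.1f} GB, at most {rate:.2f} tok/s")
```

Under those assumptions the ceiling scales inversely with bits per weight, so going from 16 bits to 2 bits raises it by 8x before any compute effects are considered.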
You drop the memory throughput requirements because of the packed representation of bits so an FMA can become the bottleneck, and you bypass the problem of needing to upscale the bits to whatever FP the FMA instruction needs.
Typically for 1-bit matmul, you can get away with XORs and popcounts, which should have a better throughput profile than FMA when taking into account the SIMD nature of the inputs/outputs.
Perhaps BitNet's encoding is more information-dense per byte? CPUs have slow buses, so it would eke more use out of the available bandwidth?
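The XOR/popcount trick for true 1-bit (+1/-1) vectors can be sketched like this. The helper names are hypothetical, and note this covers the two-state case only, not the three-state weights BitNet b1.58 uses:

```python
def pack_signs(values):
    """Pack a list of +1/-1 values into an int; bit i is set when values[i] == +1."""
    bits = 0
    for i, v in enumerate(values):
        if v == 1:
            bits |= 1 << i
    return bits

def binary_dot(a_bits, b_bits, n):
    """Dot product of two length-n +1/-1 vectors from their packed bits.

    Positions where the bits differ contribute -1, matching positions
    contribute +1, so dot = (n - mismatches) - mismatches.
    """
    mismatches = bin(a_bits ^ b_bits).count("1")  # popcount of the XOR
    return n - 2 * mismatches

a = [1, -1, 1, 1]
b = [1, 1, -1, 1]
packed = binary_dot(pack_signs(a), pack_signs(b), len(a))
reference = sum(x * y for x, y in zip(a, b))
print(packed, reference)
```

On real hardware the same idea runs over 64 or more elements per XOR and popcount, which is where the throughput advantage comes from.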
Yes. I had to read it over twice, it does strike me as odd that there wasn't a base model to work with.
But it seems the biggest model available is 10B? Somewhat unusual and does make me wonder just how challenging it will be to train any model in the 100B order of magnitude.
Approximately as challenging as training a regular 100B model from scratch. Maybe a bit more challenging because there's less experience with it.
The key insight of the BitNet paper was that using their custom BitLinear layer instead of normal Linear layers (as well as some more training and architecture changes) led to much, much better results than quantizing an existing model down to 1.58 bits. So you end up making a full training run in bf16 precision using the specially adapted model architecture.
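For a rough idea of what the weight quantization step inside such a layer looks like, here is a sketch of absmean ternarization as I recall it from the b1.58 paper: scale by the mean absolute weight, round, clip to [-1, 1]. The function name and the `eps` value are my own choices, and the training-side machinery (full-precision latent weights, gradients passed through the rounding) is not shown:

```python
def absmean_ternarize(weights, eps=1e-8):
    """Map full-precision weights to {-1, 0, 1} plus one shared scale.

    Assumed recipe: divide by the mean absolute value, round to the
    nearest integer, then clip into [-1, 1].
    """
    scale = sum(abs(w) for w in weights) / len(weights)
    ternary = [max(-1, min(1, round(w / (scale + eps)))) for w in weights]
    return ternary, scale

latent = [0.9, -0.05, 0.4, -1.2]  # made-up full-precision weights
ternary, scale = absmean_ternarize(latent)
print(ternary, scale)
```

The point of training with this in the loop, rather than applying it once to a finished model, is that the network gets to adapt its latent weights to the rounding.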
What's unusual about it? It seems pretty standard to train small models to validate an approach, and then show that training scales with model size to 8B to 14B parameter models, which is what they did.
It comes from (intentionally?) misleading docs: https://github.com/microsoft/BitNet/issues/391
(only suggesting that it's intentional because it's been there so long)
That issue appears to be the one that's wrong. From the technical report:
> We evaluated bitnet.cpp in terms of both inference speed and energy cost. Comprehensive tests were conducted on models with various parameter sizes, ranging from 125M to 100B. specific configurations for each model are detailed in the Appendix A.
Thanks for pointing that out. I'll ask the issue creator if they've considered that. Would be nice if the maintainer would handle that (sigh) and link to the actual models used for testing (double sigh).
In their demo they're running a 3B model.
> The 1.58-bit approach
Can we stop already with these decimals and just call it "1 trit", which is exactly what it is?
LLM account
I browsed through the history of the user and confirm this statement. I know that there are users who say they used em-dashes even before the rise of ChatGPT and HN statistics support that. For example, one prominent example is dang.
However this user uses — in almost all his posts and he had a speed of 1 comment per minute or so on multiple different topics.
Hmm, the user joined in 2019 but had no submissions or comments until just 40 minutes ago (at least judging by the lack of a second page?) and all the comments are on AI related submissions. Benefit of doubt is it'd have to be a very dedicated lurker or dormant account they remembered they had.
Edit: oh, just recalled dang restricted Show HNs the other day to only non-new users (possibly with some other thresholds). I wonder if word got out and some are filling accounts with activity.
There has been a shift among the AI accounts: they use Show HN less now. This started before dang's comment, I assume because they saw the earlier posts about the increase in quantity / decrease in quality.
I suspect that they are trying to fake engagement prior to making their first "show" post as well.
Agreed. This is becoming an issue, see also: https://news.ycombinator.com/item?id=47259308
Funny enough I now involuntarily take RTFA as a slight slop signal, because all these accounts dutifully read the article before commenting, unlike most HNers who often respond to headlines.
First they claimed that if you use em dashes you are not human
And I did not speak out
Because I was not using em dashes
Then they claimed that if you're crammar is to gud you r not hmuan
And I did not spek aut
Because mi gramar sukcs
Then they claimed that if you actually read the article that you are trying to discuss you are not human...
I’ve been rounded up for things I wrote two decades ago because of my em dashes lol. The pitchfork mentality gives me little hope for how things are going to go once we have hive mind AGI robots pervasive in society.
If I was operating a bot farm, at this point I would probably add some bots that go around and accuse legit human users (or just random users) of being bots.
The resulting confusion and frustration will make it much harder for most people to separate signal from noise.
Not all of them do: https://news.ycombinator.com/item?id=47335156 There are evidently lots of people experimenting with different botting setups. Some do better at blending in than others.
Interesting - the account you mention, and the GP, are both doing replies that are themselves all about the same length, and also the same length between the two accounts. I get what you mean.
Yeah. It correctly pointed out that the editorialized HN title is wrong, there is no 100B model.
I would love to understand the thought process behind this. I'm sure it's a fun experiment, to see if it's possible and so on... but what tangible benefit could there be to burning tokens to spam comments on every post?
Check out the new QWEN coder model.
Also, aren't there different affinities for 8-bit vs. 4-bit when it comes to inference?
> I run quantized 70B models locally (M2 Max 96GB, llama.cpp + LiteLLM), and memory bandwidth is always the bottleneck.
I imagine you got 96GB because you thought you'd be running models locally? Did you not know the phrase Unified Memory is marketing speak?
> bitnet.cpp is the official inference framework for 1-bit LLMs (e.g., BitNet b1.58). It offers a suite of optimized kernels, that support fast and lossless inference of 1.58-bit models on CPU and GPU (NPU support will coming next).
One bit or one trit? I am confused!
"1-bit LLMs" is just marketing. The Shannon entropy of one letter with a 3 symbol alphabet (-1, 0, 1) is 1.58.
Log Base 2 of 3 = ~1.5849625, so that's the limit to how well you can pack three-state values into bits of data.
For something more practical, you can pack five three-state values within a byte because 3^5 = 243, which is smaller than 256. To unpack, you divide and modulo by 3 five separate times. This encodes data in bytes at 1.6 bits per symbol.
But packing 5 symbols into a byte was not done here. Instead, they packed 4 symbols into a byte (2 bits each) to reduce computational complexity (no divide-and-modulo unpacking needed).
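To make the two packings above concrete, here is a minimal sketch in Python. The function names are made up for illustration, and this is not claimed to be the byte layout bitnet.cpp actually uses; it only shows the base-3 trick (5 trits per byte) next to the simpler 2-bits-per-trit layout (4 trits per byte).

```python
import math

def pack5(trits):
    """Pack five values from {-1, 0, 1} into one byte as a base-3 number."""
    assert len(trits) == 5
    value = 0
    for t in reversed(trits):
        value = value * 3 + (t + 1)  # map -1, 0, 1 to digits 0, 1, 2
    return value  # always in 0..242, so it fits in a byte

def unpack5(byte):
    """Recover five trits by dividing and taking the remainder mod 3 five times."""
    out = []
    for _ in range(5):
        byte, digit = divmod(byte, 3)
        out.append(digit - 1)
    return out

def pack4(trits):
    """Pack four trits at 2 bits each; one of the four 2-bit codes goes unused."""
    assert len(trits) == 4
    value = 0
    for i, t in enumerate(trits):
        value |= (t + 1) << (2 * i)
    return value

def unpack4(byte):
    """Recover four trits with shifts and masks only (no division)."""
    return [((byte >> (2 * i)) & 0b11) - 1 for i in range(4)]

# Bits spent per symbol in each scheme, versus the log2(3) lower bound.
print(8 / 5, 8 / 4, math.log2(3))
```

The 5-per-byte form is about 1% above the log2(3) bound, while the 4-per-byte form spends 2 bits per weight but unpacks with cheap bit operations.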
>1-bit model
>packed 4 symbols into a byte
microslop, typical bunch of two-bit frauds!
Yeah, "1.58 bit" is 1 trit with three states, since log2(3)≈1.58.
So it's not an inference framework for 1-bit models (two states per parameter) but for 1.58-bit models (three states per parameter). Annoying that they try to mix up the two.
I always hope for "just a bunch of if statements" ... this is not it.
it's if {} else if {} else {}
One of the things I often wonder is "what will be the minimally viable LLM" that can work from just enough information that if it googles the rest it can provide reasonable answers? I'm surprised something like Encyclopaedia Britannica hasn't yet (afaik) tried to capitalize on AI by selling their data to LLMs and validating outputs for LLM companies; it would make a night and day difference in some areas, I would think. Wikipedia is nice, but there's so much room for human error and bias there.
Your worry about Wikipedia is that there is "much room for human error and bias", yet earlier you seem to imply that an LLM that has access to the www somehow would have less human error and bias? Personally, I'd see it the other way around.
When GPT 3.5 became a thing, it had crawled a very nuanced set of websites, this is what I mean. You basically curate where it sources data from.
It's not so much a "minimally viable LLM" but rather an LLM that knows natural language well but knows nothing else. Like me - as an engineer who knows how to troubleshoot in general but doesn't know about a specific device like my furnace (recent example).
And I don't think that LLM could just Google or check Wikipedia.
But I do agree that this architecture makes a lot of sense. I assume it will become the norm to use such edge LLMs.
Correct! I know RAG is a thing, but I wish we could have "DLCs" for LLMs, the way image generation has LoRAs, which are cheaper to train than retraining the entire model and give you output closer to what you want. I would love to pop in the CS "LoRA or DLC" and ask it about functional programming in Elixir, or whatever.
Maybe not crawl the web, but hit a service with pre-hosted, pre-curated content it can digest (and cache) that doesn't necessarily change very often. You aren't necessarily using it for the latest news, and programming, being mostly static knowledge, is a good example.
> validating outputs for LLM companies
How? They can validate thousands if not millions of queries, but nothing prevents the million-and-first from being a hallucination. People who pay extra for an "Encyclopaedia Britannica validated LLM" would then, rightfully so IMHO, complain that "it" suggested they cook with a dangerous mushroom.
Isn’t that sort of what a RAG is? You’d need an LLM “smart” enough to turn natural-user prompts into searches, then some kind of search, then an LLM “smart” enough to summarize the results.
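A rough sketch of that three-stage shape, in Python. Everything here is hypothetical: `to_query` and `summarize` stand in for the two LLM calls, and `retrieve` is a toy word-overlap search rather than a real index. It only illustrates how the pieces connect.

```python
from typing import Callable, List

def retrieve(query: str, docs: List[str], k: int = 2) -> List[str]:
    """Toy search: rank documents by how many words they share with the query."""
    q = set(query.lower().split())
    ranked = sorted(docs, key=lambda d: len(q & set(d.lower().split())), reverse=True)
    return ranked[:k]

def answer(prompt: str,
           docs: List[str],
           to_query: Callable[[str], str],
           summarize: Callable[[str, List[str]], str],
           k: int = 2) -> str:
    """Prompt -> search query -> retrieved documents -> summarized answer."""
    query = to_query(prompt)          # stage 1: an LLM would rewrite the prompt here
    hits = retrieve(query, docs, k)   # stage 2: any search backend
    return summarize(prompt, hits)    # stage 3: an LLM would condense the hits here

# Stand-ins for the two model calls, so the sketch runs without any model.
docs = [
    "go prints hello world with fmt.Println",
    "elixir is a functional language",
    "furnace filters need replacing",
]
result = answer("functional elixir", docs,
                to_query=lambda p: p,
                summarize=lambda p, hits: " ".join(hits),
                k=1)
print(result)
```

In a real setup the small local model would only need to be good at stages 1 and 3; the knowledge itself lives in whatever `retrieve` searches.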
Yeah, I think RAG is the idea that will lead us there, though it's a little complicated. For some subjects, say Computer Science, you need a little more than just "This is Hello World in Go": you might need to understand not just Go syntax on the fly, but also CS nuances that are not covered in one single simple document. The idea is to have a model that runs fully locally on a phone or laptop with minimal resources. On the other hand, I can also see smaller models talking to larger models that are cheaper to run in the cloud. I am wondering if this is the approach Apple might take with Siri, specifically in order to retain user privacy as much as possible.
Since Google Search already includes an AI summary, your minimally viable "LLM" can be just an HTTP GET call
Wikipedia has proven to be as accurate as encyclopedias for decades now. Also, I'm betting AI companies have illegally trained their models on the Encyclopaedia Britannica's data by now.
It's good to see this getting some continued development. I looked into it last year[1] and I thought it showed a lot of promise so I've been very disappointed that I never saw a newer model.
[1] - https://jackson.dev/post/dont-sleep-on-bitnet/
I think this approach is not so interesting because it's just quantization of a full precision model. So it speeds up inference (at a quality penalty) but not training. It would be more interesting to train an actually binary model directly, without any floating point multiplication, like in this paper: https://proceedings.neurips.cc/paper_files/paper/2024/hash/7...
> A demo of bitnet.cpp running a BitNet b1.58 3B model on Apple M2
With how much RAM? How much storage does it require?
but there is no trained 100b param model? "can run a 100B BitNet" is about the inference implementation, not about the existence of any such model
I think they used a dummy model or else they would have linked to it. Just google '1-bit 100b model' and you'll only see references to this project without any download links.
The output from this model is horrible! It's GPT-2 level babble and repeats entire paragraphs verbatim. It also reuses the same fake citation `(Jenkins, 2010)` over and over again. From the start of their video (which scrolls by fast enough that you don't see the slop clearly...)
```
Ecosystem Services and their impact on the Ecosystem
Ecosystem services refer to the services provided by ecosystems to the human society. These services include water, air, energy, nutrients, and soil (Jenkins, 2010). For instance, water is the most important service provided by an ecosystem and it helps in the conservation of water, irrigation and sanitation (Jenkins, 2010). On the other hand, air provides the oxygen needed for life.
The water cycle is a significant ecosystem service because it involves the cycling of water among the different parts of an ecosystem. It also involves the movement of water through the atmosphere, from one place to another. It is also the process of evaporation and condensation of water from the atmosphere. It also involves the movement of water from the air to the soil and water into the oceans.
The water cycle is a significant ecosystem service because it involves the cycling of water among the different parts of an ecosystem. It also involves the movement of water through the atmosphere, from one place to another. It is also the process of evaporation and condensation of water from the atmosphere. It also involves the movement of water from the air to the soil and water into the oceans.
```
It's a two year old base model that's only 3B parameters, trained on only 100B tokens. It's still a research project at this point.
I'm curious if 1-bit params can be compared to 4- or 8-bit params. I imagine that 100B is equivalent to something like a 30B model? I guess only evals can say. Still, being able to run a 30B model at good speed on a CPU would be amazing.
At some point you hit information limits. With conventional quantisation you see marked capability fall-off below q5. All else being equal you'd expect an N-parameter 5-bit quant to be roughly comparable to a 3N-parameter ternary, if they are trained to the same level, just in terms of the amount of information they can possibly hold. So yes, 100B ternary would be within the ballpark of a 30B q5 conventional model, with a lot of hand-waving and sufficiently-smart training.
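The back-of-envelope arithmetic behind that ratio can be written out as below. This is only a raw storage-capacity comparison under the assumption that each ternary weight carries at most log2(3) bits; it says nothing about actual model quality, and `equivalent_params` is a made-up helper name.

```python
import math

BITS_PER_TRIT = math.log2(3)  # about 1.585 bits per ternary weight

def equivalent_params(ternary_params: float, bits_per_param: float = 5.0) -> float:
    """Parameter count of a conventional quant holding the same number of raw bits."""
    return ternary_params * BITS_PER_TRIT / bits_per_param

# How many ternary weights match one 5-bit weight in raw capacity (roughly 3).
ratio = 5.0 / BITS_PER_TRIT

# 100B ternary weights expressed as an equal-capacity 5-bit model (roughly 30B).
q5_equivalent = equivalent_params(100e9)
print(f"{ratio:.2f}x, {q5_equivalent / 1e9:.1f}B")
```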
I assume that theoretically, 1-bit models could be most efficient because modern models switched from 32 bit to 16 bit to 8 bit per parameter (without quantization).
Headline: 100B. Falcon 3 family: 10B. An order of magnitude off
If they had a big result like, native 1.58-bit quality clearly matches top peers, they would be saying that prominently in the repo.
The engineering/optimization work is nice, but it is not what people have been waiting for, which is an answer to: can the BitNet idea that seemed so promising really deliver in a competitive way?
steve jobs would have loved the microsoft repo with demo on mac
headline hundred billion parameter, none of the official models are over 10 billion parameters. Curious.
The project is an inference framework which should support a 100B-parameter model at 5-7 tok/s on CPU. No one has quantized a 100B-parameter model to 1 trit, but the existence of this framework is an incentive for someone to do so.
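For a sense of scale, here is a rough weights-only size estimate in Python. It assumes nothing about bitnet.cpp's real file formats: it ignores scales, embeddings kept at higher precision, activations and the KV cache, and `weight_gigabytes` is a hypothetical helper.

```python
def weight_gigabytes(params: float, bits_per_weight: float) -> float:
    """Storage for the weights alone, in decimal gigabytes."""
    return params * bits_per_weight / 8 / 1e9

PARAMS = 100e9  # the headline 100B figure

sizes = {
    "fp16 (16 bits)": weight_gigabytes(PARAMS, 16),
    "4 trits per byte (2 bits)": weight_gigabytes(PARAMS, 2),
    "5 trits per byte (1.6 bits)": weight_gigabytes(PARAMS, 1.6),
}
for name, gb in sizes.items():
    print(f"{name}: about {gb:.0f} GB")
```

Under these assumptions the ternary weights land in the 20 to 25 GB range, versus about 200 GB at fp16, which is why CPU inference from ordinary RAM becomes plausible at all.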
What’s the lower limit on the number of bits per parameter? If you use CSR-style sparse matrices to store the weights can it be less than 1?
Misleading title, but this is pretty exciting. Interesting how this is based on llama.cpp. It's nice to see some momentum since they released the paper in 2023.
Anyone know how hard it would be to create a 1-bit variant of one of the recent Qwen 3.5 models?
There are q2 and q1 quants, if you want an idea of how much performance you'd drop. Not quite the same implementation-wise, but probably equivalent in terms of smarts.
Almost trivial using open source tools, the question is how it performs without calibration/fine tuning.
The results would probably be underwhelming. The bitnet paper doesn't give great baselines to compare to, but in their tests a 2B network trained for 1.58bits using their architecture was better than Llama 3 8B quantized to 1.58bits. Though that 2B network was about on par with a 1.5B qwen2.5.
If you have an existing network, making an int4 quant is the better tradeoff. 1.58-bit quants only become interesting when you train the model specifically for it.
On the other hand maybe it works much better than expected because llama3 is just a terrible baseline
I think the README [1] for the new CPU feature is of more interest, showing linear speedups with number of threads. Up to 73 tokens/sec with 8 threads (64 toks/s for their recommended Q6 quant):
https://github.com/microsoft/BitNet/blob/main/src/README.md
https://github-production-user-asset-6210df.s3.amazonaws.com...
demo shows a huge love for water, this AI knows its home
Also, very influenced by the literature of Jenkins (2010).
It might interest you to know that one or two months ago, I had Claude port BitNet to WebGPU from the reference implementation, so that it runs right in your browser as a local model. After some debugging, the port seemed to work, but the model didn't function as well as the reference implementation, so I'll have to work on it for a while. You can see a debugging session livestreamed here[1]. The released model file was about a gigabyte, so it fits in most people's GPUs. We were also able to successfully fine-tune it right in the browser.
There's a lot that you can do when the model size is that small, yet still powerful.
Our next step is that we want to put up a content distribution network for it where people can also share their diffs for their own fine-tuned model. I'll post the project if we finish all the parts.
[1] https://www.youtube.com/live/x791YvPIhFo?is=NfuDFTm9HjvA3nzN
No 100b model.
My disappointment is immeasurable and my day is ruined.