BitNet b1.58 2B4T Technical Report

(arxiv.org)

110 points | by galeos 5 days ago

32 comments

  • nopelynopington 5 days ago

    I built it at home this morning and tried it; perhaps my expectations were too high, but I wasn't terribly impressed. I asked it for a list of ten types of data I might show on a home info display panel. It gave me three. I clarified that I wanted ten; it gave me six. Every request after that just returned the same six things.

    I know it's not ChatGPT-4, but I've tried other very small models that run on CPU only and had better results.

    • Me1000 5 days ago

      This is a technology demo, not a model you'd want to use. Because BitNet models average only 1.58 bits per weight, you'd expect the model to need a much larger parameter count than its fp8/fp16 counterparts. Plus, this is only a 2-billion-parameter model in the first place; even fp16 2B-parameter models generally perform pretty poorly.

      • nopelynopington 5 days ago

        OK, that's fair. I still think something was up with my build, though; the online demo worked far better than my local build.

    • ashirviskas 5 days ago

      > I've tried other very small models that run on CPU only and had better results

      Maybe you can share some comparative examples?

      • nopelynopington 5 days ago

        sure, here's my conversation with BitNet b1.58 2B4T

        https://pastebin.com/ZZ1tADvp

        here's the same prompt given to smollm2:135m

        https://pastebin.com/SZCL5WkC

        The quality of the second result isn't fantastic. The data isn't public, and it repeats itself, mentioning income a few times. I don't think I would use either of these models for accurate data, but I was surprised at the truncated results from BitNet.

        Smollm2:360M returned better-quality results with no repetition, but it did suggest things that didn't fit the brief exactly (public data, given location only).

        https://pastebin.com/PRFqnqVF

        Edit:

        I tried the same query on the live demo site and got much better results. Maybe something went wrong on my end?

  • akoboldfrying 5 days ago

    They give some description of how their weights are stored: they pack 4 weights into an int8, indicating that their storage format isn't optimal (2 bits per weight instead of the optimal ~1.58 bits). But I don't know enough about LLM internals to know how material this is.

    Could anyone break down the steps further?

    • Fubwubs 5 days ago

      This model maps weights to ternary values {-1, 0, 1} (aka trits). One trit holds log(3)/log(2) ≈ 1.58 bits of information. To represent a single trit by itself would require 2 bits, but it is possible to pack 5 trits into 8 bits. This article explains it well: https://compilade.net/blog/ternary-packing

      By using 4 ternary weights per 8 bits, the model is not quite as space-efficient as it could be in terms of information density: (4 × 1.58)/8 = 0.79 vs (5 × 1.58)/8 ≈ 0.99. There is currently no hardware acceleration for doing operations on 5 trits packed into 8 bits, so the weights have to be packed and unpacked in software, and packing 5 weights into 8 bits requires slower, more complex packing/unpacking algorithms.
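
      For illustration, here's a minimal Python sketch of the 4-weights-per-byte scheme; the exact bit layout the released kernels use is an assumption on my part:

          # Sketch: pack 4 ternary weights {-1, 0, +1} into one byte at 2 bits each.
          # The real kernel's bit order/layout may differ.
          def pack4(weights):                      # weights: 4 values in {-1, 0, 1}
              b = 0
              for i, w in enumerate(weights):
                  b |= (w + 1) << (2 * i)          # map {-1, 0, 1} -> {0, 1, 2}
              return b                             # fits in one uint8

          def unpack4(b):
              return [((b >> (2 * i)) & 0b11) - 1 for i in range(4)]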

      • akoboldfrying 4 days ago

        That link gives a great description of how to pack trits more efficiently, thanks. Encoding in "base 3" was obvious to me, but I didn't realise that 5 trits fit quite tightly into a byte, or that it's possible to "space the values apart" so that they can be extracted using just multiplications and bitwise ops (no division or remainder).
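
        For anyone else curious, here's a rough Python sketch of that fixed-point idea as I understand it from the post (the exact constants and layout there may differ): pack the 5 trits as a base-3 number v in [0, 242] and store ceil(v * 256 / 243), so each digit comes back out with a multiply and a shift.

            # Rough sketch of the fixed-point trick; not the blog's exact code.
            def pack5(trits):                    # 5 values in {0, 1, 2}
                v = 0
                for t in trits:
                    v = v * 3 + t                # base-3 number in [0, 242]
                return (v * 256 + 242) // 243    # ceiling of v * 256 / 243

            def unpack5(b):
                out = []
                for _ in range(5):
                    b *= 3
                    out.append(b >> 8)           # high bits = the next trit
                    b &= 0xFF                    # keep the fractional part
                return out

        Packing still needs a division for the ceiling, but only unpacking has to be fast at inference time, and that part is just multiplies, shifts, and masks.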

  • Havoc 5 days ago

    Is there a reason why the 1.58-bit models are always quite small? I think I've seen an 8B, but that's about it.

    Is there a technical reason for it, or is it just research convenience?

    • londons_explore 5 days ago

      I suspect it's because current GPU hardware can't efficiently train such low-bit-depth models. You end up needing 8 or 16 bits for the activations in all the data paths, and you don't get any more throughput per cycle on the multiplications than you would have with FP32.

      Custom silicon would solve that, but nobody wants to build custom silicon for a data format that will go out of fashion before the production run is done.

      • zamadatix 5 days ago

        The custom CUDA kernel for 4-in-8 seems to have come out better than a naive approach (such as just treating each weight as an fp8/int8), and it lowers memory bandwidth requirements. Custom hardware would certainly make that improvement even better, but I don't think that's what's limiting training to 2-8 billion parameters so much as research convenience while the groundwork for this type of model is still being figured out.

      • Havoc 5 days ago

        Makes sense. It might be good for memory-throughput-constrained devices, though, so I'm hoping it picks up.

    • yieldcrv 5 days ago

      They aren’t; there is a 1.58-bit version of DeepSeek that’s around 200 GB instead of 700 GB.

      • logicchains 5 days ago

        That's not a real BitNet, it's just post-training quantisation, and its performance suffers compared to a model trained from scratch at 1.58 bits.

  • galeos 5 days ago

    You can try out the model in a demo they have set up: https://bitnet-demo.azurewebsites.net/

  • Thoreandan 5 days ago

    I guess B1FF@BITNET posts are gonna come from an LLM now.

    Context: https://web.archive.org/web/20030830105202/http://www.catb.o...

  • balazstorok 5 days ago

    Does anyone have a good understanding of how 2B models can be useful in production? What tasks are you using them for? I wonder what tasks you can fine-tune them on to produce 95-99% results (if anything).

    • nialse 5 days ago

      Use cases for small models include sentiment and intent analysis, spam and abuse detection, and classification of various sorts. LLMs are generally thought of as chat models, but the output need not be a conversation per se.

      • mhitza 5 days ago

        My impression was that text embeddings are better suited for classification. Of course the big caveat is that the embeddings must have "internalized" the semantic concept you're trying to map.

        From a draft article of mine, experimenting with open-source text embeddings:

            ./match venture capital
            purchase           0.74005488647684
            sale               0.80926752301733
            place              0.81188663814236
            positive sentiment 0.90793311875207
            negative sentiment 0.91083707598925
            time               0.9108697315425
         
            ./store sillicon valley
            ./match venture capital
            sillicon valley    0.7245139487301
            purchase           0.74005488647684
            sale               0.80926752301733
            place              0.81188663814236
            positive sentiment 0.90793311875207
            negative sentiment 0.91083707598925
            time               0.9108697315425
        
        Of course, you need to figure out what these black boxes understand. For example, for sentiment analysis, instead of having it match against "positive" and "negative", you might make the matching terms "kawaii" and "student debt", depending on how the text embedding internalized positives and negatives from its training data.
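
        The scores above are distances, so lower means closer here. A minimal sketch of the same idea using cosine similarity with the sentence-transformers library (the model name and label phrases are just placeholders, not what my draft uses):

            # Embed the text and a set of candidate label phrases, then pick the
            # label with the highest cosine similarity to the text.
            from sentence_transformers import SentenceTransformer, util

            model = SentenceTransformer("all-MiniLM-L6-v2")   # placeholder model
            labels = ["positive sentiment", "negative sentiment"]
            label_emb = model.encode(labels)

            def classify(text):
                text_emb = model.encode(text)
                scores = util.cos_sim(text_emb, label_emb)[0]
                return labels[int(scores.argmax())]

            print(classify("I love this home info display"))
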
    • snovv_crash 5 days ago

      Anything you'd normally train a smaller custom model for, but with an LLM you can use a prompt instead of training.

    • logicchains 5 days ago

      2B models by themselves aren't so useful, but this one is very interesting as a proof of concept: the same technique used to train a 200B model could produce one that's much more efficient (cheaper and more environmentally friendly) than existing 200B models, especially with specialised hardware support.

    • meltyness 5 days ago

      I'm more interested in how users are taking 95-99% to 99.99% for generation-assisted tasks. I haven't seen a review or study of techniques, even though on the ground it's pretty trivial to think of some candidates.

      • oezi 5 days ago

        Three strategies seem to be:

        - Use an LLM to evaluate the result and retry if it doesn't match (see the sketch below).

        - Let users trigger a retry.

        - Let users edit the output.
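
        A minimal sketch of the first strategy; generate() and looks_valid() are hypothetical stand-ins for whatever model call and check you actually use:

            # Generate, evaluate, retry: keep asking until the output passes the
            # check or we run out of attempts, then fall back to letting the user
            # edit or manually retry.
            def generate_with_retries(prompt, generate, looks_valid, max_attempts=3):
                result = None
                for _ in range(max_attempts):
                    result = generate(prompt)    # hypothetical LLM call
                    if looks_valid(result):      # hypothetical LLM judge / schema check
                        return result
                return result                    # last (failing) attempt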

    • future10se 5 days ago

      The on-device models used for Apple Intelligence (writing tools, notification and email/message summaries, etc.) are around 3B parameters.

      I mean, they could be better (to put it nicely), but there is a legitimate use case for them, and I'd love to see more work in this space.

      https://machinelearning.apple.com/research/introducing-apple...

      https://arxiv.org/abs/2407.21075

    • Lapel2742 5 days ago

    I'm just playing and experimenting with local LLMs, just to see what I can do with them. One thing that comes to mind is gaming, e.g. text/dialogue generation in procedural worlds and adventures.

    • throwaway314155 5 days ago

    Summarization on mobile/embedded might be a good use case?

  • rbanffy 5 days ago

    Not to be confused with BITNET

    https://en.m.wikipedia.org/wiki/BITNET

  • rcMgD2BwE72F 5 days ago

    I asked about the last French election, and the first sentence was:

    >Marine Le Pen, a prominent figure in France, won the 2017 presidential election despite not championing neoliberalism. Several factors contributed to her success: (…)

    What data did they train their model on?