Microsoft BitNet: inference framework for 1-bit LLMs

(github.com)

67 points | by galeos 7 hours ago

14 comments

  • newfocogi 2 hours ago

    I'm enthusiastic about BitNet and the potential of low-bit LLMs - the papers show impressive perplexity scores matching full-precision models while drastically reducing compute and memory requirements. What's puzzling is we're not seeing any major providers announce plans to leverage this for their flagship models, despite the clear efficiency gains that could theoretically enable much larger architectures. I suspect there might be some hidden engineering challenges around specialized hardware requirements or training stability that aren't fully captured in the academic results, but would love insights from anyone closer to production deployment of these techniques.

    • swfsql an hour ago

      I think that since training must happen on a non-bitnet architecture, tuning towards bitnet is always a downgrade of its capabilities, so they're not really interested in it. But maybe they could be if it let them offer cheaper plans, since its efficiency is relatively good.

      I think the real market for this is for local inference.

    • waynenilsen 43 minutes ago

      I suppose hardware support would be very helpful: new instructions for bitpacked operations?

    • strangescript 2 hours ago

      I find it a little confusing as well. I wonder if it's because so many of these companies have gone all in on the "traditional" approach that deviating now seems like a big shift?

  • zamadatix 2 hours ago

    For anyone that hasn't read the previous papers before: the "1.58-bit" part comes from using 3 values (-1, 0, 1), and log2(3) = 1.58...
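
    (To make the arithmetic concrete, a rough sketch, not code from the repo: five ternary weights fit in one byte because 3^5 = 243 <= 256, i.e. 1.6 bits per weight, close to the log2(3) ≈ 1.585 lower bound.)

        #include <cmath>
        #include <cstdint>
        #include <cstdio>

        int main() {
            // Information content of one ternary weight in {-1, 0, 1}:
            printf("log2(3) = %.4f bits\n", std::log2(3.0));   // 1.5850

            // One way to approach that bound: base-3 pack 5 "trits"
            // into one byte (3^5 = 243 <= 256), i.e. 8/5 = 1.6 bits each.
            int trits[5] = {-1, 0, 1, 1, -1};          // made-up weights
            uint8_t packed = 0;
            for (int i = 0; i < 5; i++)
                packed = packed * 3 + (trits[i] + 1);  // {-1,0,1} -> {0,1,2}

            // Unpack in reverse order to recover the weights.
            for (int i = 4; i >= 0; i--) {
                printf("trit %d = %d\n", i, (packed % 3) - 1);
                packed /= 3;
            }
        }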

  • alkh 22 minutes ago

    Sorry for a stupid question, but to clarify: even though it is a 1-bit model, is it supposed to work with any type of embeddings, even ones taken from larger LLMs (in their example, they use HF1BitLLM/Llama3-8B-1.58-100B-tokens)? I.e., does it not have an embedding layer built in, and instead rely on embeddings provided separately?

  • wwwtyro an hour ago

    Can anyone help me understand how this works without special, bitnet-precision-specific hardware? Is special hardware unnecessary? Maybe it just doesn't reach the full bitnet potential without it? Or maybe it does, with some fancy tricks? Thanks!

    • hansvm an hour ago

      I haven't checked this one out yet, but a common trick is combining instructions with data invariants that let you work in "lanes".

      The easiest example is xor, which can trivially be interpreted as either xoring one large integer or xoring a vector of smaller integers.
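
      (A minimal sketch of that: xor never carries across bit boundaries, so one 64-bit xor is simultaneously eight independent 8-bit xors.)

          #include <cstdint>

          // One 64-bit xor doubles as eight 8-bit "lane" xors: xor has no
          // carries, so the byte lanes can never interfere with each other.
          uint64_t xor_lanes(uint64_t a, uint64_t b) {
              return a ^ b;  // byte i of result == (byte i of a) ^ (byte i of b)
          }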

      Take a look at the SWAR example here [0]; it's a pretty common, easy instance of that technique being good for something in the real world.
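
      (For reference, the classic SWAR popcount that post covers, or something close to it: bits are summed pairwise in parallel lanes, doubling the lane width each step, with no per-bit loop.)

          #include <cstdint>

          int popcount64(uint64_t x) {
              x = x - ((x >> 1) & 0x5555555555555555ULL);    // 2-bit lane sums
              x = (x & 0x3333333333333333ULL)
                + ((x >> 2) & 0x3333333333333333ULL);        // 4-bit lane sums
              x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0FULL;    // 8-bit lane sums
              return (x * 0x0101010101010101ULL) >> 56;      // fold bytes into top byte
          }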

      Dedicated hardware is almost always better, but you can still get major improvements with a little elbow grease.

      [0] https://nimrod.blog/posts/algorithms-behind-popcount/

      • 15155 an hour ago

        This is extremely easy to implement in an FPGA.

    • eightysixfour an hour ago

      While fancy hardware would make it faster, what you are comparing it to is a bunch of floating-point multiplications on much larger numbers. I believe in this case they just use a lookup table:

      If either value is 0, the product is 0.

      If the signs differ, it is -1.

      If the signs match, it is 1.

      I'm sure those can be done with relatively few instructions on far less power-hungry hardware.
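
      (A sketch of that sign rule as a 3x3 table. One caveat: in BitNet b1.58 only the weights are ternary; activations are higher precision, so in practice the "multiply" reduces to keeping, negating, or skipping the activation value.)

          #include <cstdint>

          // Product of two ternary values in {-1, 0, 1}, indexed by
          // (a + 1, b + 1): zero wins, otherwise it's the sign rule.
          static const int8_t TERNARY_MUL[3][3] = {
              // b = -1   0   1
              {      1,   0, -1 },  // a = -1
              {      0,   0,  0 },  // a =  0
              {     -1,   0,  1 },  // a =  1
          };

          inline int8_t tmul(int8_t a, int8_t b) {
              return TERNARY_MUL[a + 1][b + 1];  // no multiplier, just an index
          }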

  • lostmsu 2 hours ago

    No GPU inference support?

    • diggan an hour ago

      > that support fast and lossless inference of 1.58-bit models on CPU (with NPU and GPU support coming next).

  • faragon an hour ago

    I'm glad Microsoft uses Bash in the example instead of their own Windows shells. As a user, I would like to have something like "Git Bash" built into Windows as the default shell.

    • not_a_bot_4sho 8 minutes ago

      WSL is where it's at today. It's not quite what you're asking for, as it is a separate virtual OS, but the integration is so tight that it feels like you're using your favorite shell natively in Windows.