81 comments

  • jjcm 6 hours ago

    1 bit with a FP16 scale factor every 128 bits. Fascinating that this works so well.

    I tried a few things with it. Got it driving Cursor, which in itself was impressive - it handled some tool usage. Via Cursor I had it generate a few web page tests.

    On a monte carlo simulation of pi, it got the logic correct but failed to build an interface to start the test. Requesting changes mostly worked, but left over some symbols which caused things to fail. Required a bit of manual editing.

    Tried a Simon Willison pelican as well - very abstract, not recognizable at all as a bird or a bicycle.

    Pictures of the results here: https://x.com/pwnies/status/2039122871604441213

    There doesn't seem to be a demo link on their webpage, so here's a llama.cpp running on my local desktop if people want to try it out. I'll keep this running for a couple hours past this post: https://unfarmable-overaffirmatively-euclid.ngrok-free.dev

    • najarvg 6 hours ago

      Thanks for sharing the link to your instance. It was blazing fast in responding. I tried throwing a few things at it, with the following results: 1. Generating an R script to take a city and country name, find its lat/long, and map it using ggmaps. It generated a pretty decent script (could be more optimal, but impressive for the model size), with warnings about using geojson if possible. 2. Generating a LaTeX script to display the Gaussian integral equation - it generated a (I think) non-standard version using probability distribution functions instead of the general version, but I still give it points for that. It gave explanations of the formula and parameters, as well as instructions on how to compile the script using bash etc. 3. Generating a LaTeX script to display the Euler identity equation - this one it nailed.

      Strongly agree that the knowledge density is impressive for a 1-bit model with such a small size and such blazing fast responses.

      • jjcm 6 hours ago

        > Was blazing fast in responding.

        I should note this is running on an RTX 6000 pro, so it's probably at the max speed you'll get for "consumer" hardware.

        • ineedasername 4 hours ago

          consumer hardware?

          That... pft. Nevermind, I'm just jealous

          • jjcm 3 hours ago

            Look it was my present to myself after the Figma IPO (worked there 5 years). If you want to feel less jealous, look at the stock price since then.

        • abrookewood 3 hours ago

          Holy hell ... that's a monster of a card

      • najarvg 6 hours ago

        I must add that I also tried the standard "should I walk or drive to the carwash 100 meters away for washing the car" question, and it made the usual error of suggesting a walk given the distance and health reasons etc. But then this does not claim to be a reasoning model and I did not expect, in the remotest case, for this to be answered correctly. Even previous-generation larger reasoning models struggle with this.

        • jjcm 6 hours ago

          I ran it through a rudimentary thinking harness, and it still failed, fwiw:

              The question is about the best mode of transportation to a car wash located 100 meters away. Since the user is asking for a recommendation, it's important to consider practical factors like distance, time, and convenience.
          
              Walking is the most convenient and eco-friendly option, especially if the car wash is within a short distance. It avoids the need for any transportation and is ideal for quick errands.
              Driving is also an option, but it involves the time and effort of starting and stopping the car, parking, and navigating to the location.
              Given the proximity of the car wash (100 meters), walking is the most practical and efficient choice. If the user has a preference or if the distance is longer, they can adjust accordingly.
    • adityashankar 6 hours ago

      here's the google colab link, https://colab.research.google.com/drive/1EzyAaQ2nwDv_1X0jaC5... since the ngrok link likely got ddosed by the number of individuals coming along

      • qingcharles 43 minutes ago

        Thanks, that works. I only tested the 1.7B. It has that original GPT3 feel to it. Hallucinates like crazy when it doesn't know something. For something that will fit on a GTX1080, though, it's solid.

        We're only a couple of years into optimization tech for LLMs. How many other optimizations are we yet to find? Just how small can you make a working LLM that doesn't emit nonsense? With the right math could we have been running LLMs in the 1990s?

      • jjcm 6 hours ago

        Good call. Right now though traffic is low (1 req per min). With the speed of completion I should be able to handle ~100x that, but if the ngrok link doesn't work defo use the google colab link.

        • adityashankar 6 hours ago

          The link didn't work for me personally, but that may be a bandwidth issue with me fighting for a connection in the EU

    • andai 4 hours ago

      Thanks. Did you need to use Prism's llama.cpp fork to run this?

      • jjcm 3 hours ago

        Yep.

        • andai 3 hours ago

          Could you elaborate on what you did to get it working? I built it from source, but couldn't get it (the 4B model) to produce coherent English.

          Sample output below (the model's response to "hi" in the forked llama-cli):

          X ( Altern as the from (.. Each. ( the or,./, and, can the Altern for few the as ( (. . ( the You theb,’s, Switch, You entire as other, You can the similar is the, can the You other on, and. Altern. . That, on, and similar, and, similar,, and, or in

          • freakynit 2 hours ago

            I have an older M1 Air with 8GB, but I'm still getting over 23 t/s on the 4B model, and the quality of outputs is on par with top models of similar size.

            1. Clone their forked repo: `git clone https://github.com/PrismML-Eng/llama.cpp.git`

            2. Then (assuming you already have xcode build tools installed):

              cd llama.cpp
              cmake -B build -DGGML_METAL=ON
              cmake --build build --config Release -j$(sysctl -n hw.logicalcpu)
            
            3. Finally, run it with (you can adjust arguments):

              ./build/bin/llama-server -m ~/Downloads/Bonsai-8B.gguf --port 80 --host 0.0.0.0 --ctx-size 0 --parallel 4 --flash-attn on --no-perf --log-colors on --api-key some_api_key_string
            
            Model was first downloaded from: https://huggingface.co/prism-ml/Bonsai-8B-gguf/tree/main
            • freakynit an hour ago

              To the author: why is this taking 4.56GB? I was expecting this to be under 1GB for a 4B model. https://ibb.co/CprTGZ1c

              And this is when I'm serving zero prompts - I've just loaded the model (using llama-server).

          • jjcm 2 hours ago

            I did this: https://image.non.io/2093de83-97f6-43e1-a95e-3667b6d89b3f.we...

            Literally just downloaded the model into a folder, opened cursor in that folder, and told it to get it running.

            Prompt: The gguf for bonsai 8b are in this local project. Get it up and running so I can chat with it. I don't care through what interface. Just get things going quickly. Run it locally - I have plenty of vram. https://huggingface.co/prism-ml/Bonsai-8B-gguf/tree/main

            I had to ask it to increase the context window size to 64k, but other than that it got it running just fine. After that I just told ngrok the port I was serving it on and voila.

    • rjh29 5 hours ago

      It reminds me of very early ChatGPT: mostly correct answers, but some nonsense. Given its speed, it might be interesting to run it through a 'thinking' phase where it double-checks its answers, and/or use search grounding, which would make it significantly more useful.

    • uf00lme 6 hours ago

      The speed is impressive. I wish it could be set up for something similar to speculative decoding.

    • abrookewood 3 hours ago

      man, that is really really quick. What is your desktop setup??? GPU?

    • pdyc 3 hours ago

      thanks, i tested it; it failed the strawberry test. qwen 3.5 0.8B, at a similar size, passes it and is far more usable.

      • selcuka 2 hours ago

        Interesting. Qwen 3.5 0.8B failed the test for me.

    • hmokiguess 6 hours ago

      wow that was cooler than I expected, curious to embed this for some lightweight semantic workflows now

  • simonw 2 hours ago

    You can run this model on an iPhone via the latest update to this Locally AI app: https://apps.apple.com/us/app/locally-ai-local-ai-chat/id674...

    For its size (1.2GB download) it's very impressive.

    Here's a pelican it drew me running on my phone - the SVG comments are good, the image not so much: https://tools.simonwillison.net/svg-render#%3Csvg%20width%3D...

    • voxelghost an hour ago

      Did you ask for a pelican with a bicycle, or was that just an added bonus?

  • freakynit 20 minutes ago

    Open access for the next 5 hours (8GiB model, running on an RTX 3090), or until the server crashes or this spot instance gets taken away :) =>

    https://ofo1j9j6qh20a8-80.proxy.runpod.net

      ./build/bin/llama-server \
       -m ../Bonsai-8B.gguf \
       -ngl 999 \
       --flash-attn on \
       --host 0.0.0.0 \
       --port 80 \
       --ctx-size 65500 \
       --batch-size 512 \
       --ubatch-size 512 \
       --parallel 5 \
       --cont-batching \
       --threads 8 \
       --threads-batch 8 \
       --cache-type-k q4_0 \
       --cache-type-v q4_0 \
       --log-colors on
    
    The server can serve 5 parallel requests, with each request capped at around `13K` tokens...

    Some quick benchmarks I did:

    1. Input: 700 tokens, ttfs: ~0 seconds, output: 1822 tokens at ~190 t/s

    2. Input: 6400+ tokens, ttfs: ~2 seconds, output: 2012 tokens at ~135 t/s

    VRAM usage was consistently at ~4GiB.

  • wild_egg 5 hours ago

    Don't have a GPU so tried the CPU option and got 0.6t/s on my old 2018 laptop using their llama.cpp fork.

    Then found out they didn't implement AVX2 for their Q1_0_g128 CPU kernel. Added that and getting ~12t/s which isn't shabby for this old machine.

    Cool model.

    • UncleOxidant 3 hours ago

      Are you getting anything besides gibberish out of it? I tried their recommended commandline and it's dog slow even though I built their llama.cpp fork with AVX2 enabled. This is what I get:

          $ ./build/bin/llama-cli     -hf prism-ml/Bonsai-8B-gguf -p "Explain quantum computing in simple terms." -n 256 --temp 0.5 --top-p 0.85 --top-k 20 -ngl 99
          > Explain quantum computing in simple terms.
      
           \( ,
      
            None ( no for the. (,./. all.2... the                                                                                                                                ..... by/
      
      
      EDIT: It runs fine in their colab notebook. Looking at that, you have to do `git checkout prism` (in the llama.cpp repo) before you build. That's a missing instruction if you're going straight to their fork of llama.cpp. Works fine now.
    • cubefox 3 hours ago

      "Not shabby" is a big understatement.

  • alyxya 7 hours ago

    I expect the trend for large machine learning models to go towards bits rather than operating on floats. There's a lot of inefficiency in floats: weights are typically something like normally distributed, which makes storage and computation wasteful when most values are clustered in a small range. The foundations of neural networks may be rooted in real-valued functions, which are simulated with floats, but float operations are just bitwise operations underneath. The only issue is that GPUs operate on floats and standard ML theory works over real numbers.

    • cubefox 3 hours ago

      > and standard ML theory works over real numbers.

      This paper uses binary numbers only, even for training, with a solid theoretical foundation: https://proceedings.neurips.cc/paper_files/paper/2024/file/7...

      TL;DR: They invent a concept called "Boolean variation" which is the binary analog to the Newton/Leibniz derivative. They are then able to do backpropagation directly in binary.

  • drob518 4 hours ago

    I’m really curious how this scales up. Bonsai delivers an 8B model in 1.15 GB. How large would a 27B or 35B model be? Would it still retain the accuracy of those large models? If the scaling holds, we could see 100+B models in 64 GB of RAM.
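
    A rough sketch of the arithmetic, assuming ~1.125 effective bits per weight (1 bit plus a shared FP16 scale per 128-group) and ignoring embeddings and other full-precision tensors, so real files would come out somewhat larger:

```python
# Hypothetical back-of-envelope, not official numbers: file size at
# ~1.125 bits/weight, ignoring embeddings and other fp16 tensors.
def approx_gib(params_billion, bits_per_weight=1.125):
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30

for n in (8, 27, 35, 100):
    print(f"{n}B -> ~{approx_gib(n):.1f} GiB")
```

    By this estimate a 100B model would be around 13 GiB of weights, which is roughly consistent with the 8B model's 1.15 GB file.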

    • cubefox 3 hours ago

      Also depends on how expensive training these models is. It's probably at least as expensive as full precision models, otherwise they would have mentioned it.

  • andai 4 hours ago

    Does anyone know how to run this on CPU?

    Do I need to build their llama.cpp fork from source?

    Looks like they only offer CUDA options in the release page, which I think might support CPU mode but refuses to even run without CUDA installed. Seems a bit odd to me, I thought the whole point was supporting low end devices!

    Edit: 30 minutes of C++ compile time later, I got it running. Although it uses 7GB of RAM then hangs at Loading model. I thought this thing was less memory hungry than 4 bit quants?

    Edit 2: Got the 4B version running, but at 0.1 tok/s and the output seemed to be nonsensical. For comparison I can run, on the same machine, qwen 3.5 4B model (at 4 bit quant) correctly and about 50x faster.

  • andai 3 hours ago

    The site says 14x less memory usage. I'm a bit confused about that situation. The model file is indeed very small, but on my machine it used roughly the same RAM as 4 bit quants (on CPU).

    Though I couldn't get actual English output from it, so maybe something went wrong while running it.

  • _fw 7 hours ago

    What’s the trade-off? If it’s smaller, faster and more efficient - is the performance worse? A layman here, curious to know.

    • kvdveer 7 hours ago

      Their own (presumably cherry-picked) benchmarks put their models near the 'middle of the market' models (llama3 3b, qwen3 1.7b), not competing with Claude, ChatGPT, or Gemini. These are not models you'd want to directly interact with, but they can be very useful for things like classification or simple summarization or translation tasks.

      These models are quite impressive for their size: even an older Raspberry Pi would be able to handle them.

      There's still a lot of use for this kind of model.

    • adityashankar 6 hours ago

      If you look at their whitepaper (https://github.com/PrismML-Eng/Bonsai-demo/blob/main/1-bit-b...) you'll notice that it does have some tradeoffs due to model intelligence being reduced (page 10)

      The average of MMLU Redux, MuSR, GSM8K, HumanEval+, IFEval, and BFCLv3 for this model is 70.5, compared to 79.3 for Qwen3. That being said, the model is also 16x smaller and 6x faster on a 4090... so it is a pretty respectable tradeoff.

      I'd be interested in fine tuning code here personally

  • ggamezar an hour ago

    Misses a comparison with qwen 3.5, though it mentions qwen 3. Is there a reason why?

  • plombe 4 hours ago

    Interesting post. Curious to know how they arrived at intelligence density = Negative log of the model's error rate divided by the model size.
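
    As written, it would be something like this (my reading of their description; the log base and size units are guesses on my part):

```python
import math

# -log(error_rate) / model_size: a model is "denser" if it makes fewer
# errors per unit of size. Log base and size units are assumptions here.
def intelligence_density(error_rate, size_gb):
    return -math.log(error_rate) / size_gb

# e.g. 20% error in 1.15 GB vs 10% error in 16 GB:
print(intelligence_density(0.20, 1.15))   # ≈ 1.40
print(intelligence_density(0.10, 16.0))   # ≈ 0.14
```

    The negative log means density rewards driving the error rate toward zero, while the division penalizes size linearly.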

  • kent8192 4 hours ago

    Oh, boy. This good tool hates my LM Studio... The following message appears when I run Bonsai in LM Studio. I think something in my settings has gone wrong:

        Failed to load the model
        Error loading model. (Exit code: null). Please check the settings and try loading the model again.

    • dodos an hour ago

      Same issue here, wanted to give it a shot but ran into that error trying to load the model in lm studio.

    • liuliu 4 hours ago

      It needs an MLX fork, because the lowest bit-width in MLX is currently 2 (for affine quantization).

  • Archit3ch 7 hours ago

    Doesn't Jevons paradox dictate larger 1-bit models?

    • wmf 5 hours ago

      Yeah, hopefully they release >100B models.

  • ycui1986 2 hours ago

    i hope someone does a 100b-parameter 1-bit model. that should fit into most 16GB graphics cards. local AI, democratized.

  • bilsbie 5 hours ago

    I can’t see how this is possible. You’re losing so much information.

    • MarsIronPI 4 hours ago

      It's because they're natively trained with 1 bit, so it's not losing anything. Now, the question might be how they manage to get decent predictive performance with so little precision. That I don't know.

      • syntaxpr an hour ago

        Not training. They transpose rows/columns of matrices to group 128 parameters with a similar (shared) scale factor. It's a Qwen-3 model.

    • txrx0000 3 hours ago

      I always remind myself and everyone else that human DNA is "only" 1.6 GB of data, and yet it encodes all of the complex systems of the human body, including the brain, and can replicate itself. Our intuitive feel for how much stuff can be packed into how many bits is probably way off from the true limits of physics.

      • kennywinker 44 minutes ago

        And anybody who’s ever met a baby can tell you, they score very poorly on most llm benchmarks.

  • syntaxing 7 hours ago

    Super interesting, building their llama cpp fork on my Jetson Orin Nano to test this out.

  • keyle 6 hours ago

    Extremely cool!

    Can't wait to give it a spin with ollama; if ollama listed it as a model, that would be helpful.

  • ariwilson 6 hours ago

    Very cool and works pretty well!

    • onlyrealcuzzo 5 hours ago

      I'm fascinated by these smaller models.

      The amount of progress they've been making is incredible.

      Is anyone following this space more closely? Is anyone predicting performance at certain parameter sizes will plateau soon?

      Unlike the frontier models, these don't seem to be showing many signs of slowing down.

  • yodon 8 hours ago

    Is Bonsai 1 Bit or 1.58 Bit?

    • woadwarrior01 7 hours ago

      1-bit g128 with a shared 16-bit scale for every group. So, effectively 1.125 bit.
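
      Back-of-envelope on where the 1.125 comes from (my own sketch, not from their paper):

```python
# 1-bit weights in groups of 128, each group sharing one FP16 (16-bit) scale.
group_size = 128
payload_bits = group_size * 1          # one sign bit per weight
scale_bits = 16                        # the shared FP16 scale
bits_per_weight = (payload_bits + scale_bits) / group_size
print(bits_per_weight)  # 1.125
```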

    • mchusma 3 hours ago

      I was excited about the 1.58-bit models from a year or two ago, but they never seemed to go anywhere. Curious in particular how this scales up.

    • NooneAtAll3 an hour ago

      1 bit or 1 trit*

  • hatthew 6 hours ago

    I feel like it's a little disingenuous to compare against full-precision models. Anyone concerned about model size and memory usage is surely already using at least an 8 bit quantization.

    Their main contribution seems to be hyperparameter tuning, and they don't compare against other quantization techniques of any sort.

  • marak830 5 hours ago

    It's been a hell of a morning for llama heads - first this, then the claude drop and turboquant.

    I'm currently setting this one up. If it works well with a custom LoRA on top, I'll be able to run two at once for my custom memory management system :D

  • OutOfHere 7 hours ago

    How do I run this on Android?

    • najarvg 6 hours ago

      Pocket Pal is what I've seen used before. Although recently heard about "Off Grid" but not read any reviews about it or tried it personally so caveat emptor. Will see if the community has other suggestions

  • stogot 7 hours ago

    What is the value of a 1 bit? For those that do not know.

    • jacquesm 7 hours ago

      That you can process many operations with a single instruction.

    • SwellJoe 7 hours ago

      0 or 1

      • jjcm 6 hours ago

        Technically not in this case, or not effectively. Each 0 or 1 corresponds to an FP16 scaling factor shared by its group of 128 bits, and that factor fluctuates from group to group.
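
        A toy illustration of how I understand the dequant (the 1 -> +scale, 0 -> -scale mapping is my assumption, not necessarily what the actual kernel does):

```python
# Toy dequantization sketch: each stored bit picks +scale or -scale, with
# one (nominally FP16) scale shared per group of 128 weights.
GROUP = 128
bits = [1, 0] * GROUP                 # 256 packed 1-bit weights
scales = [0.03, 0.07]                 # one scale per group of 128

weights = [
    (1.0 if b else -1.0) * scales[i // GROUP]
    for i, b in enumerate(bits)
]
print(weights[:2], weights[-2:])  # [0.03, -0.03] [0.07, -0.07]
```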

    • fgfarben 4 hours ago

      I can port it to an FPGA and so can you.

    • trebligdivad 7 hours ago

      Speed and density.

  • zephyrwhimsy 5 hours ago

    Cursor and similar AI-native IDEs are interesting not because of the AI itself, but because they demonstrate that the IDE paradigm is not settled. There is room for fundamental rethinking of how developers interact with codebases.