43 comments

  • djoldman 2 hours ago

    https://arxiv.org/abs/2410.00907

    ABSTRACT

    Large neural networks spend most computation on floating point tensor multiplications. In this work, we find that a floating point multiplier can be approximated by one integer adder with high precision. We propose the linear-complexity multiplication (L-Mul) algorithm that approximates floating point number multiplication with integer addition operations. The new algorithm costs significantly less computation resource than 8-bit floating point multiplication but achieves higher precision. Compared to 8-bit floating point multiplications, the proposed method achieves higher precision but consumes significantly less bit-level computation. Since multiplying floating point numbers requires substantially higher energy compared to integer addition operations, applying the L-Mul operation in tensor processing hardware can potentially reduce 95% energy cost by elementwise floating point tensor multiplications and 80% energy cost of dot products. We calculated the theoretical error expectation of L-Mul, and evaluated the algorithm on a wide range of textual, visual, and symbolic tasks, including natural language understanding, structural reasoning, mathematics, and commonsense question answering. Our numerical analysis experiments agree with the theoretical error estimation, which indicates that L-Mul with 4-bit mantissa achieves comparable precision as float8 e4m3 multiplications, and L-Mul with 3-bit mantissa outperforms float8 e5m2. Evaluation results on popular benchmarks show that directly applying L-Mul to the attention mechanism is almost lossless. We further show that replacing all floating point multiplications with 3-bit mantissa L-Mul in a transformer model achieves equivalent precision as using float8 e4m3 as accumulation precision in both fine-tuning and inference.
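
    (Not the paper's exact L-Mul kernel, but a minimal sketch of the underlying idea, assuming positive, normal float32 inputs: a float's bit pattern is roughly a fixed-point log2 of its value, so adding two bit patterns and subtracting the exponent bias approximates a multiply with a single integer addition.)

        import struct

        def as_bits(x):
            # Reinterpret a positive float32 as its raw 32-bit integer pattern.
            return struct.unpack('<I', struct.pack('<f', x))[0]

        def from_bits(b):
            # Reinterpret a 32-bit integer pattern as a float32.
            return struct.unpack('<f', struct.pack('<I', b))[0]

        BIAS = 127 << 23  # 0x3F800000: the float32 exponent bias, shifted into place

        def approx_mul(x, y):
            # One integer addition approximates x * y for positive, normal floats.
            return from_bits(as_bits(x) + as_bits(y) - BIAS)

        for x, y in [(1.37, 2.41), (0.05, 31.0), (3.14159, 2.71828)]:
            exact, approx = x * y, approx_mul(x, y)
            print(f"{x} * {y}: exact={exact:.5f} approx={approx:.5f} "
                  f"rel_err={(approx - exact) / exact:+.2%}")

    The error comes from dropping the mantissa cross term; as I read the abstract, the 2^-l(m) offset and the reduced-width mantissas are refinements on top of this basic trick.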

    • onlyrealcuzzo an hour ago

      Does this mean you can train efficiently without GPUs?

      Presumably there will be a lot of interest.

      • crazygringo an hour ago

        No. But it does potentially mean that either current or future-tweaked GPUs could run a lot more efficiently -- meaning much faster or with much less energy consumption.

        You still need the GPU parallelism though.

        • fuzzfactor an hour ago

          I had a feeling it had to be something like massive waste due to a misguided feature of the algorithms that shouldn't have been there in the first place.

          Once the "math is done" quite likely it would have paid off better than most investments for the top people to have spent a few short years working with grossly underpowered hardware until they could come up with amazing results there before scaling up. Rather than grossly overpowered hardware before there was even deep understanding of the underlying processes.

          When you think about it, what we have seen from the latest ultra-high-powered "thinking" machines is truly so impressive. But if you are trying to fool somebody into believing that it's a real person it's still not "quite" there.

          Maybe a good benchmark would be to take a regular PC, and without reliance on AI just pull out all the stops and put all the effort into fakery itself. No holds barred, any trick you can think of. See what the electronics is capable of this way. There are some smart engineers, this would only take a few years but looks like it would have been a lot more affordable.

          Then with the same hardware if an AI alternative is not as convincing, something has got to be wrong.

          It's good to find out this type of thing before you go overboard.

          Regardless of speed or power, I never could have gotten an 8-bit computer to match the output of a 32-bit floating-point algorithm by using floating point myself. Integers all the way, and place the decimal point where it's supposed to be when you're done.

          Once it's really figured out, how do you think it will feel to have been the one paying the electric bills up until now?

          • jimmaswell 15 minutes ago

            Faster progress was absolutely worth it. Spending years agonizing over theory to save a bit of electricity would have been a massive disservice to the world.

  • kayo_20211030 2 hours ago

    Extraordinary claims require extraordinary evidence. Maybe it's possible, but consider that some really smart people, in many different groups, have been working diligently in this space for quite a while; so a claim of 95% savings on energy costs _with equivalent performance_ is in the extraordinary category. Of course, we'll see when the tide goes out.

    • throwawaymaths an hour ago

      I don't think this claim is extraordinary. Nothing proposed is mathematically impossible or even unlikely -- it's just a pain in the ass to test (lots of retraining, fine-tuning, etc., and those operations are expensive when you don't already have massively parallel hardware available; otherwise you're building an ASIC/FPGA for something with a huge investment risk).

      If I could have a SWAG at it, I would say a low-resolution model like llama-2 would probably be just fine (llama-2 quantizes without too much headache), but a higher-resolution model like llama-3 probably not so much -- not without massive retraining, anyway.

    • manquer an hour ago

      It is a clickbait headline; the claim itself is not extraordinary. The arXiv preprint was posted here some time back.

      The 95% figure applies specifically to multiplication operations; inference is compute-light and memory-heavy in the first place, so the actual end-to-end gains would be far smaller.

      Tech journalism (all journalism, really) can hardly be trusted to publish grounded news, given the focus on clicks and revenue they need to survive.

    • vlovich123 2 hours ago

      They’ve been working on unrelated problems, like the structure of the network or how to build networks with better results. There have been people working on improving the efficiency of the low-level math operations, and this is the culmination of that work. Figuring this stuff out isn’t super easy.

    • Randor an hour ago

      Energy savings claims of up to ~70% can be verified. The inference implementation is here:

      https://github.com/microsoft/BitNet

      • kayo_20211030 an hour ago

        I'm not an AI person, in any technical sense. The savings being claimed, and I assume verified, are on ARM and x86 chips. The piece doesn't mention swapping multiplication for addition, and a 1-bit LLM is, well, a 1-bit LLM.

        Also,

        > Additionally, it reduces energy consumption by 55.4% to 70.0%

        With humility, I don't know what that means. It seems like some dubious math with percentages.

        • Randor 42 minutes ago

          > I don't know what that means. It seems like some dubious math with percentages.

          I would start by downloading a 1.58-bit model such as: https://huggingface.co/HF1BitLLM/Llama3-8B-1.58-100B-tokens

          Run the non-quantized version of the model on your 3090/4090 GPU and observe the power draw. Then load the 1.58-bit model and observe the power usage. Sure, the numbers cover a wide range, because there are many GPUs/NPUs on which to make the comparison.
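
          If you want to watch the draw yourself, one rough way (assuming an NVIDIA card with nvidia-smi on the PATH) is to poll the reported board power while each model generates, e.g.:

              import subprocess, time

              def gpu_power_watts():
                  # Ask nvidia-smi for the current board power draw in watts.
                  out = subprocess.check_output(
                      ["nvidia-smi", "--query-gpu=power.draw",
                       "--format=csv,noheader,nounits"], text=True)
                  return float(out.strip().splitlines()[0])

              # Poll once a second while the model generates in another process.
              samples = []
              for _ in range(60):
                  samples.append(gpu_power_watts())
                  time.sleep(1)
              print(f"mean power over {len(samples)} samples: "
                    f"{sum(samples) / len(samples):.1f} W")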

    • kayo_20211030 an hour ago

      re: all above/below comments. It's still an extraordinary claim.

      I'm not claiming it's not possible, nor am I claiming that it's not true, or, at least, honest.

      But there will need to be evidence that equivalent performance is achievable on real machines using real energy. A defense that "there are no suitable chips" is a bit disingenuous. If the 95% savings actually has legs, some smart chip manufacturer will do the math and make the chips. If it's correct, that chip-making firm will make a fortune. If it's not, they won't.

    • stefan_ 35 minutes ago

      I mean, all these smart people would rather pay NVIDIA all their money than make AMD viable. And yet they tell us it's all MatMul.

      • kayo_20211030 3 minutes ago

        Both companies are doing pretty well. Why don't you think AMD is viable?

  • remexre 3 hours ago

    Isn't this just taking advantage of "log(x) + log(y) = log(xy)"? The IEEE754 floating-point representation stores floats as sign, mantissa, and exponent -- ignore the first two (you quantized anyway, right?), and the exponent is just an integer storing the log() of the float.

    • mota7 2 hours ago

      Not quite: It's taking advantage of (1+a)(1+b) = 1 + a + b + ab. And where a and b are both small-ish, ab is really small and can just be ignored.

      So it turns (1+a)(1+b) into 1+a+b. Which is definitely not the same! But it turns out, machine guessing apparently doesn't care much about the difference.
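
      To put a rough number on the dropped ab term (a brute-force check that models the usual carry into the exponent when the mantissa sum overflows, but not the paper's extra offset):

          # Relative error of approximating (1+a)(1+b) by dropping ab.
          # When a + b reaches 1, the carry bumps the exponent, so the
          # approximate result becomes 2*(a+b) instead of 1+a+b.
          worst = 0.0
          steps = 400
          for i in range(steps + 1):
              for j in range(steps + 1):
                  a, b = i / steps, j / steps
                  exact = (1 + a) * (1 + b)
                  approx = (1 + a + b) if a + b < 1 else 2 * (a + b)
                  worst = max(worst, abs(exact - approx) / exact)
          print(f"worst-case relative error: {worst:.1%}")  # about 11%, near a + b = 1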

      • tommiegannert an hour ago

        Plus the 2^-l(m) correction term.

        Feels like multiplication shouldn't be needed for convergence, just monotonicity? I wonder how well it would perform if the model were actually trained the same way.

      • amelius an hour ago

        You might as well then replace the multiplication with addition in the original network. In that case you're not even approximating anything.

        Am I missing something?

    • convolvatron 3 hours ago

      yes. and the next question is 'ok, how do we add'

      • kps 2 hours ago

        Yes. I haven't yet read this paper to see exactly what it says is new, but I've definitely seen log-based representations under development before now. (More log-based than the regular floating-point exponent, that is. I don't actually know the argument behind the exponent-and-mantissa form that's been pretty much universal since before IEEE754, other than that it mimics decimal scientific notation.)

      • dietr1ch 2 hours ago

        I guess that if the bulk of the computation goes into the multiplications, you can work in log-space and simply add; when the time comes to actually do a sum in the original space, you convert back and sum there.
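
        As a toy sketch of that split (positive values only, so the logs are defined): the per-element multiplies become additions of logs, and only the final accumulation converts back to linear space.

            import math

            def log_space_dot(xs, ys):
                # Multiplies become adds of logs; only the final accumulation
                # converts back to linear space. Positive inputs only.
                log_products = [math.log2(x) + math.log2(y) for x, y in zip(xs, ys)]
                return sum(2.0 ** lp for lp in log_products)

            xs, ys = [0.5, 1.25, 3.0], [2.0, 0.8, 1.5]
            print(log_space_dot(xs, ys), sum(x * y for x, y in zip(xs, ys)))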

        • a-loup-e 2 hours ago

          Not sure how well that would work if you're adding a bias after every layer.

  • quantadev 8 minutes ago

    I wonder if someone has fed this entire "problem" into the latest ChatGPT o1 (the new model with reasoning capability) -- just given it all the code for a multilayer perceptron and the task/prompt of finding ways to implement the same network using only integer operations.

    Surely even the OpenAI devs must have done this the minute they finished training that model, right? I wonder if they'd even admit it was an AI that came up with the solution, rather than just publishing it and taking credit. Haha.

  • panosv 37 minutes ago

    Lemurian Labs looks like it's doing something similar: https://www.lemurianlabs.com/technology They use the Logarithmic Number System (LNS)

  • _aavaa_ 2 hours ago

    Original discussion of the preprint: https://news.ycombinator.com/item?id=41784591

  • idiliv an hour ago

    Duplicate, posted on October 9: https://news.ycombinator.com/item?id=41784591

  • asicsarecool an hour ago

    Don't assume this isn't already in place at the main AI companies

  • syntaxing 2 hours ago

    I’m looking forward to BitNet adoption. MS just released a tool for it, similar to llama.cpp. Really hoping major models get retrained for it.

  • andrewstuart 27 minutes ago

    Here is the Microsoft implementation:

    https://github.com/microsoft/BitNet

  • hello_computer an hour ago

    How does this differ from Cussen & Ullman?

    https://arxiv.org/abs/2307.01415

  • robomartin 2 hours ago

    I posted this about a week ago:

    https://news.ycombinator.com/item?id=41816598

    This has been done for decades in digital circuits, FPGAs, digital signal processing, etc. Floating point is both resource- and power-intensive, and using FP without dedicated FP hardware is something that has been avoided, and done without, for decades unless absolutely necessary.
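
    For anyone unfamiliar, the classic DSP version of this is fixed-point arithmetic, e.g. Q15: keep values as scaled integers, multiply with the integer hardware, and shift to put the binary point back. A rough sketch (the Q15 format choice is arbitrary here):

        Q = 15  # Q15: 1 sign bit, 15 fractional bits, values in [-1, 1)

        def to_q15(x):
            # Scale a real value into a 16-bit integer representation.
            return int(round(x * (1 << Q)))

        def q15_mul(a, b):
            # Integer multiply, then shift to restore the binary point.
            return (a * b) >> Q

        def from_q15(v):
            return v / (1 << Q)

        a, b = to_q15(0.6), to_q15(-0.25)
        print(from_q15(q15_mul(a, b)), 0.6 * -0.25)  # approximately equal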

    • fidotron 18 minutes ago

      Right, the ML people are learning, slowly, about the importance of optimizing for silicon simplicity, not just reduction of symbols in linear algebra.

      Their rediscovery of fixed point was bad enough, but the "omg, if we represent poses as quaternions everything works better" makes any game engine dev of the last 30 years explode.

    • ausbah an hour ago

      A lot of things in the ML research space are an old concept rebranded with a new name as "novel".

    • ujikoluk an hour ago

      Explain more for the uninitiated please.

  • andrewstuart 2 hours ago

    The ultimate “you’re doing it wrong”.

    For the sake of the climate and environment, it would be nice if it were true.

    Bad news for Nvidia. “Sell your stock” bad.

    Does it come with a demonstration?

    • mouse_ an hour ago

      Hypothetically, if this is as true and simple as the headline implies -- AI using 95% less power doesn't mean AI will use 95% less power; it means we will do 20x more AI. As long as it's the current fad, we will throw as much power and resources at this as we can physically produce, because our economy depends on constant, accelerating growth.

    • talldayo an hour ago

      > Bad news for Nvidia. “Sell your stock” bad.

      People say this but then the fastest and most-used implementation of these optimizations is always written in CUDA. If this turns out to not be a hoax, I wouldn't be surprised to see Nvidia prices jump in correlation.

  • DesiLurker an hour ago

    Validity of the claim aside, why don't they say it reduces energy by 20x instead of by 95%? A ratio gives a much better sense of scale when the remaining fraction is tiny.