54 comments

  • simonw 3 hours ago

    It turns out someone has written a plugin for my LLM CLI tool already: https://github.com/irthomasthomas/llm-cerebras

    You need an API key - I got one from https://cloud.cerebras.ai/ but I'm not sure if there's a waiting list at the moment - then you can do this:

        pipx install llm # or brew install llm or uv tool install llm
        llm install llm-cerebras
        llm keys set cerebras
        # paste key here
    
    Then you can run lightning fast prompts like this:

        llm -m cerebras-llama3.1-70b 'an epic tail of a walrus pirate'
    
    Here's a video of that running, it's very speedy: https://static.simonwillison.net/static/2024/cerebras-is-fas...
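
    The same model can also be driven from Python once the plugin is installed - a minimal sketch using llm's Python API (the model ID is the one from the CLI example above):

        import llm

        # grab the model registered by the llm-cerebras plugin
        model = llm.get_model("cerebras-llama3.1-70b")

        # the key stored via `llm keys set cerebras` should be picked up automatically
        response = model.prompt("an epic tail of a walrus pirate")
        print(response.text())
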
    • croes 27 minutes ago

      It has a waiting list

    • londons_explore 2 hours ago

      The "AI overview" in google search seems to be a similar speed, and the resulting text of similar quality.

      • simonw 2 hours ago

        I wonder which of their models they use. Might even be Gemini 1.5 Flash 8B which is VERY quick.

        I just tried that out with the same prompt and it's fast, but not as fast as Cerebras: https://static.simonwillison.net/static/2024/gemini-flash-8b...

        • londons_explore 2 hours ago

          I suspect it is its own model. Running it on 10B+ user queries per day you're gonna want to optimize everything you can about it - so you'd want something really optimized to the exact problem rather than using a general purpose model with careful prompting.

  • fancyfredbot 6 minutes ago

    Wow, software is hard! Imagine an entire company working to build an insanely huge and expensive wafer scale chip and your super smart and highly motivated machine learning engineers get 1/3 of peak performance on their first attempt. When people say NVIDIA has no moat I'm going to remember this - partly because it does show that they do, and partly because it shows that with time the moat can probably be crossed...

  • maz1b an hour ago

    Cerebras has really impressed me with their technical depth and their approach in the modern LLM era. I hope they do well, as I've heard they are en route to an IPO. It will be interesting to see if they can make a dent vs NVIDIA and other players in this space.

    • madaxe_again 24 minutes ago

      Apparently so. You can also buy in via various PE outfits before IPO, if you so desire. I did.

  • GavCo 2 hours ago

    When Meta releases the quantized 70B it will give another > 2X speedup with similar accuracy: https://ai.meta.com/blog/meta-llama-quantized-lightweight-mo...

  • obviyus 2 hours ago

    Wonder if they'll eventually release Whisper support. Groq has been great for transcribing 1hr+ calls at a significantly lower price than OpenAI ($0.04/hr vs. $0.36/hr).

    • BrunoJo 7 minutes ago

      https://Lemonfox.ai is another alternative to OpenAI's Whisper API if you need support for word-level timestamps and diarization.

    • Arn_Thor 32 minutes ago

      Whisper runs so well locally on any hardware I’ve thrown at it, why run it in the cloud?
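
      For anyone who hasn't tried it locally, a minimal sketch with the openai-whisper Python package (model size and file path are just placeholders):

          import whisper

          # "small" is a placeholder; pick the largest model your hardware handles
          model = whisper.load_model("small")

          # transcribe a local audio file (the path is hypothetical)
          result = model.transcribe("call_recording.mp3")
          print(result["text"])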

      • swores 28 minutes ago

        Does it run well on CPU? I've used it locally but only with my high end (consumer/gaming) GPU, and haven't got round to finding out how it does on weaker machines.

      • obviyus 25 minutes ago

        That's pretty much exactly how I started. Ran whisper.cpp locally for a while on a 3070Ti. It worked quite well when n=1.

        For our use case, we may get 1 audio file at a time, we may get 10. Of course queuing them is possible but we decided to prioritize speed & reliability over self hosting.

  • asabla 3 hours ago

    Damn, those are some impressive speeds.

    At that rate it doesn't matter if the first try resulted in an unwanted answer; you'll be able to run it once or twice more in quick succession.

    I hope their hardware stays relevant as this field continues to evolve.

    • tjoff 3 hours ago

      The biggest time sink for me is validating answers so not sure I agree on that take.

      Fast iteration is a killer feature, for sure, but at this time I'd rather focus on quality for the effort to be worthwhile.

      • vineyardmike 2 hours ago

        If you're using an LLM as a compressed version of a search index, you'll be constantly fighting hallucinations. Respectfully, you're not thinking big-picture enough.

        There are LLMs today that are amazing at coding, and when you allow it to iterate (eg. respond to compiler errors), the quality is pretty impressive. If you can run an LLM 3x faster, you can enable a much bigger feedback loop in the same period of time.
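
        To make that loop concrete, here's a minimal sketch using the llm library mentioned upthread; the model name, the prompts, and the cc invocation are all illustrative, not anything Cerebras ships:

            import subprocess
            import tempfile

            import llm

            # placeholder model ID; any llm-registered model works here
            model = llm.get_model("cerebras-llama3.1-70b")
            prompt = "Write a small C program that prints the first 10 primes."

            for attempt in range(3):
                code = model.prompt(prompt).text()
                with tempfile.NamedTemporaryFile(suffix=".c", mode="w", delete=False) as f:
                    f.write(code)
                build = subprocess.run(["cc", f.name, "-o", "/tmp/primes"],
                                       capture_output=True, text=True)
                if build.returncode == 0:
                    break  # it compiles; stop iterating
                # feed the compiler errors back in and let the model try again
                prompt = f"This C code:\n{code}\nfailed to compile with:\n{build.stderr}\nPlease fix it."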

        There are efforts to enable LLMs to "think" by using chain-of-thought, where the LLM writes out its reasoning in a "proof"-style list of steps. Sometimes, like a person, they'll reach a dead end, logic-wise. If you can run 3x faster, you can start to run the "thought chain" as more of a "tree", where the logic is critiqued and adapted, and where many different solutions can be tried. This can all happen in parallel (well, for each sub-branch).
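
        And a toy sketch of the branching idea - generate_branch and critique are stand-ins for LLM calls, not any real API:

            import concurrent.futures

            def generate_branch(problem, seed):
                # stand-in for one chain-of-thought attempt by the LLM
                return f"reasoning path {seed} for: {problem}"

            def critique(branch):
                # stand-in for a scoring pass (could itself be an LLM call)
                return len(branch)  # toy score

            def tree_of_thought(problem, n_branches=4):
                # faster inference -> more branches explored in the same wall-clock time
                with concurrent.futures.ThreadPoolExecutor() as pool:
                    branches = list(pool.map(lambda s: generate_branch(problem, s),
                                             range(n_branches)))
                return max(branches, key=critique)

            print(tree_of_thought("prove that the sum of two odd numbers is even"))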

        Then there are "agent" use cases, where an LLM has to take actions on its own in response to real-world situations. Speed really impacts user-perception of quality.

        • phito 25 minutes ago

          > There are LLMs today that are amazing at coding, and when you allow it to iterate (eg. respond to compiler errors), the quality is pretty impressive. If you can run an LLM 3x faster, you can enable a much bigger feedback loop in the same period of time.

          Well, now the compiler is the bottleneck, isn't it? And you would still need a human to check for bugs that aren't caught by the compiler.

          Still nice to have inference speed improvements tho.

        • tjoff 2 hours ago

          If the speed is used to get better quality with no more input from the user then sure, that is great. But that is not the only way to get better quality (though I agree that there are some low hanging fruit in the area).

        • OhNoNotAgain_99 2 hours ago

          To be honest, most LLMs are reasonable at coding, but they're not great. Sure, they can code small stuff, but they can't refactor large software projects or upgrade them.

          • regularfry an hour ago

            Upgrading large Java projects is exactly what AWS want you to believe their tooling can do, but the ergonomics aren't great.

            I think most of the capability problems with coding agents aren't the AI itself, it's that we haven't cracked how to let them interact with the codebase effectively yet. When I refactor something, I'm not doing it all at once, it's a step by step process. None of the individual steps are that complicated. Translating that over to an agent feels like we just haven't got the right harness yet.

      • jeswin 3 hours ago

        > The biggest time sink for me is validating answers so not sure I agree on that take.

        But you're assuming that it'll always be validated by humans. I'd imagine that most validation (and subsequent processing, especially going forward) will be done on machines.

        • croes 24 minutes ago

          And who validates the validation?

          • exe34 a minute ago

            the compiler/interpreter are assumed to work in this scenario.

        • tjoff 2 hours ago

          If that is the way to get quality, sure.

          Otherwise I feel that power consumption is the bigger issue than speed, though in this case they are interlinked.

          • threatripper 2 hours ago

            Humans consume a lot of power and resources.

            • croes 24 minutes ago

              The basic efficiency is pretty high.

        • yunohn 2 hours ago

          How does the next machine/LLM know what’s valid or not? I don’t really understand the idea behind layers of hallucinating LLMs.

          • ben_w 2 hours ago

            By comparison with reality. The initial LLMs had "reality" be "a training set of text"; when ChatGPT came out everyone rapidly expanded into RLHF (reinforcement learning from human feedback), and now that there are vision and text models, the training and feedback are grounded on a much broader aspect of reality than just text.

            • croes 22 minutes ago

              Given that there are more and more AI-generated texts and pictures, that grounding will be pretty unreliable.

            • yunohn 2 hours ago

              Could you link to a paper or working POC that shows how this “turtles all the way down“ solution works?

              • ben_w 34 minutes ago

                I don't understand your question.

                This isn't turtles all the way down, it's grounded in real world data, and increasingly large varieties of it.

                • croes 21 minutes ago

                  How does the AI know it’s reality and not a fake image or text fed to the system?

      • croes 26 minutes ago

        Exactly, validating and rewriting the prompt are the real time consuming tasks.

  • neals 8 minutes ago

    So what is inference?

  • majke 26 minutes ago

    I wonder if there is a token/watt metric. AFAIU Cerebras uses plenty of power/cooling.

    • accrual a minute ago

      I found this on their product page, though just for peak power:

      > At 16 RU, and peak sustained system power of 23kW, the CS-3 packs the performance of a room full of servers into a single unit the size of a dorm room mini-fridge.

      It's pretty impressive looking hardware.

      https://cerebras.ai/product-system/
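
      Token/watt really only makes sense as tokens per joule: sustained throughput (tokens/s) divided by power draw (W). At the quoted 23 kW, a purely illustrative 1,000 tokens/s would come out to roughly 0.04 tokens per joule, i.e. about 23 J per token.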

  • odo1242 2 hours ago

    What made it so much faster based on just a software update?

    • anon291 2 hours ago

      Ex-Cerebras engineer here. The chip is very powerful and there is no 'one way' to do things. Rearchitecting data flow, changing up data layout, etc. can lead to significant performance improvements. That's just my informed speculation; there's likely more perf somewhere.

    • campers 2 hours ago

      > The first implementation of inference on the Wafer Scale Engine utilized only a fraction of its peak bandwidth, compute, and IO capacity. Today’s release is the culmination of numerous software, hardware, and ML improvements we made to our stack to greatly improve the utilization and real-world performance of Cerebras Inference.

      > We’ve re-written or optimized the most critical kernels such as MatMul, reduce/broadcast, element wise ops, and activations. Wafer IO has been streamlined to run asynchronously from compute. This release also implements speculative decoding, a widely used technique that uses a small model and large model in tandem to generate answers faster.
    • germanjoey 2 hours ago

      They said in the announcement that they've implemented speculative decoding, so that might have a lot to do with it.

      A big question is what they're using as their draft model; there's ways to do it losslessly, but they could also choose to trade off accuracy for a bigger increase in speed.

      It seems they also only support a very short sequence length (1k tokens).

      • bubblethink 2 hours ago

        Speculative decoding does not trade off accuracy. You reject the speculated tokens if the original model does not accept them, kind of like branch prediction. All these providers and third parties benchmark each other's solutions, so if there is a drop in accuracy, someone will report it. Their sequence length is 8k.
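
        A toy sketch of that accept/reject step - draft_next and target_next stand in for the real models, and in practice the target model scores all of the drafted positions in a single forward pass, which is where the speedup comes from:

            # Toy greedy speculative decoding: the draft proposes, the target verifies.
            def draft_next(ctx):
                # small, fast model (stand-in)
                return {"the": "cat", "cat": "sat", "sat": "down"}.get(ctx[-1], "the")

            def target_next(ctx):
                # large, slow model (stand-in; here it happens to agree with the draft)
                return draft_next(ctx)

            def speculative_step(ctx, k=4):
                # 1. draft model speculates k tokens cheaply
                proposed = []
                for _ in range(k):
                    proposed.append(draft_next(ctx + proposed))
                # 2. target model checks each proposed token; keep the agreeing prefix
                accepted = []
                for tok in proposed:
                    if target_next(ctx + accepted) == tok:
                        accepted.append(tok)
                    else:
                        break  # reject the rest, like a mispredicted branch
                # 3. the target always contributes one token after the accepted prefix,
                #    so the output matches what the target alone would have produced
                accepted.append(target_next(ctx + accepted))
                return ctx + accepted

            print(speculative_step(["the"]))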

  • andrewstuart 3 hours ago

    Could someone please bring Microsoft's Bitnet into the discussion and explain how its performance relates to this announcement, if at all?

    https://github.com/microsoft/BitNet

    "bitnet.cpp achieves speedups of 1.37x to 5.07x on ARM CPUs, with larger models experiencing greater performance gains. Additionally, it reduces energy consumption by 55.4% to 70.0%, further boosting overall efficiency. On x86 CPUs, speedups range from 2.37x to 6.17x with energy reductions between 71.9% to 82.2%. Furthermore, bitnet.cpp can run a 100B BitNet b1.58 model on a single CPU, achieving speeds comparable to human reading (5-7 tokens per second), significantly enhancing the potential for running LLMs on local devices. "

    • eptcyka 3 hours ago

      It is an inference engine for 1bit LLMs, not really comparable.

    • BoorishBears 3 hours ago

      The novelty of the inexplicable bitnet obsession has worn off I think.

      • qwertox 2 hours ago

        IDK, they remind me of Sigma-Delta ADCs [0], which are single bit ADCs but used in high resolution scenarios.

        I believe we'll get to hear more interesting things about Bitnet in the future.

        [0] https://en.wikipedia.org/wiki/Delta-sigma_modulation

      • Tepix 2 hours ago

        We have yet to see a large model trained using it, haven't we?

  • anonzzzies 3 hours ago

    Demo, API?

    • selcuka 3 hours ago
      • aliljet 3 hours ago

        That's odd, attempting a prompt fails because auth isn't working.

      • bestest 2 hours ago

        I filled out a lengthy prompt in the demo and submitted it. An auth window pops up. I don't want to log in, I want the demo. Such a repulsive approach.

        • swyx 25 minutes ago

          chill with the emotionally charged words. their hardware, their rules. if this upsets you, you will not have a good time on the modern internet.