76 comments

  • zackangelo 5 hours ago

    This is astonishingly fast. I’m struggling to get over 100 tok/s on my own Llama 3.1 70b implementation on an 8x H100 cluster.

    I’m curious how they’re doing it. Obviously the standard bag of tricks (eg, speculative decoding, flash attention) won’t get you close. It seems like at a minimum you’d have to do multi-node inference and maybe some kind of sparse attention mechanism?

    • danpalmer 4 hours ago

      Cerebras makes chips with ~1 million cores, and they're running inference on those, not on GPUs. It's an entirely different architecture, which means no networking between devices is involved. It's possible they're serving this largely from on-chip SRAM rather than HBM as well.

      I recommend the TechTechPotato YouTube videos on Cerebras to understand more of their chip design.

      • swyx 4 hours ago

        > TechTechPotato YouTube videos on Cerebras

        https://www.youtube.com/@TechTechPotato/search?query=cerebra... for anyone else looking. There are quite a lot of them.

      • accrual 4 hours ago

        I hope we can buy Cerebras cards one day. Imagine buying a ~$500 AI card for your desktop and having easy access to 70B+ models (the price is speculative/made up).

        • danpalmer 3 hours ago

          I believe pricing was mid six figures per machine. They're also something like 8U and water-cooled, I believe. I doubt it would be possible to deploy one outside a fairly top-tier colo facility with the ability to support water cooling. Also, imagine having to learn a new CUDA, but one designed for a completely different compute model.

          • trsohmers 22 minutes ago

            Based on their S1 filing and public statements, the average cost per WSE system for their largest customer (~90% of their total revenue) is ~$1.36M, and I've heard "retail" pricing of $2.5M per system. They are also 15U and, due to power and additional support equipment, take up an entire rack.

            The other thing people don't seem to be getting in this thread is that just holding the weights for 405B at FP16 requires 19 of their systems, since it is SRAM only… rounding up to 20 to account for program code + KV cache for the user context would mean 20 systems/racks, so well over $20M. Each full rack (including support equipment) also consumes 23kW, so we are talking nearly half a megawatt and ~$30M for them to be getting this performance on Llama 405B.
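
            A rough back-of-the-envelope, assuming the ~44 GB of on-wafer SRAM that Cerebras quotes for a WSE-3 system, lands on the same count (a Python sketch; all numbers are approximate):

                import math

                params = 405e9              # Llama 3.1 405B parameters
                bytes_per_param = 2         # FP16
                sram_per_system_gb = 44     # approximate on-wafer SRAM per WSE-3 system (assumed)

                weights_gb = params * bytes_per_param / 1e9            # ~810 GB of weights
                systems = math.ceil(weights_gb / sram_per_system_gb)   # -> 19 systems for weights alone
                print(f"{weights_gb:.0f} GB of weights -> {systems} systems just for the weights")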

            • danpalmer 19 minutes ago

              Thank you, a far better answer than mine! Those are indeed wild numbers, although interestingly it's "only" 23kW per rack; I'd expect the same level of compute in GPUs to draw quite a lot more than that, or at least at a higher power density.

              • YetAnotherNick 2 minutes ago

                You get ~400 TFLOP/s from an H100 at 350W. You need (2 * tokens/s * param count) FLOP/s. For 405B at 969 tok/s you need only ~785 TFLOP/s, which is just 2 H100s. The problem with GPUs for inference is memory bandwidth.
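
                The same arithmetic as a quick sketch; the ~400 TFLOP/s per H100 above is a round figure, not an exact spec:

                    params = 405e9          # Llama 3.1 405B parameters
                    tok_per_s = 969
                    h100_tflops = 400       # rough per-H100 figure from above

                    tflops_needed = 2 * tok_per_s * params / 1e12
                    print(f"{tflops_needed:.0f} TFLOP/s needed -> ~{tflops_needed / h100_tflops:.1f} H100s of raw compute")
                    # -> 785 TFLOP/s needed -> ~2.0 H100s of raw compute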

          • bboygravity 34 minutes ago

            That means it'll be close to affordable in 3 to 5 years if we follow the curve we've been on for the past decades.

          • initplus 43 minutes ago

            Yeah you can see the cooling requirements by looking at their product images. https://cerebras.ai/wp-content/uploads/2021/04/Cerebras_Prod...

            The thing is nearly all cooling. And look at the diameter of the water cooling pipes. The airflow guides on the fans are solid steel. Apparently the chip itself measures 21.5cm on a side. Insane.

        • visarga 2 hours ago

          You still have to pay for the memory. The Cerebras chip is fast because they use 700x more SRAM than, say, an A100 GPU. Holding the whole model in SRAM, so it can be read for every token you compute, is the expensive bit.

        • chessgecko 4 hours ago

          "One day" is doing some heavy, heavy lifting here; we're currently off by ~3-4 orders of magnitude…

          • accrual 3 hours ago

            Thank you for the reality check! :)

            • thomashop 3 hours ago

              We have moved 2 orders of magnitude in the last year. Not that unreasonable

          • grahamj 3 hours ago

            So 1000-10000 days? ;)

        • killingtime74 3 hours ago

          Maybe not $500, but $500,000

      • zackangelo 4 hours ago

        Ah, makes a lot more sense now.

    • simonw 3 hours ago

      They have a chip the size of a dinner plate. Take a look at the pictures: https://cerebras.ai/product-chip/

    • parsimo2010 4 hours ago

      They are doing it with custom silicon that has several times the area of 8x H100s. I'm sure they are doing some sort of optimization at execution/runtime, but the primary difference is the sheer transistor count.

      https://cerebras.ai/product-chip/

      • coder543 4 hours ago

        To be specific, a single WSE-3 has the same die area as about 57 H100s. It's a big chip.

        • cma 3 hours ago

          It's worth splitting out the stacked memory silicon layers on both, too (if Cerebras is set up with external DRAM). HBM is over 10 layers now, so the total die area is a good bit more than the chip footprint, but different process nodes are involved.

        • tomrod an hour ago

          Amazing!

    • mikewarot 12 minutes ago

      Imagine if you could take Llama 3.1 405B and break it down into a tree of logic gates, optimizing out things like multiplies by 0 in one of the bits, etc... then load it into a massive FPGA-like chip that had no von Neumann bottleneck, just pure compute without memory access latency.

      Such a system would be limited by the latency across the reported 126 layers' worth of math before it could generate the next token, which might be as much as 100 µs. So a single stream would be ~10x faster, but you could also have thousands of other independent streams pipelined through in parallel, because you'd get a token out of the end of the pipeline every cycle.
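
      A toy calculation of what that pipelining buys, where every number below (126 stages, ~100 µs end-to-end) is an assumption rather than a measurement:

          n_layers = 126                                   # assumed pipeline depth, one stage per layer
          pipeline_latency_us = 100.0                      # assumed end-to-end latency per token
          stage_delay_us = pipeline_latency_us / n_layers  # ~0.8 us per stage

          single_stream_tok_s = 1e6 / pipeline_latency_us  # latency-bound rate for one stream
          aggregate_tok_s = 1e6 / stage_delay_us           # one token exits per stage delay if the pipe stays full
          print(f"{single_stream_tok_s:.0f} tok/s per stream, {aggregate_tok_s / 1e6:.2f}M tok/s aggregate")
          # -> 10000 tok/s per stream, 1.26M tok/s aggregate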

      This is the future I want to build.

    • modeless 4 hours ago

      Cerebras is a chip company. They are not using GPUs. Their chip uses wafer-scale integration, which means it's the physical size of a whole wafer: dozens of GPUs' worth of silicon in one chip.

      They have limited memory on chip (all SRAM), and it's not clear how much off-wafer memory bandwidth they have per wafer. It's a completely different optimization problem than running on GPU clusters.

    • boroboro4 4 hours ago

      There are two big tricks: their chips are enormous, and they use SRAM as their memory, which is vastly faster than the HBM used by GPUs. In fact this is the main reason it's so fast. Groq gets its speed for the same reason.
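
      A crude roofline makes the point: at batch size 1 every weight has to be read once per token, so the ceiling is memory bandwidth divided by model size. The figures below are approximate vendor numbers (~3.35 TB/s of HBM3 per H100 SXM, ~21 PB/s of aggregate on-wafer SRAM bandwidth claimed for WSE-3) and ignore sharding across multiple wafers:

          model_bytes = 405e9 * 2       # 405B parameters at FP16, ~810 GB

          hbm_bw = 3.35e12              # ~3.35 TB/s HBM3 per H100 SXM (approximate)
          sram_bw = 21e15               # ~21 PB/s aggregate on-wafer SRAM (vendor claim)

          print(f"HBM-bound ceiling:  ~{hbm_bw / model_bytes:.1f} tok/s per H100")
          print(f"SRAM-bound ceiling: ~{sram_bw / model_bytes:,.0f} tok/s, before any other bottleneck")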

    • yalok 4 hours ago

      How much memory do you need to run FP8 Llama 3 70B? Can it potentially fit on 1 H100 GPU with 96GB RAM?

      In other words, if you wanted to run 8 separate 70B models on your cluster, each of which fits on 1 GPU, how much larger could your overall token output be than parallelizing 1 model across all 8 GPUs and having things slowed down a bit by NVLink?

      • qingcharles 2 hours ago

        It should work, I believe. And anything that doesn't fit you can leave in system RAM.

        Looks like an H100 runs about $30K online for one. Are there any issues with just sticking one of these in a stock desktop PC and running llama.cpp?

      • zackangelo 4 hours ago

        It's been a minute so my memory might be off, but I think when I ran 70B at FP16 it just barely fit on a 2x A100 80GB cluster, then quickly OOMed as the context/KV cache grew.

        So if I had to guess, a 96GB H100 could probably run it at FP8 as long as you didn't need a big context window. If you're doing speculative decoding it probably won't fit, because you also need weights and a KV cache for the draft model.
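
        A rough sizing sketch, assuming the commonly cited Llama 3 70B config (80 layers, 8 KV heads via GQA, head dim 128), FP8 weights, and an FP16 KV cache; the 2 GB overhead allowance is a guess:

            gpu_gb = 96
            weights_gb = 70e9 / 1e9                                    # FP8: one byte per param, ~70 GB

            layers, kv_heads, head_dim = 80, 8, 128                    # assumed Llama 3 70B GQA config
            kv_bytes_per_token = 2 * layers * kv_heads * head_dim * 2  # K and V, FP16 cache

            budget_gb = gpu_gb - weights_gb - 2                        # leave ~2 GB for activations/overhead
            max_tokens = budget_gb * 1e9 / kv_bytes_per_token
            print(f"~{kv_bytes_per_token / 1024:.0f} KB of KV cache per token, room for ~{max_tokens:,.0f} tokens")
            # -> ~320 KB of KV cache per token, room for ~73,242 tokens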

    • mmaunder 3 hours ago

      Nah. Try vLLM and 405B FP8 on that hardware. And make sure you’re benchmarking with some concurrency for max TPS.

    • hendler 2 hours ago

      Check out BaseTen for performant use of GPUs

  • danpalmer 4 hours ago

    I'm not sure if they're comparing apples to apples on the latency here. There are roughly three parts to the latency: the time to process the context/prompt, the time spent queueing for hardware access, and the other standard API overheads (network, etc.).

    From what I understand, several, maybe all, of the comparison services are not based on provisioned capacity, which means that the measurements include the queue time. For LLMs this can be significant. The Cerebras number on the other hand almost certainly doesn't have some unbounded amount of queue time included, as I expect they had guaranteed hardware access.

    The throughput here is amazing, but to get that throughput at a good latency for end-users means over-provisioning, and it's unclear what queueing will do to this. Additionally, does that latency depend on the machine being ready with the model, or does it include loading the model if necessary? If you're using a fine-tuned model, does that change the latency?

    I'm sure it's a clear win for batch workloads where you can keep Cerebras machines running at 100% utilisation and get 1k tokens/s constantly.

    • qeternity 3 hours ago

      Everyone presumes this is under ideal conditions...and it's incredible.

      It's bs=1. At 1,000 t/s. Of a 405B parameter model. Wild.

      • danpalmer 2 hours ago

        Cerebras' benchmark is most likely under ideal conditions, but I'm not sure it's possible to test public cloud APIs under ideal conditions: they're shared infrastructure, so you just don't know if a request is "ideal". I think you can only test these things across significant numbers of requests, and that still assumes that shared resource usage doesn't change much.

        • qeternity 2 hours ago

          I'm not talking about that. I and many others here have spun up 8x or more H100 clusters and run this exact model. Zero other traffic. You won't come anywhere close to this.

          • danpalmer an hour ago

            In that case I'm misunderstanding you. Are you saying that it's "BS" that they are reaching ~1k tokens/s? If so, you may be misunderstanding what a Cerebras machine is. Also, 8x H100 is still ~half the price of a single Cerebras machine, and that's even accounting for H100s being massively overpriced. You've got easily twice the value in a Cerebras machine; they have nearly 1M cores on a single die.

            • sam_dam_gai 34 minutes ago

              Ha ha. He probably means "at a batch size of 1", i.e. not even using some amortization tricks to get better numbers.

              • danpalmer 21 minutes ago

                Ah! That does make more sense!

      • colordrops 3 hours ago

        Right, I'd assume most LLM benchmarks are run on dedicated hardware.

  • owenpalmer 11 minutes ago

    Given that such a boost is possible with new hardware, I wonder what the ceiling is for improving training performance via hardware as well.

    • bufferoverflow a minute ago

      The ultimate solution would be to convert an LLM to a pure ASIC.

      My guess is that would 10X the performance. But then it's a very very expensive solution.

  • LASR 4 hours ago

    Given what you can do with current-gen models, along with RAG, multi-agent setups, and code interpreters, the wall is very much model latency, not accuracy, anymore.

    There are so many interactive experiences that could be made possible at this level of token throughput from 405B class models.

    • TechDebtDevin 3 hours ago

      Like what..

      • vineyardmike 2 hours ago

        You can create massive variants of OpenAI's o1 model. The "Chain of Thought" tools become way more useful when you can iterate 100x faster. Right now, flagship LLMs stream responses back and barely beat the speed a human can read, so adding CoT makes it really slow for human-in-the-loop experiences. You can get a lot more interesting "thoughts" (or workflow steps, or whatever) when it can do more without slowing down the human experience of using the tool.

        You can also get a lot fancier with tool-usage when you can start getting an LLM to use and reply to tools at a speed closer to the speed of a normal network service.

        I've never timed it, but I'm guessing current LLMs don't handle "live video" type applications well. Imagine an LLM you could actually video chat with - it'd be useful for walking someone through a procedure, or advanced automation of GUI applications, etc.

        AND the holy grail of AI applications that would combine all of this: robotics. Today, Cerebras chips are probably too power-hungry for battery-powered robotic assistants, but one could imagine a Star Wars-style robot assistant many years from now. You could have a robot that navigates some space (a home or work setting), sees its environment, and processes that video in real time. It could then reason about the world and its given task by explicitly thinking through steps and critically self-challenging those steps.

      • davidfiala 2 hours ago

        Imagine increasing the quality and FPS of those AI-generated minecraft clones and experiencing even more high-quality, realtime AI-generated gameplay

        (yeah, I know they are doing textual tokens. but just sayin..)

        edit: context is https://oasisaiminecraft.com/

  • arthurcolle 15 minutes ago

    Damn that's a big model and that's really fast inference.

  • WiSaGaN 5 hours ago

    I am wondering how much it costs to serve at this kind of latency. Of course, for customers the price depends on the pricing strategy, but the underlying cost really determines how widely this can be adopted. Is it only for businesses that really need the latency, or can it be deployed generally?

    • ilaksh 3 hours ago

      Maybe it could become standard for everyone to make giant chips and use SRAM?

      How many SRAM manufacturers are there? Or does it somehow need to be fully integrated into the chip?

      • AlotOfReading 3 hours ago

        SRAM is usually produced on the same wafer as the rest of the logic. SRAM on an external chip would lose many of the advantages without being significantly cheaper.

  • dgfitz 2 hours ago

    Holy bananas, the title alone is almost its own language.

  • fillskills 3 hours ago

    No mention of their direct competitor Groq?

    • icelancer 3 hours ago

      I'm a happily-paying customer of Groq but they aren't competitive against Cerebras in the 405b space (literally at all).

      Groq has paying customers below the enterprise level and actually serves its full range of models to everyone, unlike Cerebras, which is very selective, so they have that going for them. But in terms of sheer speed on the largest models, Groq doesn't really compare.

      • hendler 2 hours ago

        Is this because 405B doesn't fit on Groq? If they performed better, I would have liked to see it.

        • KTibow an hour ago

          When 405B first launched, Groq ran it; it's not currently available due to capacity issues, though.

  • aurareturn 2 hours ago

    Normally, I don't think 1000 tokens/s is that much more useful than 50 tokens/s.

    However, given that CoT makes models a lot smarter, I think Cerebras chips will be in huge demand from now on. You can have a lot more CoT runs when the inference is 20x faster.

    Also, I assume financial applications such as hedge funds would be buying these things in bulk now.

  • bargle0 3 hours ago

    Their hardware is cool and bizarre. It has to be seen in person to be believed. It reminds me of the old days when supercomputers were weird.

    • IAmNotACellist 3 hours ago

      Don't leave us hanging, show us a weird computer!

  • brcmthrowaway 4 hours ago

    So out of all AI chip startups, Cerebras is probably the real deal

    • icelancer 3 hours ago

      Groq is legitimate. Cerebras so far doesn't scale (wide) nearly as well as Groq. We'll see how it goes.

      • hendler 2 hours ago

        Google (TPUs), Amazon, a YC-funded ASIC/FPGA company, and a Chinese company all have custom hardware too that might scale well.

    • gdiamos 3 hours ago

      just in time for their ipo

      • ipsum2 3 hours ago

        It got cancelled/postponed.

  • easeout 2 hours ago

    How does binning work when your chip is the entire wafer?

    • shrubble 2 hours ago

      They expect that some of the cores on the wafer will fail, so they have redundant links throughout the chip; they can seal off/turn off any cores that fail and still have enough working cores to do useful work.

  • gdiamos 3 hours ago

    I'm so curious to see some multi-agent systems running with inference this fast.

    • ipsum2 3 hours ago

      There are no good open-source agent models at the moment, unfortunately.

  • germanjoey 4 hours ago

    Pretty amazing speed, especially considering this is BF16. But how many racks is this using? They used 4 racks for 70B, so this is, what, at least 24? A whole data center for one model?!

  • jadbox 5 hours ago

    Not open beta until Q1 2025

  • kuprel 3 hours ago

    I wonder if Cerebras could generate decent-quality video in real time.

  • maryndisouza 31 minutes ago

    Impressive! Llama 3.1 405B hitting 969 tokens/s on Cerebras Inference shows just how far AI hardware and models have come. The combination of powerful models and cutting-edge infrastructure is really pushing the boundaries of real-time performance!