This is astonishingly fast. I’m struggling to get over 100 tok/s on my own Llama 3.1 70b implementation on an 8x H100 cluster.
I’m curious how they’re doing it. Obviously the standard bag of tricks (eg, speculative decoding, flash attention) won’t get you close. It seems like at a minimum you’d have to do multi-node inference and maybe some kind of sparse attention mechanism?
Cerebras makes CPUs with ~1 million cores, and they're running inference on those, not on GPUs. It's an entirely different architecture, which means no inter-node network is involved. It's possible they're serving this largely from on-chip caches rather than HBM as well.
I recommend the TechTechPotato YouTube videos on Cerebras to understand more of their chip design.
> TechTechPotato YouTube videos on Cerebras
https://www.youtube.com/@TechTechPotato/search?query=cerebra... for anyone also looking. There are quite a lot of them.
I hope we can buy Cerebras cards one day. Imagine buying a ~$500 AI card for your desktop and having easy access to 70B+ models (the price is speculative/made up).
I believe pricing was mid six figures per machine. They're also something like 8U and water cooled, I believe. I doubt it would be possible to deploy one outside of a fairly top-tier colo facility that can support water cooling. Also imagine having to learn a new CUDA-equivalent designed for a completely different compute model.
Based on their S-1 filing and public statements, the average cost per WSE system for their largest customer (~90% of their total revenue) is ~$1.36M, and I've heard "retail" pricing of $2.5M per system. They are also 15U and, due to power and additional support equipment, take up an entire rack.
The other thing people don't seem to be getting in this thread is that just to hold the weights for 405B at FP16 requires 19 of their systems, since it is SRAM only. Rounding up to 20 to account for program code plus KV cache for the user context would mean 20 systems/racks, so well over $20M. The full rack (including support equipment) also consumes 23kW, so we are talking nearly half a megawatt and ~$30M for them to be getting this performance on Llama 405B.
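Rough capacity math behind that 19-system figure (a sketch; the 44 GB SRAM number is from Cerebras' public WSE-3 specs, everything else is approximate):

    import math
    weight_bytes = 405e9 * 2             # 405B params at FP16 -> ~810 GB of weights
    sram_per_wafer = 44e9                # ~44 GB of on-wafer SRAM per WSE-3 (public spec)
    print(math.ceil(weight_bytes / sram_per_wafer))   # 19 systems just for the weights
    # +1 more for program code and KV cache headroom -> ~20 racks, matching the estimate above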
Thank you, far better answer than mine! Those are indeed wild numbers, although interestingly "only" 23kW; I'd expect the same level of compute in GPUs to be quite a lot more than that, or at least higher power density.
You get ~400 TFLOP/s from an H100 at 350W. You need roughly (2 * tokens/s * param count) FLOP/s. For 405B at 969 tok/s, that's just 784 TFLOP/s, which is only 2 H100s. The problem with GPUs for inference is memory bandwidth.
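To make that concrete, here's the same back-of-envelope in code, plus the bandwidth side of the argument (the 3.35 TB/s HBM figure is the H100 SXM spec; all numbers approximate):

    params = 405e9
    tok_s = 969
    flops_needed = 2 * params * tok_s        # ~7.85e14 FLOP/s, i.e. ~785 TFLOP/s of compute

    # At batch size 1, every generated token has to stream all the weights from memory once
    bytes_per_param = 2                      # bf16
    bw_needed = params * bytes_per_param * tok_s   # ~785 TB/s of memory bandwidth
    h100_hbm_bw = 3.35e12                    # ~3.35 TB/s per H100 SXM
    print(bw_needed / h100_hbm_bw)           # ~234 H100s' worth of bandwidth vs ~2 of compute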
That means it'll be close to affordable in 3 to 5 years if we follow the curve we've been on for the past decades.
Yeah you can see the cooling requirements by looking at their product images. https://cerebras.ai/wp-content/uploads/2021/04/Cerebras_Prod...
The thing is nearly all cooling. Look at the diameter of the water cooling pipes; the airflow guides on the fans are solid steel. Apparently the chip itself measures about 21.5 cm on a side. Insane.
You still have to pay for the memory. The Cerebras chip is fast because they use 700x more SRAM than, say, A100 GPUs. Loading the whole model in SRAM every time you compute one token is the expensive bit.
"One day" is doing some heavy heavy lifting here; we're currently off by ~3-4 orders of magnitude…
Thank you, for the reality check! :)
We have moved 2 orders of magnitude in the last year. Not that unreasonable
So 1000-10000 days? ;)
Maybe not $500, but $500,000
Ah, makes a lot more sense now.
Also, the WSE-3 pulls 15 kW. https://www.eetimes.com/cerebras-third-gen-wafer-scale-chip-...
But 8x H100 is ~2.6-5.2 kW (I get conflicting info, I think depending on PCIe vs SXM), so anywhere between roughly even and up to 2x as efficient.
They have a chip the size of a dinner plate. Take a look at the pictures: https://cerebras.ai/product-chip/
21 petabytes per second. Can push the whole internet over that chip xD
What kind of yield do they get at that size?
It's near 100%. Discussed here:
https://youtu.be/f4Dly8I8lMY?t=95
Part of their technology is managing/bypassing defects.
I'd love to see the heatsink for this lol
They call it the “engine block”!
https://www.servethehome.com/a-cerebras-cs-2-engine-block-ba...
They are doing it with custom silicon with several times more area than 8x H100s. I’m sure they are doing some sort of optimization at execution/runtime, but the primary difference is the sheer transistor count.
https://cerebras.ai/product-chip/
To be specific, a single WSE-3 has the same die area as about 57 H100s. It's a big chip.
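For anyone who wants the arithmetic behind the "about 57" (public die-area figures, so treat as approximate):

    wse3_area = 215 * 215     # WSE-3 is roughly 215 mm x 215 mm ~= 46,225 mm^2
    h100_area = 814           # H100 (GH100) die is ~814 mm^2
    print(wse3_area / h100_area)   # ~56.8, i.e. about 57 H100 dies of silicon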
It is worth splitting out the stacked memory silicon layers on both too (if Cerebras is set up with external DRAM memory). HBM is over 10 layers now so the die area is a good bit more than the chip area, but different process nodes are involved.
Amazing!
Imagine if you could take Llama 3.1 405B and break it down into a tree of logic gates, optimizing out things like multiplies by 0 in one of the bits, etc... then load it into a massive FPGA-like chip that had no von Neumann bottleneck: just pure compute without memory access latency.
Such a system would be limited by the latency across the reported 126 layers' worth of math before it could generate the next token, which might be as much as 100 µs. So it would be 10x faster, but you could also have thousands of other independent streams pipelined through in parallel, because you'd get a token per clock cycle out the end.
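A quick sketch of that latency math (the 100 µs figure is the parent's assumption, not a measured number):

    per_token_latency = 100e-6                    # ~100 us to push one token through ~126 layers of logic
    single_stream_tok_s = 1 / per_token_latency   # = 10,000 tok/s, ~10x the 969 tok/s reported here
    # With the pipeline kept full, aggregate throughput becomes one token per clock,
    # shared across however many independent streams are in flight at once.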
This is the future I want to build.
Cerebras is a chip company. They are not using GPUs. Their chip uses wafer scale integration which means it's the physical size of a whole wafer, dozens of GPUs in one.
They have limited memory on chip (all SRAM) and it's not clear how much HBM bandwidth they have per wafer. It's a completely different optimization problem than running on GPU clusters.
There are two big tricks: their chips are enormous, and they use SRAM as their memory, which is vastly faster than the HBM used by GPUs. In fact this is the main reason it's so fast. Groq gets its speed for the same reason.
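Some rough numbers on that bandwidth gap (both from public specs, so approximate):

    wse3_sram_bw = 21e15      # Cerebras quotes ~21 PB/s of on-wafer SRAM bandwidth for WSE-3
    h100_hbm_bw = 3.35e12     # ~3.35 TB/s of HBM3 bandwidth on an H100 SXM
    print(wse3_sram_bw / h100_hbm_bw)   # ~6,300x more memory bandwidth per device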
How much memory do you need to run FP8 Llama 3 70B? Can it potentially fit on 1 H100 GPU with 96GB RAM?
In other words, if you wanted to run 8 separate 70B models on your cluster, each of which fits on 1 GPU, how much larger could your overall token output be than running 1 model sharded across 8 GPUs and having things slowed down a bit by NVLink traffic?
It should work, I believe. And anything that doesn't fit you can leave on your system RAM.
Looks like an H100 runs about $30K online for one. Are there any issues with just sticking one of these in a stock desktop PC and running llama.cpp?
It's been a minute so my memory might be off, but I think when I ran 70B at FP16 it just barely fit on a 2x A100 80GB cluster and then quickly OOMed as the context/KV cache grew.
So if I had to guess, a 96GB H100 could probably run it at FP8 as long as you didn't need a big context window. If you're doing speculative decoding it probably won't fit, because you also need weights and KV cache for the draft model.
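For a rough sense of the numbers (Llama 3 70B's architecture is public: 80 layers, 8 KV heads via GQA, head dim 128; the rest is back-of-envelope):

    weights = 70e9 * 1                          # ~70 GB of weights at FP8 (1 byte/param)
    layers, kv_heads, head_dim = 80, 8, 128
    kv_per_token = 2 * layers * kv_heads * head_dim * 1   # K+V at FP8 ~= 160 KB per token
    print(kv_per_token * 8_192 / 1e9)           # ~1.3 GB of KV cache at 8k context
    print(kv_per_token * 131_072 / 1e9)         # ~21.5 GB at the full 128k context
    # 70 GB + ~21.5 GB + activations/overhead is tight on 96 GB, so a modest context
    # window fits but a full-length one gets uncomfortable.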
Nah. Try vLLM and 405B FP8 on that hardware. And make sure you’re benchmarking with some concurrency for max TPS.
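If you go that route, a minimal sketch with vLLM's offline API might look like this (the model name, parallelism and sampling settings here are illustrative assumptions, not a tuned config):

    from vllm import LLM, SamplingParams

    # Shard the FP8 checkpoint across all 8 GPUs with tensor parallelism
    llm = LLM(
        model="meta-llama/Llama-3.1-405B-Instruct-FP8",   # assumed FP8 checkpoint
        tensor_parallel_size=8,
    )
    params = SamplingParams(max_tokens=256)

    # Batch many prompts at once; single-stream tok/s will look unimpressive,
    # but aggregate throughput climbs quickly with concurrency
    prompts = ["Explain wafer-scale integration in one paragraph."] * 64
    outputs = llm.generate(prompts, params)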
Check out BaseTen for performant use of GPUs
I'm not sure they're comparing apples to apples on the latency here. There are roughly three parts to it: the time to process the context/prompt, the time spent queueing for hardware access, and the other standard API overheads (network, etc).
From what I understand, several, maybe all, of the comparison services are not based on provisioned capacity, which means that the measurements include the queue time. For LLMs this can be significant. The Cerebras number on the other hand almost certainly doesn't have some unbounded amount of queue time included, as I expect they had guaranteed hardware access.
The throughput here is amazing, but to get that throughput at a good latency for end-users means over-provisioning, and it's unclear what queueing will do to this. Additionally, does that latency depend on the machine being ready with the model, or does that include loading the model if necessary? If using a fine-tuned model does this change the latency?
I'm sure it's a clear win for batch workloads where you can keep Cerebras machines running at 100% utilisation and get 1k tokens/s constantly.
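One way to separate those components yourself is to time the stream directly against whichever OpenAI-compatible endpoint you're testing (the endpoint, model name, and key below are placeholders, not real values):

    import time
    from openai import OpenAI

    client = OpenAI(base_url="https://example-provider/v1", api_key="YOUR_KEY")  # placeholder
    start = time.perf_counter()
    first_token_at, n_chunks = None, 0

    stream = client.chat.completions.create(
        model="llama-3.1-405b",              # placeholder model name
        messages=[{"role": "user", "content": "Count to twenty."}],
        stream=True,
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            if first_token_at is None:
                first_token_at = time.perf_counter()   # includes queueing + prefill
            n_chunks += 1
    total = time.perf_counter() - start
    ttft = first_token_at - start
    print(f"TTFT: {ttft:.3f}s, decode rate: {n_chunks / (total - ttft):.1f} chunks/s")

Caveat: chunks aren't exactly tokens, and TTFT still mixes queue time with prefill, but run over many requests this gives a feel for how much of a provider's latency is scheduling versus actual generation.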
Everyone presumes this is under ideal conditions...and it's incredible.
It's bs=1. At 1,000 t/s. Of a 405B parameter model. Wild.
Cerebras' benchmark is most likely under ideal conditions, but I'm not sure it's possible to test public cloud APIs under ideal conditions as it's shared infrastructure so you just don't know if a request is "ideal". I think you can only test these things across significant numbers of requests, and that still assumes that shared resource usage doesn't change much.
I'm not talking about that. I and many others here have spun up 8x or more H100 clusters and run this exact model. Zero other traffic. You won't come anywhere close to this.
In that case I'm misunderstanding you. Are you saying that it's "BS" that they are reaching ~1k tokens/s? If so, you may be misunderstanding what a Cerebras machine is. Also, 8x H100 is still ~half the price of a single Cerebras machine, and that's even accounting for H100s being massively overpriced. You've easily got twice the value in a Cerebras machine; they have nearly 1M cores on a single die.
Ha ha. He probably means "at a batch size of 1", i.e. not even using some amortization tricks to get better numbers.
Ah! That does make more sense!
Right, I'd assume most LLM benchmarks are run on dedicated hardware.
Given that such a boost is possible with new hardware, I wonder what the ceiling is for improving training performance via hardware as well.
The ultimate solution would be to convert an LLM to a pure ASIC.
My guess is that would 10X the performance. But then it's a very very expensive solution.
For what you can do with current-gen models, along with RAG, multi-agent setups & code interpreters, the wall is very much model latency, and no longer accuracy.
There are so many interactive experiences that could be made possible at this level of token throughput from 405B class models.
Like what?
You can create massive variants of OpenAI's o1 model. "Chain of Thought" approaches become way more useful when you can iterate 100x faster. Right now, flagship LLMs stream responses back and barely beat the speed a human can read, so adding CoT makes it really slow for human-in-the-loop experiences. You can get a lot more interesting "thoughts" (or workflow steps, or whatever) when the model can do more without slowing down the human experience of using the tool.
You can also get a lot fancier with tool-usage when you can start getting an LLM to use and reply to tools at a speed closer to the speed of a normal network service.
I've never timed it, but I'm guessing current LLMs don't handle "live video" type applications well. Imagine an LLM you could actually video chat with - it'd be useful for walking someone through a procedure, or advanced automation of GUI applications, etc.
AND the holy grail of AI applications that would combine all of this: Robotics. Today, Cerebras chips are probably too power hungry for battery-powered robotic assistants, but one could imagine a Star Wars-style robot assistant many years from now. You could have a robot that navigates some space (a home or work setting), sees its environment, and processes the video in real time. Then it can reason about the world and its given task by explicitly thinking through steps and critically self-challenging those steps.
Imagine increasing the quality and FPS of those AI-generated Minecraft clones and experiencing even higher-quality, realtime AI-generated gameplay.
(yeah, I know they are doing textual tokens. but just sayin..)
edit: context is https://oasisaiminecraft.com/
Damn that's a big model and that's really fast inference.
I am wondering how much it costs to serve at such latency. Of course, for customers the sticker price depends on the pricing strategy, but the underlying cost really determines how widely this can be adopted. Is it only for businesses that really need the latency, or can this be deployed generally?
Maybe it could become standard for everyone to make giant chips and use SRAM?
How many SRAM manufacturers are there? Or does it somehow need to be fully integrated into the chip?
SRAM is usually produced on the same wafer as the rest of the logic. SRAM on an external chip would lose many of the advantages without being significantly cheaper.
Holy bananas, the title alone is almost its own language.
No mention of their direct competitor Groq?
I'm a happily-paying customer of Groq but they aren't competitive against Cerebras in the 405b space (literally at all).
Groq has paying customers below the enterprise level and actually serves all their models broadly to everyone, unlike Cerebras, which is very selective, so they have that going for them. But in terms of sheer speed on the largest models, Groq doesn't really compare.
Is this because 405B doesn't fit on Groq? If they performed better, I would have liked to see that too.
When 405B first launched, Groq ran it; it's not currently available due to capacity issues though.
Normally, I don't think 1000 tokens/s is that much more useful than 50 tokens/s.
However, given that CoT makes models a lot smarter, I think Cerebras chips will be in huge demand from now on. You can have a lot more CoT runs when the inference is 20x faster.
Also, I assume financial applications such as hedge funds would be buying these things in bulk now.
Their hardware is cool and bizarre. It has to be seen in person to be believed. It reminds me of the old days when supercomputers were weird.
Don't leave us hanging, show us a weird computer!
So out of all AI chip startups, Cerebras is probably the real deal
Groq is legitimate. Cerebras so far doesn't scale (wide) nearly as well as Groq. We'll see how it goes.
Google (TPUs), Amazon, a YC-funded ASIC/FPGA company, and a Chinese company all have custom hardware too that might scale well.
Just in time for their IPO.
It got cancelled/postponed.
How does binning work when your chip is the entire wafer?
They expect that some of the cores on the wafer will fail, so they have redundant links all throughout the chip, so they can seal off/turn off any cores that fail and still have enough cores to do useful work.
I'm so curious to see some multi-agent systems running with inference this fast.
There are no good open-source agent models at the moment, unfortunately.
Pretty amazing speed, especially considering this is BF16. But how many racks is this using? They used 4 racks for 70B, so this is, what, at least 24? A whole data center for one model?!
Not open beta until Q1 2025
I wonder if Cerebras could generate decent-quality video in real time.
Impressive! Llama 3.1 405B hitting 969 tokens/s on Cerebras Inference shows just how far AI hardware and models have come. The combination of powerful models and cutting-edge infrastructure is really pushing the boundaries of real-time performance!