26 comments

  • rbanffy an hour ago

    Very impressive numbers - I'd expect 2K tok/s on Cerebras hardware, not H200s.

  • dust42 2 hours ago

    If I followed the links correctly this benchmark was made on a 16xH200. At current prices I'd assume that is a system price of around $750,000.

    The year has 86400*365 = 31536000 seconds. Thus, at 2,000 tokens/s, 63072000000 tokens can be generated per year. As pricing is usually given per 1M tokens generated, this is 63072 such packages.

    Now let's write off the investment over 3 years: 750,000/3 = 250,000 per year, and 250,000/63072 = 3.96. So almost $4 per 1M tokens generated with prompt processing included.

    Model was a DeepSeek 671B MoE with 37B active parameters.

    Looks to me that $20 for a month of coding is not very sustainable - let's enjoy the party while VCs are financing it! And keep an eye on your consumption...

    Electricity costs seem negligible with ~$10,000 per year at 10cts per kWh, but overall cost would be ~10% higher if electricity is more like 30cts like it is in Europe.

    Edit: as other commenters pointed out, it is 2200t/s per single GPU, thus the result needs to be divided by 16: $4/16 = $0.25. This actually somewhat matches the DeepSeek API pricing.
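The arithmetic in this comment and its replies can be checked with a short sketch. The inputs ($750,000 system price, 3-year write-off, 16 GPUs, 2,200 tokens/s per GPU, ~$10,000 per year of electricity at 10cts per kWh) are the thread's own estimates. The function name and the assumption of 100% utilization all year are illustrative, not measured values.

```python
SECONDS_PER_YEAR = 86_400 * 365  # 31,536,000


def cost_per_million_tokens(system_price_usd, writeoff_years, gpus, tokens_per_sec_per_gpu):
    """Hardware cost per 1M tokens, assuming 100% utilization (an idealization)."""
    yearly_cost = system_price_usd / writeoff_years
    tokens_per_year = gpus * tokens_per_sec_per_gpu * SECONDS_PER_YEAR
    return yearly_cost / (tokens_per_year / 1_000_000)


# Original reading: ~2,000 tokens/s for the whole system.
original = cost_per_million_tokens(750_000, 3, 1, 2_000)
# Corrected reading: 2,200 tokens/s on each of the 16 GPUs.
corrected = cost_per_million_tokens(750_000, 3, 16, 2_200)

# Electricity: $10,000/year at $0.10/kWh implies about 100,000 kWh/year.
kwh_per_year = 10_000 / 0.10
extra_at_30ct = (0.30 - 0.10) * kwh_per_year   # added yearly cost at 30cts
relative_increase = extra_at_30ct / (750_000 / 3)  # vs. yearly hardware write-off

print(f"original:  ${original:.2f} per 1M tokens")
print(f"corrected: ${corrected:.3f} per 1M tokens")
print(f"cost increase at 30cts/kWh: {relative_increase:.0%}")
```

Under these assumptions the corrected figure comes out at roughly 22.5 cents per 1M tokens, and the higher electricity price adds about 8% to the yearly hardware cost. Lower real-world utilization would raise the per-token cost proportionally.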

    • yorwba an hour ago

      It's 2.2k tokens per second per GPU, so you have to multiply the token output by 16 and the price per million tokens works out to 22.5 cents.

    • supermatt an hour ago

      > more like 30cts like it is in Europe

      Nope - I live in one of the most expensive areas, and even the residential price has averaged 18c/kWh delivered including taxes. Businesses get a lower basic rate and also don't pay the VAT, so it works out around 13c/kWh for them.

      https://data.nordpoolgroup.com/auction/day-ahead/prices?deli...

      • t0mas88 an hour ago

        That's excluding tax; net prices around 0.20-0.30 EUR/kWh are common.

        • supermatt an hour ago

          I updated my comment to include my personal delivered rate including VAT - also note that businesses (like a data center) don't pay the VAT and have substantially reduced delivery fees at high voltage

    • kicks66 an hour ago

      I think you missed something here - it's 2.2k tokens _per_ GPU

      So if you work that through it's $0.225 per 1M output tokens.

    • edf13 an hour ago

      > let's enjoy the party while VCs are financing it!

      The VC money is there until they can solve the optimization problems

  • kingstnap 8 hours ago

    Impressive performance work. It's interesting that you still see 40+% perf gains like this.

    Makes you think that you will continue to see the costs for a fixed level of "intelligence" dropping.

    • davidhyde 2 hours ago

      vLLM needs to perform similar operations to an operating system. If you write an operating system in Python you will have scope for many 40% improvements all over the place and in the end it won’t be Python anymore, at least under the hood it won’t be.

      • menaerus 35 minutes ago

        It's not about Python at all. The optimization techniques are on a completely different level: the level of the chip and/or hardware platform, finding ways to utilize them to the maximum by exploiting the intrinsic details of their limitations.

    • whoevercares 7 hours ago

      Absolutely. LLM inference is still a greenfield — things like overlap scheduling and JIT CUDA kernels are very recent. We’re just getting started optimizing for modern LLM architectures, so cost/perf will keep improving fast.

  • mycelia an hour ago

    Hey HN! I’m Seiji Eicher from Anyscale, one of the authors of this post :) Feel free to ask questions here.

    • menaerus 43 minutes ago

      Do you use agentic AI yet for this type of optimization work or no?

  • snakepit 6 hours ago

    Still have to update it for snakepit 0.11.0, but I did start a vLLM wrapper for Elixir

    https://hex.pm/packages/vllm

  • androiddrew 7 hours ago

    Now all we need is better support for AMD GPUs, both CDNA and RDNA types

    • mappu 6 hours ago

      ZLUDA implements CUDA on top of AMD ROCm - they are explicitly targeting vLLM as their PyTorch compatibility test: https://vosen.github.io/ZLUDA/blog/zluda-update-q4-2025/#pyt...

      (PyTorch does also support ROCm generally, it shows up as a CUDA device.)

      • ikari_pl 3 hours ago

        I feel like these technologies are named by the Polish at the companies. "CUDA" means "WONDERS" and "ZŁUDA" would be an "ILLUSION".

    • sofixa an hour ago

      You can run vLLM with AMD GPUs supported by ROCm: https://rocm.docs.amd.com/en/latest/how-to/rocm-for-ai/infer...

      However from experience with an AMD Strix Halo, a couple of caveats: it's drastically slower than Ollama (tested over a few weeks, always using the official AMD vLLM nightly releases), and not all GPUs were supported for all models (but that has been fixed).

  • danielhanchen 8 hours ago

    Love vLLM!

  • vessenes 7 hours ago

    As a user of a lot of coding tokens I’m most interested in latency - these numbers are presumably for heavily batched workloads. I dearly wish Claude had a Cerebras endpoint.

    I’m sure I’d use more tokens because I’d get more revs, but I don’t think token usage would increase linearly with speed: I need time to think about what I want to do and what’s happened or is proposed. But I feel like I would be able to stay in flow state if the responses were faster, and that’s super appealing.

  • behnamoh 3 hours ago

    I couldn't care less, tbh. This speed is ridiculously high, to the point where tool calls, not inference, become the bottleneck.

    Also, I'd rather run a large model at slower speeds than a smaller one at insanely high speeds.

    • menaerus 2 hours ago

      Well the thing is that the trajectory of people utilizing the models is only increasing so getting the most out of your HW becomes a particularly interesting optimization point for companies doing the inference at massive scale.

    • spiderfarmer 3 hours ago

      You care enough to comment, so you could in fact have cared even less.

      Also, the entire industry profits from the work that’s done at the bleeding edge. That’s the case in every industry.

    • est 2 hours ago

      Have you considered parallel processing? I always have 2-3 Cursor IDEs open because I don't like waiting either.

      • bob1029 2 hours ago

        Parallel tool calls do not work for my scenario. I can't ask a copy of my agent a question about something until a dependent call has resolved.

        Tool use that changes the mode of the environment is a good example where you cannot go parallel. I've built a recursive agent that can run a Unity editor and I can't just blindly run whatever it wants in parallel or combos like SwitchScene -> GetSceneOverview won't interleave correctly. You'll wind up with 15 calls that loop over every scene and then you grab the overview from the last scene you switched to 15 times.

        There are ways to hack around it a bit, but at some level the underlying narrative does need to be serialized or you'll be wasting an incredible amount of resources.

        Depth-first search doesn't guarantee the best solution, but on average it's guaranteed to find a solution faster than breadth-first search. It's worth waiting for those dependent calls and going super deep if you want some reasonable answer quickly.
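The SwitchScene -> GetSceneOverview problem described above can be reproduced with a toy model. This is only a sketch: `ToyEditor` and its methods are hypothetical stand-ins for a stateful environment, not the Unity API or the commenter's agent, and "batched" here means all switches are issued before any overview is read.

```python
class ToyEditor:
    """Hypothetical stateful environment with a single 'current scene'."""

    def __init__(self):
        self.current_scene = None

    def switch_scene(self, name):
        # Mutates shared state, so later calls depend on this one.
        self.current_scene = name

    def get_scene_overview(self):
        # Reads whatever scene happens to be active right now.
        return f"overview of {self.current_scene}"


scenes = ["Menu", "Level1", "Level2"]

# Serialized: each dependent pair completes before the next one starts.
editor = ToyEditor()
serialized = []
for scene in scenes:
    editor.switch_scene(scene)
    serialized.append(editor.get_scene_overview())

# Naively batched: every switch runs first, then every overview.
editor = ToyEditor()
for scene in scenes:
    editor.switch_scene(scene)
batched = [editor.get_scene_overview() for _ in scenes]

print(serialized)
print(batched)
```

The serialized run returns one overview per scene, while the batched run returns the overview of the last scene once per call, which is the wasted work the comment describes.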