Blackwell: Nvidia's GPU

(chipsandcheese.com)

103 points | by pella 18 hours ago

28 comments

ggreg84 7 hours ago

Chips and Cheese GPU analyses are pretty detailed, but they need to be taken with a huge grain of salt because the results only really apply to OpenCL, and nobody buying NVIDIA or AMD GPUs for compute runs OpenCL on them; it's either CUDA or HIP, which differ widely in parts of their compilation stack.

After reading the entire analysis, I'm left wondering, what observations in this analysis - if any - actually apply to CUDA?

nromiun 5 hours ago

For benchmarking code like this, CUDA, HIP and OpenCL are almost the same. You will only see the difference in big codebases, where you launch multiple kernels and move data between them.

Otherwise OpenCL is very good as well, with the added benefit of running on all GPUs.

almostgotcaught 6 hours ago

> it's either CUDA or HIP, which differ widely in parts of their compilation stack.

This is an ironic comment - OpenCL uses the same compiler as CUDA on NVIDIA and HIP on AMD.

JonChesterfield 5 hours ago

Sort of. Same compiler backend, mostly, but the set of intrinsics and semantic rules are different.

almostgotcaught 3 hours ago

i have no idea what your point is - same compiler, different frontend, yes that's literally what i said.

Aissen 6 hours ago

Does the comparison even make sense, considering there's (more than) an order of magnitude difference in price between AMD's desktop GPU and NVIDIA's workstation accelerator?

CalChris 12 hours ago

The Nvidia technical brief says 208 billion transistors.

https://resources.nvidia.com/en-us-blackwell-architecture

Blackwell uses the TSMC 4NP process. It has two layers. A very back of the envelope estimate:

  750 mm^2 / ((208/2) * 10^9 transistors) ≈ 7211 nm^2 per transistor
  ≈ 85 nm x 85 nm

NB: process feature size does not equal transistor size. Process feature size doesn't even equal process feature size.
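
For anyone who wants to redo the envelope math, here is a minimal Python sketch. The inputs (750 mm^2, 208 billion transistors, the division by two) are taken straight from the estimate above; the function and constant names are made up for illustration. It only computes an average area per transistor and says nothing about real device dimensions.

```python
import math

# Figures quoted in the estimate above; treat them as rough inputs.
DIE_AREA_MM2 = 750.0   # area used in the estimate
TRANSISTORS = 208e9    # total from the Nvidia technical brief
SPLIT = 2              # the estimate divides the transistor count by two

NM2_PER_MM2 = 1e12     # 1 mm = 1e6 nm, so 1 mm^2 = 1e12 nm^2


def area_per_transistor_nm2(die_area_mm2, transistors, split):
    """Average area per transistor in nm^2 (hypothetical helper)."""
    transistors_per_area = transistors / split
    return die_area_mm2 * NM2_PER_MM2 / transistors_per_area


area = area_per_transistor_nm2(DIE_AREA_MM2, TRANSISTORS, SPLIT)
side = math.sqrt(area)  # side of a square with the same area
print(f"{area:.1f} nm^2 per transistor, side of about {side:.1f} nm")
```

The result lands at roughly 7,200 nm^2, or a square about 85 nm on a side, which matches the figures above.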

gchadwick 9 hours ago

> It has two layers

Where did you get that from? Pretty sure it's a single planar set of transistors. Those transistors are manufactured using multiple layers of mask.

FinFET transistors are described as 3D or non-planar, but crucially this doesn't allow transistor-on-transistor stacking; you've just got the gate structure of the FinFET poking out above the plane of the rest of the transistor.

Silicon-on-silicon die stacking is a possibility, but it limits your power, and GPUs run very hot, so it's not an option for them.

murderfs 7 hours ago

GPUs are not particularly hot for compute silicon; they just have ridiculously huge dies. Comparing the 5090 to a Core Ultra 285K, the GPU has a 750 mm^2 die compared to the CPU's 243 mm^2, but has a peak power of 575W compared to 250W. The CPU uses roughly 34% more power per area (equivalently, the GPU uses about 25% less), and that's before considering the fact that consumer CPUs are packaged for user installation, so there's an extra heatspreader on top of the die, whereas GPUs are sold as integrated units, so the heatsink sits directly on top of the die.
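
The power-per-area comparison can be checked with a few lines of Python. This is only a sketch using the peak power and die area figures quoted above; the helper name is hypothetical, and peak power is a crude stand-in for sustained heat output.

```python
def power_density_w_per_mm2(peak_watts, die_area_mm2):
    """Peak power divided by die area (hypothetical helper)."""
    return peak_watts / die_area_mm2


# Figures quoted in the comment above.
gpu = power_density_w_per_mm2(575.0, 750.0)  # 5090
cpu = power_density_w_per_mm2(250.0, 243.0)  # Core Ultra 285K

ratio = cpu / gpu  # how much denser the CPU is than the GPU
print(f"GPU: {gpu:.2f} W/mm^2, CPU: {cpu:.2f} W/mm^2, CPU/GPU: {ratio:.2f}")
```

With these inputs the GPU comes out near 0.77 W/mm^2 and the CPU near 1.03 W/mm^2, so the CPU is about 1.34x denser, or put the other way, the GPU draws about 25% less power per unit area.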

kvemkon 6 hours ago

> consumer CPUs are packaged for user installation

I'd say advanced users or skilled staff.

20+ years ago, the Athlon XP, for example, had a small CPU die in the middle and 4 round spacers in the corners for proper heatsink installation, even though the CPU die wouldn't clock down and would go up in flames if the cooler was removed during operation.

Nowadays, with a safer CPU that monitors its own temperature, one has to take the risk of removing the heatspreader and replacing it with "special" direct-die cooling, resulting in either a bit more performance, 15-20 degrees lower temperatures, or a smaller or quieter cooler. One is free to choose.

Sure, even an advanced user must take more care working around the naked die. But the technology to make this safer than before could also have matured.

bgnn an hour ago

This assumes 100% utilization. Realistically, the utilization (active device area relative to total die area) is 70-75% at best.
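
As a rough illustration of how much that changes the estimate, here is a small Python sketch. It reuses the 750 mm^2 and 104 billion (208/2) figures from the parent estimate and simply scales the area by the utilization fraction; the function name is made up, and the 70-75% range is the one stated above.

```python
import math

DIE_AREA_NM2 = 750.0 * 1e12  # 750 mm^2 expressed in nm^2
TRANSISTORS = 104e9          # 208 billion divided by two, as in the parent estimate


def active_area_per_transistor_nm2(utilization):
    """Area per transistor when only a fraction of the die holds devices."""
    return DIE_AREA_NM2 * utilization / TRANSISTORS


# Utilization fractions: the 70-75% range above, plus the 100% baseline.
results = {u: active_area_per_transistor_nm2(u) for u in (0.70, 0.75, 1.00)}
for u, area in results.items():
    print(f"utilization {u:.0%}: {area:.0f} nm^2, side ~{math.sqrt(area):.0f} nm")
```

At 70-75% utilization the average footprint drops to roughly 5,000-5,400 nm^2, a square a little over 70 nm on a side instead of about 85 nm.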

dist-epoch 12 hours ago

You also need space for wires, ..., etc, right? It's not just transistors.

gchadwick 10 hours ago

The wires sit on top of the transistors. Many layers of them in a modern process.

However, you can't always pack the transistors as densely as you would like because you can't fit the wiring for them in above at the same density.

Plus there are various 'design rules' that constrain how things get placed. These are needed to ensure manufacturing is successful and achieves good yield. An important set of rules is the 'antenna rules', which require the insertion of antenna diodes (these use silicon, reducing transistor density) to prevent circuitry being destroyed during manufacturing: https://www.zerotoasiccourse.com/terminology/antenna-report/

CalChris 11 hours ago

The wires didn't fit on the back of the envelope.

a_wild_dandan 11 hours ago

I love this retort and I'm stealing it.

amelius 11 hours ago

The wires run over the transistors.

ksec 13 hours ago

I heard it is still hard to buy consumer-grade Nvidia GPUs. At this point I am wondering if it is gaming market demand, AI, or simply a supply issue.

On another note, I am waiting for Nvidia's entry into CPUs. At some point down the line I expect the CPU will be less important (relatively speaking), and Nvidia could afford to throw a CPU into the system as a bonus. Especially when we are expecting ARM X930 to rival Apple's M4 in terms of IPC. CPU design has become somewhat of a commodity.

Incipient 13 hours ago

My understanding is that it's the AI demand and the willingness to pay crazy money for wafers that make consumer GPUs a significantly less attractive product to produce.

I don't have really solid evidence, just semi-anecdotal/semi-reliable internet posts:

Eg. https://www.tomshardware.com/tech-industry/more-than-251-mil...

Nvidia as a whole has been fairly anti-consumer recently with pricing, so I wouldn't be banking on them for a great cpu option. Weirdly Intel is in the position where they have to prove themselves, so hopefully they'll give us some great products in the next 2-5 years - if they survive (think the old lead-up-to-ryzen era for amd)

p_l 7 hours ago

The same chip as the "proper"[1] 5090 is also used for workstation and some server cards, which easily go for higher prices. So it's just an allocation of chips to different products, taking into account that with the power demands and design issues in the 5090's power supply there isn't all that much demand for the 5090 either.

[1] there are now 5090-branded cards that use the same chip as the 5080

KronisLV 10 hours ago

> Nvidia as a whole has been fairly anti-consumer recently with pricing, so I wouldn't be banking on them for a great cpu option.

If they’re swimming in the AI cash and the consumer GPU segment isn’t that important (https://www.visualcapitalist.com/nvidia-revenue-by-product-l...) then why on earth couldn’t they do less price gouging?

It feels a bit like the Intel Core Ultra desktop CPU launch where the prices were the critical factor that doomed an otherwise pretty okay product. At least Intel's excuse is that they’re closer to going under than before, even if their GPUs were pretty fairly priced anyways.

It’s almost like everyone complains about their prices and the fact that they’re releasing 8 GB cards… and then still go and give them money anyways.

jonas21 13 hours ago

> I am waiting for Nvidia's entry to CPU.

Haven't they already started doing this with Grace and GB10?

- https://www.nvidia.com/en-us/data-center/grace-cpu/

- https://nvidianews.nvidia.com/news/nvidia-puts-grace-blackwe...

wtallis 12 hours ago

Their Grace datacenter CPU is basically a chip where they put down all the LPDDR5 memory controllers (albeit curiously slow), NVLINK and PCIe IOs they needed around the perimeter, and then filled in the interior with boring off the shelf ARM cores. It's basically an IO and memory expander that happens to run Linux.

GB10 when it ships might be more interesting, since it'll go into systems that need to support use cases other than merely feeding a big GPU ML workloads. But it sounds like the CPU chiplet at least was more or less outsourced to Mediatek.

magicalhippo 7 hours ago

The whole missing ROPs saga[1][2] didn't help. I bought a 5070 Ti and had to return it due to missing ROPs. Had to get another brand as replacement, as they had so little stock.

[1]: https://gamersnexus.net/gpus/investigating-nvidias-defective...

[2]: https://nvidia.custhelp.com/app/answers/detail/a_id/5628/~/h...

xl-brain 12 hours ago

The Micro Center in my neighborhood has hundreds of 5090s in stock. I'm not sure it's as hard as it used to be.

enqk 9 hours ago

I keep wondering if the yields have gone all bad with the newer processes

dist-epoch 12 hours ago

Why doesn't NVIDIA also build something like Google TPU, a systolic array processor? Less programmable, but more throughput/power efficiency?

It seems there is a huge market for inference.

AlotOfReading 11 hours ago

Nvidia tensor cores are small systolic arrays. They'd have to throw out a lot of their ecosystem investments and backwards compatibility to make effective use of them as the main GPU compute, and there's really no need given how competitive their chips are right now.

aurareturn 11 hours ago

> Less programmable, but more throughput/power efficiency?

I also wonder the same. It'd make sense to sell two categories of chips:

- Traditional GPUs like Blackwell that can do anything and have backwards compatibility.

- Less programmable and more ASIC-like inference chips like Google's TPUs. The inference market is going to be multiple times bigger than training soon.