A good mental model is to compare the number of floats being processed -vs- the number of primitive computations. Matrix multiplication has n^3 computation with n^2 data. Multiplication of large matrices is therefore special in the potential for "data re-use" (each float is used against all the columns or rows of the other matrix) -- so systems are designed to have a much higher flops throughput than memory bandwidth. A dot product is at the other extreme, where each float is used only once (loosely).
Roofline plots [1] are a framework to visualize system design from this perspective.
[1] https://en.wikipedia.org/wiki/Roofline_model
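To put rough numbers on that (a back-of-the-envelope sketch; the matrix size and the example machine balance are made-up placeholders, and it's plain host code that any C++ compiler or nvcc will build):

    #include <cstdio>

    int main() {
        // Square matmul C = A*B with n x n FP32 matrices:
        //   flops = 2*n^3 (a multiply and an add per term)
        //   bytes = 3*n^2*4 (read A, read B, write C; ignoring caching)
        double n = 4096.0;
        printf("matmul intensity:      %.1f flop/byte\n",
               (2.0 * n * n * n) / (3.0 * n * n * 4.0));

        // Dot product of two length-m FP32 vectors:
        //   flops = 2*m, bytes = 2*m*4 -> 0.25 flop/byte at every size
        double m = 1e8;
        printf("dot product intensity: %.2f flop/byte\n",
               (2.0 * m) / (2.0 * m * 4.0));

        // A hypothetical machine with 20 Tflop/s and 500 GB/s needs about
        // 40 flop/byte to be compute-bound; big matmuls clear that easily,
        // the dot product never will.
        printf("machine balance:       %.0f flop/byte\n", 20e12 / 500e9);
        return 0;
    }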
That property is the same reason you don't incur substantial overhead doing large matrix multiplications sharded over disks or many machines. You apply the same chunking strategies used to make optimal use of the L1/L2/L3 caches, just at the level of NUMA nodes, physical disks, machines, and clusters. So long as each "cache" is big enough for the N^3/N^2 term to dominate communication overhead (especially if that communication can happen concurrently), the networked result is about as fast as the individual machines running at their max FLOPs for some smaller problem.
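A minimal sketch of that chunking idea (naive tiling in plain host code; the sizes are placeholders). The same loop structure applies whether the "fast level" is L2, a NUMA node, or one machine of a cluster; only the block size changes:

    #include <cstdio>
    #include <vector>

    // C += A * B for n x n row-major matrices, processed in bs x bs blocks so
    // each block triple fits in whatever "cache" level you care about.
    void blocked_matmul(const float* A, const float* B, float* C, int n, int bs) {
        for (int bi = 0; bi < n; bi += bs)
            for (int bj = 0; bj < n; bj += bs)
                for (int bk = 0; bk < n; bk += bs)
                    // Inside a block: O(bs^3) flops against O(bs^2) resident
                    // data, so the communication per block gets amortized.
                    for (int i = bi; i < bi + bs && i < n; ++i)
                        for (int k = bk; k < bk + bs && k < n; ++k)
                            for (int j = bj; j < bj + bs && j < n; ++j)
                                C[i * n + j] += A[i * n + k] * B[k * n + j];
    }

    int main() {
        int n = 256, bs = 64; // placeholder sizes
        std::vector<float> A(n * n, 1.0f), B(n * n, 1.0f), C(n * n, 0.0f);
        blocked_matmul(A.data(), B.data(), C.data(), n, bs);
        printf("C[0] = %.0f (expect %d)\n", C[0], n);
        return 0;
    }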
It looks like a dozen people found this helpful. As a related idea, it's one of the main reasons batch inference is so much more efficient in ML. You transmute the problem from memory-bound to compute-bound.
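Rough numbers for why batching does that (a sketch with made-up layer and batch sizes): the weight matrix is read once per batch, so its memory traffic is amortized over the batch dimension while the flops scale with it.

    #include <cstdio>

    int main() {
        // One FP16 linear layer, d x d weights, applied to a batch of b tokens:
        //   flops ~= 2 * b * d^2
        //   bytes ~= 2*d^2 (weights) + 2*b*d (inputs) + 2*b*d (outputs)
        double d = 8192.0;
        double batches[] = {1, 8, 64, 512};
        for (double b : batches) {
            double flops = 2.0 * b * d * d;
            double bytes = 2.0 * d * d + 4.0 * b * d;
            printf("batch %4.0f -> %6.1f flop/byte\n", b, flops / bytes);
        }
        // Batch 1 is ~1 flop/byte (pure weight streaming, memory-bound);
        // larger batches scale roughly with b and become compute-bound.
        return 0;
    }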
This is indeed a good model. In the article's context, though, the dGPU has more of both bandwidth and flops, so the compute-intensity balance isn't necessarily the deciding factor in whether there's a speedup on the GPU.
In this article the deciding factor seems to be the startup cost, because the application has placed the data on the CPU memory side and is considering shipping it out to GPU memory just for this one computation.
You can modify the roofline model to include the PCIe bandwidth. It is sometimes called a hierarchical roofline model.
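A tiny sketch of that (illustrative placeholder numbers, not the article's hardware): take the worst of the compute roof and each transfer roof, and whichever level dominates is your bound.

    #include <algorithm>
    #include <cstdio>

    int main() {
        // Dot product over two FP32 vectors totalling 2 GB, data starting in host RAM.
        double bytes      = 2e9;
        double flops      = 2.0 * (bytes / 8.0); // one multiply-add per float pair
        double pcie_bw    = 12e9;   // ~PCIe 3.0 x16 effective, as in the article
        double vram_bw    = 300e9;  // placeholder device memory bandwidth
        double peak_flops = 2e12;   // placeholder FP32 throughput

        printf("PCIe %.3fs vs VRAM %.4fs vs compute %.5fs\n",
               bytes / pcie_bw, bytes / vram_bw, flops / peak_flops);
        printf("hierarchical roofline time: %.3f s (PCIe-bound)\n",
               std::max({flops / peak_flops, bytes / vram_bw, bytes / pcie_bw}));
        return 0;
    }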
This is amplified even more by the fact that only the trivial implementation of matmul is O(n^3), whereas efficient ones (e.g. BLAS) use things like the Strassen algorithm. You can also speed it up significantly by using cache-aware approaches when retrieving rows and columns. In practice there is a huge amount of theory behind this that is far beyond the average person's scope if they are not actual researchers.
Is there actually a BLAS implementation that uses Strassen?
I don't think it's accurate that only trivial implementations use the direct O(n^3) algorithm. AFAIK high-performance BLAS implementations just use highly optimized versions of it.
BLAS is just the library definition, not an implementation, so BLAS implementations could implement GEMM any way they want. But in practice the triple-loop O(n^3) method is the most common, despite Strassen's algorithm and the more numerically stable Winograd variant being well known and available for decades. As with most things involving real computing hardware, memory access patterns and locality tend to matter more for performance than operation counts.
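A small illustration of that last point (plain host code, placeholder sizes): both functions below do exactly the same 2n^3 operations, but the second walks B and C contiguously and is typically several times faster on a cache-based machine.

    #include <cstdio>
    #include <vector>

    // Same operation count, different memory access pattern.
    void matmul_ijk(const float* A, const float* B, float* C, int n) {
        for (int i = 0; i < n; ++i)
            for (int j = 0; j < n; ++j)
                for (int k = 0; k < n; ++k)      // strides through B column-wise
                    C[i * n + j] += A[i * n + k] * B[k * n + j];
    }

    void matmul_ikj(const float* A, const float* B, float* C, int n) {
        for (int i = 0; i < n; ++i)
            for (int k = 0; k < n; ++k) {
                float a = A[i * n + k];
                for (int j = 0; j < n; ++j)      // contiguous in B and C
                    C[i * n + j] += a * B[k * n + j];
            }
    }

    int main() {
        int n = 512; // placeholder
        std::vector<float> A(n * n, 1.0f), B(n * n, 1.0f), C1(n * n, 0.0f), C2(n * n, 0.0f);
        matmul_ijk(A.data(), B.data(), C1.data(), n);
        matmul_ikj(A.data(), B.data(), C2.data(), n);
        printf("results match: %s\n", C1 == C2 ? "yes" : "no");
        return 0;
    }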
I remember reading that it's too hard to get good memory bandwidth/L2 utilization in the fancy algorithms: you need to read contiguous blocks and be able to use them repeatedly. But I also haven't looked at the GPU BLAS implementations directly.
AIUI, Strassen gets used moderately commonly with non-floating-point datatypes, where numerical stability is less of a concern and multiplications are more useful to minimize than memory traffic. But from what I can tell, every floating-point BLAS library eschews Strassen, despite a steady trickle of papers saying "hey, there might be some small wins if we go to Strassen!"
The big issue with Strassen isn't performance - it's numerical stability.
Not that long ago, I tried using the FFT to do matrix multiplication since it was supposed to be asymptotically faster. It turns out that the constant factor is huge compared to the O(n^3) grade-school algorithm that BLAS optimizes via tiling and other tricks. Even if it looks expensive on paper, the cubic algorithm is fast.
I just wish I understood the tricks done to make it so fast so I could implement my own for variations for which there are no pre-existing BLAS implementations. The best BLAS implementations are all closed source sadly.
> The best BLAS implementations are all closed source sadly.
NVidia open-sourced CUTLASS [0] some years ago and it achieves pretty competitive performance compared to e.g. the closed-source cuBLAS.
Keen observers will notice that Strassen is not used in CUTLASS.
[0] https://github.com/NVIDIA/cutlass
Using the FFT to do matmul is much more memory intensive, IIRC.
cuDNN supports using the FFT for matmul as well as convolution/correlation, and can also be configured to automatically pick the best algorithm.
In some cases the FFT method has the incidental side-benefit of data reuse, like in the case of FIR filters, where the data allows for partitioned convolution.
Really? I thought that no practical linear algebra library used the Strassen algorithm. Can you provide a source?
The BLAS GEMM routines I have seen use normal blocked algorithms.
I don't know what the real convention is, but IMO BLAS GEMM "is" the O(n^3) algorithm (blocked is fine, of course) in the sense that something like Strassen has stability implications and isn't appropriate for lots of sizes. Just swapping it in would be nuts, haha.
Strassen works best on power of 2 sized matrices. That's not a restriction you'd usually see in a general purpose library.
OK, Strassen and others are better, but they are still O(n^w) where 2 < w < 3.
> Each version is severely memory bandwidth bottlenecked, the CUDA version suffers the most with its practical 11.8 GB/s device-to-host bandwidth due to its PCI-Express 3.0 x16 interface.
PCIe 3.0? What?
https://cowfreedom.de/#appendix/computer_specs/
> GeForce GTX 1050 Ti with Max-Q Design (PCIe 3.0 x16) (2016)
> Intel Core i5-8300H (2020)
This is a low-priced 8-year-old GPU and a 4-year-old CPU. And he seems to be including the time to load the data onto the GPU. Newer cards have wide PCIe 5.0 or some faster interconnect, like Nvidia Grace Hopper.
Also, he is comparing his own CUDA implementation. He should use one of the many available in cuBLAS/CUTLASS. Writing a good CUDA GEMM is a very difficult art and very hardware-specific.
> Newer cards have wide PCIe 5.0 or some faster interconnect, like Nvidia Grace-Hopper.
There aren't any (edit: consumer) GPUs with PCIe5 yet, though they probably aren't far off. Plenty already have PCIe4 though.
Consumer cards are PCIe 4.0 x16. H100 PCIe version is PCIe 5.0 https://en.wikipedia.org/wiki/Hopper_(microarchitecture). And it's been out 2 years already.
For consumer GPUs, PCIe 4.0 x16 has plenty of bandwidth headroom. The full-sized x16 slot is more for stability reasons. Some vendors even put a couple of M.2 slots on PCIe 4/5 GPU boards to recover the unused PCIe lanes.
That's mixing apples with oranges.
Some low-end or mid-range GPUs really only use 8 lanes, because that's how the chip is designed. Fewer lanes -> less silicon area for the PCIe logic -> cheaper.
The chips which use 16 lanes take advantage of it and can saturate the link.
Smaller GPU silicon is also often designed more around the laptop market, where it's common for the CPU to not have more than 8 lanes available for the GPU.
What does a laptop GPU have to do with a server-grade Xeon that the article is comparing against?
So everyone is supposed to do all of their testing on H100's?
The 4090 (2022) at PCIe 4.0 x16 is quite decent. The major limit is memory capacity, not bandwidth. And the 3090 (2020) is also PCIe 4.0 x16, and used cards are a bargain. You can hook them up with NVLink.
Nvidia is withholding new releases, but the current hardware has more legs with new matrix implementations, like FlashAttention delivering a significant improvement every six months.
Nvidia could make consumer chips with a combined CPU-GPU. I guess they are too busy making money with the big clouds. Maybe somebody else will pick up the idea. Apple is already doing something like it, even on laptops.
Get a GH100 on Lambda and behold: you have 900 GB/s between CPU memory and GPU, and can forget about PCIe.
That doesn't add up, because you're now exceeding the bandwidth of the CPU's memory controller. I.e. it would be faster to do everything, including the CPU-only algorithms, in far-away VRAM.
latency might be higher and latency usually dominates CPU side algorithms.
Where are you seeing 900 GB/s?
The 900 GB/s is a figure often cited for Hopper-based SXM boards; it's the aggregate of 18 NVLink connections. So it's more of a many-to-many, GPU-to-GPU bandwidth figure.
https://www.datacenterknowledge.com/data-center-hardware/nvi...
This is bidirectional BW, and it's the GH200, not GH100.
TIL, I missed Hopper already having it. I assume the RTX 5000 series will bring it to consumers.
Well, let me tell you about the pretty high-end and expensive L40 that shipped with PCIe 4.0, to my utter dismay and disgust. Only the H100 had 5.0, although I could already saturate 4.0 (and 5.0) with Mellanox NICs and GPUDirect/Storage Direct. Waiting for the next one to maybe get 5.0.
Supposedly the Intel B580 releasing Friday will use PCIe 5.0 x8.
PCIe 4.0 16x
https://www.intel.com/content/www/us/en/products/sku/227961/...
That's A580. Correct link https://www.intel.com/content/www/us/en/products/sku/241598/...
"PCI Express 4.0 x8"
So Intel is doing the same kind of artificial performance limitation. Nvidia 4060s are also PCIe 4.0 with half the lanes (x8). Argh.
Alternatively, at that performance level it doesn't make a difference, so better to save the $5....
That's not true; the H100 NVL is PCIe Gen 5 x16.
The CPU was launched in Q2 2018, so that is also 6 years. I wonder what the outcome will be with a CPU that supports AVX-512 and a more recent GPU.
GH100 can do 900GB/s HtoD.
And both the 3090 and 4090 can do 32 GB/s host-to-device. Not far from CPU-RAM bandwidth. You only load the matrix once. The bandwidth for the matmul itself is orders of magnitude larger and happens entirely on the device, mostly in cache.
No it can't. That's D2D.
no.
[root@gh200 nvbandwidth]# nvidia-smi | grep GH200
| 0 NVIDIA GH200 480GB On | 00000009:01:00.0 Off | 0 |
[root@gh200 nvbandwidth]# ./nvbandwidth | grep -E 'host_to_device_memcpy_sm|device_to_host_memcpy_sm' | grep ^SUM
SUM host_to_device_memcpy_sm 357.45
SUM device_to_host_memcpy_sm 352.05
[root@gh200 nvbandwidth]#
d2d is much higher:
[root@gh200 cuda-samples]# ./bin/sbsa/linux/release/bandwidthTest --dtod | grep -B1 32000
[root@gh200 cuda-samples]#
Well, device-to-device is technically doubled because you have a read and a write. But yes.
No. D2D is around 3 TB/s.
There is the compute vs communicate ratio.
For problems like matrix multiplication, it costs ~n^2 to communicate the problem but ~n^3 operations to calculate.
For problems like the dot product, it costs n to communicate but only n operations to calculate.
Compute must be substantially larger than communication cost if you hope to see any benefit. Asymptotic differences obviously help, but a large constant-factor gap can be enough too.
You'd never transfer n data points to perform a log(n) binary search, for example. At that point communication dominates.
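To put placeholder numbers on that ratio (the 12 GB/s link and 4 Tflop/s device below are round guesses, not measurements): the matmul crosses over once n is large enough, the dot product never does.

    #include <cstdio>

    int main() {
        double pcie_bw = 12e9;  // bytes/s over the host<->device link (placeholder)
        double gpu_fp  = 4e12;  // device flop/s (placeholder)

        // Square FP32 matmul: move 3*n^2*4 bytes, do 2*n^3 flops.
        double sizes[] = {256, 1024, 4096, 16384};
        for (double n : sizes) {
            double t_comm = (3.0 * n * n * 4.0) / pcie_bw;
            double t_comp = (2.0 * n * n * n) / gpu_fp;
            printf("matmul n=%6.0f  comm %.4fs  comp %.4fs  %s\n", n, t_comm, t_comp,
                   t_comp > t_comm ? "compute-dominated" : "transfer-dominated");
        }

        // Dot product: move 2*m*4 bytes, do 2*m flops -- a fixed 1 flop per
        // 4 bytes, so transfer dominates at every size.
        double m = 1e8;
        printf("dot    m=%6.0e  comm %.4fs  comp %.6fs\n",
               m, 8.0 * m / pcie_bw, 2.0 * m / gpu_fp);
        return 0;
    }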
For those skimming, and to add to the above: the article uses the GPU to work on data that lives in system memory, since that's where the initial data is and where they want the result, and compares that to a CPU doing the same. The entire bottleneck is the GPU-to-system-memory path.
If you're willing to work entirely out of GPU memory, the GPU will of course be faster even in this scenario.
Assuming your task is larger than kernel launch overhead, of course.
Sure, but once you've justified moving data onto the GPU you don't want to incur the cost of moving the operation output back to the CPU unless you have to. So, for example, you might justify moving data to the GPU for a neural net convolution, but then also execute the following activation function (& subsequent operators) there because that's now where the data is.
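A minimal CUDA sketch of that pattern (not the article's code; the kernels are trivial stand-ins for "convolution" and "activation"): pay for one copy in, chain the kernels on resident device memory, pay for one copy out.

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void scale(float* x, int n, float a) {   // stand-in for the conv
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] *= a;
    }

    __global__ void relu(float* x, int n) {             // the following activation
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] = x[i] > 0.0f ? x[i] : 0.0f;
    }

    int main() {
        const int n = 1 << 20;
        float* h = new float[n];
        for (int i = 0; i < n; ++i) h[i] = (i % 2) ? 1.0f : -1.0f;

        float* d;
        cudaMalloc(&d, n * sizeof(float));
        cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice); // pay once

        int threads = 256, blocks = (n + threads - 1) / threads;
        scale<<<blocks, threads>>>(d, n, 2.0f);  // keep chaining work where
        relu<<<blocks, threads>>>(d, n);         // the data already lives
        cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost); // copy back once

        printf("h[0]=%.1f h[1]=%.1f\n", h[0], h[1]); // expect 0.0 and 2.0
        cudaFree(d);
        delete[] h;
        return 0;
    }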
So a while back I was working on a chaotic renderer. This led me to a really weird set of situations (rough device-selection sketch after the list):
* If the GPU is a non-dedicated, older-style Intel GPU, use the CPU.
* If the GPU is non-dedicated anything else, do anything super-parallel on the GPU, but keep anything the CPU can BRRRRRT through on the CPU, because the memory is shared.
* If the GPU is dedicated, move everything to GPU memory, keep it there, and only pull back small statistics if at all possible.
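A rough sketch of that decision with the CUDA runtime (the policy is just this comment's heuristics restated, not a general rule; the "older Intel iGPU" case ends up in the no-CUDA-device branch here, since CUDA won't see it at all):

    #include <cstdio>
    #include <cuda_runtime.h>

    enum class Plan { CpuOnly, SplitSharedMemory, AllOnGpu };

    Plan pick_plan() {
        int count = 0;
        if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0)
            return Plan::CpuOnly;            // no usable GPU at all

        cudaDeviceProp p{};
        cudaGetDeviceProperties(&p, 0);

        if (p.integrated)
            // Shared memory: run the wide-parallel parts on the GPU, keep the
            // branchy/serial parts on the CPU, no copies needed.
            return Plan::SplitSharedMemory;

        // Dedicated card: push everything into VRAM, keep it resident, and
        // only read back small statistics.
        return Plan::AllOnGpu;
    }

    int main() {
        switch (pick_plan()) {
            case Plan::CpuOnly:           printf("CPU only\n"); break;
            case Plan::SplitSharedMemory: printf("split, shared memory\n"); break;
            case Plan::AllOnGpu:          printf("everything on the GPU\n"); break;
        }
        return 0;
    }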
Comparing multicore wide AVX to CUDA is a bit of an unnecessary nuance for most folks. These comparisons make sense, but miss the forest for the trees:
- Either way, you're writing 'CUDA-style' fine-grained data-parallel code that looks and works very differently from regular multithreaded code. You are now in a different software universe.
- You now also have to think about throughput, latency hiding, etc. Nvidia has been commoditizing throughput-oriented hardware a lot better than others, and while AMD is catching up on some workloads, Nvidia is already advancing. This is where we think about bandwidth between network/disk=>compute unit. My best analogy here, when looking at things like GPU Direct Storage/Network, is CPU systems feel like a long twisty straw, while GPU paths are fat pipes. Big compute typically needs both compute + IO, and hardware specs tell you the bandwidth ceiling.
To a large extent, ideas are cross-pollinating -- CPUs looking more like GPUs, and GPUs getting the flexibility of CPUs -- but either way, you're in a different universe of how code & hardware work than 1990s and early-2000s Intel.
Realistically you should use Numpy or Cupy (or whatever the appropriate/fashionable library is) anyway, because tuning this stuff is a big pain.
So, GPUs have the slight disadvantages that you have to think about data movement and the drivers are a little less convenient to install, but it isn't really a big deal.
I am a big fan of jax for numerical computations these days.
I’ve been seeing lots of posts about it lately. Haven’t had a chance to try it out, though.
Agreed! The bigger shift is switching to data parallel coding styles.
TBH, I'm finding that people underestimate the usefulness of CPUs in both inference and fine-tuning. PEFT with access to 64GB+ RAM and lots of cores can sometimes be cost-effective.
I think engineers learn this quickly in high-scale/performance production environments. Even without hardware backgrounds. SLAs/costs create constraints you need to optimize against after promising the business line these magical models can enable that cool new feature for a million users.
Traditional AI/ML models (including smaller transformers) can definitely be optimized for mass scale/performance on cpu-optimized infrastructure.
For simple operations like the dot product (which also map extremely well to SIMD), yes, the CPU is often better, as there's not much actual "computation" being done. More complex computations where the data does not need to move back and forth between host and device amortize that transfer cost across multiple operations, and the balance can quickly swing in favor of the GPU.
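Since the whole thread is about dot products, here's what "maps extremely well to SIMD" looks like in practice (a sketch with AVX/FMA intrinsics, host-only code, not the article's implementation; a tuned version would use several accumulators and multiple threads):

    #include <cstdio>
    #include <immintrin.h>
    #include <vector>

    // FP32 dot product, 8 lanes per FMA. Build with -mavx2 -mfma (or pass the
    // same flags to the host compiler behind nvcc).
    float dot_avx(const float* a, const float* b, size_t n) {
        __m256 acc = _mm256_setzero_ps();
        size_t i = 0;
        for (; i + 8 <= n; i += 8)
            acc = _mm256_fmadd_ps(_mm256_loadu_ps(a + i),
                                  _mm256_loadu_ps(b + i), acc);
        float lane[8];
        _mm256_storeu_ps(lane, acc);
        float sum = lane[0] + lane[1] + lane[2] + lane[3]
                  + lane[4] + lane[5] + lane[6] + lane[7];
        for (; i < n; ++i) sum += a[i] * b[i];   // scalar tail
        return sum;
    }

    int main() {
        size_t n = 1 << 20;
        std::vector<float> a(n, 0.5f), b(n, 2.0f);
        printf("dot = %.1f (expect %zu)\n", dot_avx(a.data(), b.data(), n), n);
        return 0;
    }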
It's the relative difference of transfer overhead vs degree of compute. For one single operation, sure, the transfer overhead dominates. Add multiple compute steps (operations) however, and experiments will show that the GPU is faster as the transfer cost is fixed.
It would be interesting to see what a unified-memory setup can do, say an Apple M-series chip, since this is the argument for unified memory: zero-copy memory access between CPU and GPU.
Unified memory is what makes using the GPU viable in my use case (mobile). The copy operation is almost always the slowest part. This is especially true for real time work.
Even the integrated Intel HD Graphics would be an interesting comparison.
Simpler research could've shown that there is a physical data transfer cost.
Yeah, classic GPU use cases like deep learning have you transfer the weights for the entire model to your GPU(s) once up front, and after that you only transfer your inputs over.
The use case of transferring ALL data over every time is obviously misusing the GPU.
If you've ever tried running a model that's too large for your GPU, you'll have experienced how slow this is: the model has to be pulled in piece by piece for a single inference run.
Taking a helicopter is not always faster than walking.
Is this surprising or obvious?
An otherwise valid point made using a terrible example.
Terrible post, really. Did they need 5 pages to say that a silly dot-product micro-benchmark that is PCIe-bound loses to a CPU implementation? Why are they even comparing computation vs computation + memory transfer?
Because going to the GPU adds the overhead of the memory transfer, and this is a simple way to demonstrate that. Did you know that already? Congrats, you're in the top 10% of software engineers.
You still don't need the unwieldy blog post to explain that. The SIMD section is unnecessary. Comparing the naive dot product on CPU vs GPU (+round-trip transfer) would have sufficed.
But how will people be impressed if you don't exhaust them with jargon?
Question from someone who doesn't know enough about GPUs: recently a friend mentioned his workstation has 384 cores across 4 processors. This is starting to approach the core counts of earlier GPUs.
Is there a possibility that in the not too distant future that GPUs and CPUs will just converge? Or are the tasks done by GPUs too specialized?
CPUs and GPUs are fundamentally aiming at different vertices of the performance polygon.
CPUs aim to minimize latency (how many cycles have to pass before you can use a result), and do so by way of high clock frequencies, caches, and fancy microarchitectural tricks. This is what you want in most general computation cases where you don't have other work to do whilst you wait.
GPUs instead just context switch to a different thread whilst waiting on a result. They hide their latency by making parallelism as cheap as possible. You can have many more cores running at a lower clock frequency and be more efficient as a result. But this only works if you have enough parallelism to keep everything busy whilst waiting for things to finish on other threads. As it happens that's pretty common in large matrix computations done in machine learning, so they're pretty popular there.
Will they converge? I don't think so - they're fundamentally different design points. But it may well be that they get integrated at a much closer level than current designs, pushing the heterogeneous/dark silicon/accelerator direction to an extreme.
Single GPU threads are up to 20x slower than CPU threads. GPUs get their speed from massive parallelization, SIMD-style "do N things with one instruction", and some specialized hardware (like texture samplers).
If you take a serial algorithm and put it on the GPU, it's easy to verify that a single GPU thread is much slower than a single thread on the CPU. For example, just do a bubble sort on the GPU with a single thread. I'm not even including the time to transfer data or read the result. You'll easily find the CPU is way faster.
The way you get GPU speed is by finding/designing algorithms that are massively parallel. There are lots of them. There are sorting solutions for example.
As an example, 100 cores * 32 execution units per core = 3200 lanes; divide by the ~20x per-thread penalty and you're still ~160x faster than the CPU, if you can figure out a parallel solution. But not every problem can be solved with a parallel solution, and where it can't, the CPU wins.
It seems unlikely GPU threads will be as fast as CPU threads. They get their massive parallelism by being simpler.
That said, who knows what the future holds.
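A quick CUDA sketch of the point being made here (deliberately contrived; absolute timings vary by card): the same reduction launched with one thread versus a whole grid. The single-thread launch is the "a GPU thread is slow" case; the grid is where the speed comes from.

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void sum_single_thread(const float* x, int n, float* out) {
        float s = 0.0f;                 // one thread crawls the whole array
        for (int i = 0; i < n; ++i) s += x[i];
        *out = s;
    }

    __global__ void sum_grid(const float* x, int n, float* out) {
        float s = 0.0f;                 // grid-stride loop, one atomic per thread
        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
             i += gridDim.x * blockDim.x)
            s += x[i];
        atomicAdd(out, s);
    }

    int main() {
        const int n = 1 << 22;
        float *d_x, *d_out;
        cudaMalloc(&d_x, n * sizeof(float));
        cudaMalloc(&d_out, sizeof(float));
        cudaMemset(d_x, 0, n * sizeof(float)); // zeros; we only care about time

        sum_grid<<<256, 256>>>(d_x, n, d_out); // warm-up, outside the timing
        cudaDeviceSynchronize();

        cudaEvent_t t0, t1;
        cudaEventCreate(&t0); cudaEventCreate(&t1);
        float ms;

        cudaMemset(d_out, 0, sizeof(float));
        cudaEventRecord(t0);
        sum_single_thread<<<1, 1>>>(d_x, n, d_out);
        cudaEventRecord(t1); cudaEventSynchronize(t1);
        cudaEventElapsedTime(&ms, t0, t1);
        printf("1 thread : %8.3f ms\n", ms);

        cudaMemset(d_out, 0, sizeof(float));
        cudaEventRecord(t0);
        sum_grid<<<256, 256>>>(d_x, n, d_out);
        cudaEventRecord(t1); cudaEventSynchronize(t1);
        cudaEventElapsedTime(&ms, t0, t1);
        printf("grid     : %8.3f ms\n", ms);

        cudaFree(d_x); cudaFree(d_out);
        return 0;
    }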
This is totally inaccurate. There is nothing 20x slower about a GPU core unless you're nitpicking, in which case you can show those same GPU cores are 20x faster than CPU for other things.
Do you have examples of a single GPU thread outperforming a single CPU thread?
Here's mine
https://jsfiddle.net/jw7a6to9/ bubblesort
https://jsfiddle.net/y1w6s9tj/ taylor series
> in which case you can show those same GPU cores are 20x faster than CPU for other things
Which things? Remember, I wrote single thread, no SIMD, no samplers. It's the parallelism that provides the speed.
Too specialized. You can't use GPUs as general purpose computers. The basic unit of operation is the warp, which is 32 threads operating in lockstep (simplified). If you're not using all 32 threads, then you may as well not be using a GPU.
They're really similar already. You can program a GPU much like you would a CPU (existence proof at https://news.ycombinator.com/item?id=42387267). There's a lot of obfuscation and hoarding of ancient knowledge from the "GPUs are special" enthusiasts, but the magic doesn't survive looking at the things. It's a masked vector ISA.
My dev GPU is a 6800XT. Cheapish gaming card from a little while ago, 16GB RAM on the card. 72 "compute units", which are independent blocks of hardware containing memory ports, floating point units, a register file, etc. Roughly "a core" from the x64 world. Each of those can have up to 64 tasks ready to go, roughly a "hyperthread". It's 300W or so.
There's some noise in the details, e.g. the size of the register file from the perspective of a hyperthread affects how many can be resident on the compute unit ready to run, and the memory hierarchy has extra layers in it. The vector unit is 256 bytes wide as opposed to 64 bytes wide on x64.
But if you wanted to run a web browser entirely on the GPU and were sufficiently bloody-minded, you'd get it done, with the CPU routing keyboard I/O to it and nothing else. If you want a process to sit on the GPU talking to the network and crunching numbers, you don't need the x64 or ARM host to do anything at all.
You'd also need extreme hyperthreading. A GPU can cycle between several warps in the same execution unit (barrel-processor-style), padding out the time per instruction to hide memory latency, while still getting the same throughput. That's counter to the fundamental design of CPUs.
Memory locality depends on your perspective.
The CPU would always be slower if the data originated in GPU memory.
Not necessarily; I'm sure one could construct scenarios where offloading to the CPU makes sense. Think of highly branchy double-precision stuff.
This article is a CPU benchmark and a PCI express bandwidth benchmark. It's masquerading as something else though.
To save some time for people: the title is absolutely incorrect. The GPU is significantly faster for this test, but the author is measuring PCIe bandwidth on a PCIe generation from 10 years ago. If instead they had used PCIe 5.0, the bandwidth would be quadrupled.
How can the multicore AVX implementation do a dot product (for arrays much larger than cache) at 340 GB/s on a system with RAM bandwidth < 50 GB/s?
Answer: it can’t.
The author has updated the post with corrected AVX measurements, with the original ~340 GB/s revised down to 31.7 GB/s. (Thanks CowFreedom)
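For anyone following along, the sanity check behind the question is just bytes over seconds (the vector length and the measured time below are hypothetical):

    #include <cstdio>

    int main() {
        double n       = 1e9;           // floats per vector (hypothetical)
        double bytes   = 2.0 * n * 4.0; // both FP32 arrays must stream from RAM
        double t_meas  = 0.0235;        // hypothetical measured seconds
        printf("implied bandwidth: %.0f GB/s\n", bytes / t_meas / 1e9);
        // If that number exceeds what DRAM can deliver (~50 GB/s on the
        // article's CPU), the benchmark is measuring something else: cached
        // data, a too-short timer, or an optimized-away loop.
        return 0;
    }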
I think the post is a bit disingenuous.
But about bandwidth, matrix multiplications happen mostly in cache and that has a lot more bandwidth than RAM. Blocks of the matrix are loaded to cache (explicitly in CUDA) and used multiple times there.
I'd exploit the better multi-level cache hierarchy in CPUs and make the code NUMA aware. But still I wouldn't bet against a recent GPU card.
> But about bandwidth, matrix multiplications happen mostly in cache and that has a lot more bandwidth than RAM. Blocks of the matrix are loaded to cache (explicitly in CUDA) and used multiple times there.
The post is about dot product, not matrix multiply. Dot product has no data reuse
The GPU is never faster. It's parallel.
The same thing applies to using a GPU to do inference with your weights in system memory. That is why nobody does that.
Even if the GPU took literally no time at all to compute the results there would be workflows where doing it on the CPU was faster.
The GPU is the gas station across town that’s five cents cheaper.
No, it’s the bulk order from China that’s 100x cheaper but has a minimum order quantity of 100000 units and takes 6-8 weeks to get here.
this is simply not true. the post just uses outdated hardware, as pointed out above.
A GH200 will run rings around any CPU.
The gist of the post is that optimizations and interpretations thereof must always be made with respect to the underlying hardware.
In this analogy, using better hardware affects the discount per gallon, but there are still situations where the closer gas station is the better choice.
Good analogy.
L2 cache is 1000x closer than the PCIe bus. Pretty much anything that has to respond in ~realtime to outside events will run better on the CPU. You can use the GPU to visualize the state of the system with some small delay (e.g., video games), but it is not so great at modifying state in an efficient manner - especially when serialization of events & causality are important.
> ~realtime to outside events
Well, to play devil's advocate: for the outside event to affect the CPU, a signal will have to go through the PCIe bus or equivalent.