A good mental model is to compare the number of floats being processed -vs- the number of primitive computations. Matrix multiplication has n^3 computation with n^2 data. Multiplication of large matrices is therefore special in the potential for "data re-use" (each float is used against all the columns or rows of the other matrix) -- so systems are designed to have a much higher flops throughput than memory bandwidth. A dot product is at the other extreme, where each float is used only once (loosely).
Roofline plots [1] are a framework to visualize system design from this perspective.
[1] https://en.wikipedia.org/wiki/Roofline_model
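To put rough numbers on that (a back-of-the-envelope sketch; the matrix size and the example machine balance are made-up placeholders, and it's plain host code that any C++ compiler or nvcc will build):

    #include <cstdio>

    int main() {
        // Square matmul C = A*B with n x n FP32 matrices:
        //   flops = 2*n^3 (a multiply and an add per term)
        //   bytes = 3*n^2*4 (read A, read B, write C; ignoring caching)
        double n = 4096.0;
        printf("matmul intensity:      %.1f flop/byte\n",
               (2.0 * n * n * n) / (3.0 * n * n * 4.0));

        // Dot product of two length-m FP32 vectors:
        //   flops = 2*m, bytes = 2*m*4 -> 0.25 flop/byte at every size
        double m = 1e8;
        printf("dot product intensity: %.2f flop/byte\n",
               (2.0 * m) / (2.0 * m * 4.0));

        // A hypothetical machine with 20 Tflop/s and 500 GB/s needs about
        // 40 flop/byte to be compute-bound; big matmuls clear that easily,
        // the dot product never will.
        printf("machine balance:       %.0f flop/byte\n", 20e12 / 500e9);
        return 0;
    }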
That property is the same reason you don't incur substantial overhead doing large matrix multiplications sharded over disks or many machines. You apply the same chunking strategies used to make optimal use of the L1/L2/L3 caches, just at the level of NUMA nodes, physical disks, machines, and clusters. So long as each "cache" is big enough for the N^3/N^2 term to dominate communication overhead (especially if that communication can happen concurrently), the networked result is about as fast as the individual machines running at their max FLOPs for some smaller problem.
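A minimal sketch of that chunking idea (naive tiling in plain host code; the sizes are placeholders). The same loop structure applies whether the "fast level" is L2, a NUMA node, or one machine of a cluster; only the block size changes:

    #include <cstdio>
    #include <vector>

    // C += A * B for n x n row-major matrices, processed in bs x bs blocks so
    // each block triple fits in whatever "cache" level you care about.
    void blocked_matmul(const float* A, const float* B, float* C, int n, int bs) {
        for (int bi = 0; bi < n; bi += bs)
            for (int bj = 0; bj < n; bj += bs)
                for (int bk = 0; bk < n; bk += bs)
                    // Inside a block: O(bs^3) flops against O(bs^2) resident
                    // data, so the communication per block gets amortized.
                    for (int i = bi; i < bi + bs && i < n; ++i)
                        for (int k = bk; k < bk + bs && k < n; ++k)
                            for (int j = bj; j < bj + bs && j < n; ++j)
                                C[i * n + j] += A[i * n + k] * B[k * n + j];
    }

    int main() {
        int n = 256, bs = 64; // placeholder sizes
        std::vector<float> A(n * n, 1.0f), B(n * n, 1.0f), C(n * n, 0.0f);
        blocked_matmul(A.data(), B.data(), C.data(), n, bs);
        printf("C[0] = %.0f (expect %d)\n", C[0], n);
        return 0;
    }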
It looks like a dozen people found this helpful. As a related idea, it's one of the main reasons batch inference is so much more efficient in ML. You transmute the problem from memory-bound to compute-bound.
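Rough numbers for why batching does that (a sketch with made-up layer and batch sizes): the weight matrix is read once per batch, so its memory traffic is amortized over the batch dimension while the flops scale with it.

    #include <cstdio>

    int main() {
        // One FP16 linear layer, d x d weights, applied to a batch of b tokens:
        //   flops ~= 2 * b * d^2
        //   bytes ~= 2*d^2 (weights) + 2*b*d (inputs) + 2*b*d (outputs)
        double d = 8192.0;
        double batches[] = {1, 8, 64, 512};
        for (double b : batches) {
            double flops = 2.0 * b * d * d;
            double bytes = 2.0 * d * d + 4.0 * b * d;
            printf("batch %4.0f -> %6.1f flop/byte\n", b, flops / bytes);
        }
        // Batch 1 is ~1 flop/byte (pure weight streaming, memory-bound);
        // larger batches scale roughly with b and become compute-bound.
        return 0;
    }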
This is indeed a good model. In the article's context, though, the dGPU has more of both bandwidth and flops, so the compute-intensity balance isn't necessarily the deciding factor in whether there's a speedup on the GPU.
In this article the deciding factor seems to be the startup cost, because the application has placed the data on the CPU memory side and is considering shipping it out to GPU memory just for this one computation.
You can modify the roofline model to include the PCIe bandwidth. It is sometimes called a hierarchical roofline model.
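A tiny sketch of that (illustrative placeholder numbers, not the article's hardware): take the worst of the compute roof and each transfer roof, and whichever level dominates is your bound.

    #include <algorithm>
    #include <cstdio>

    int main() {
        // Dot product over two FP32 vectors totalling 2 GB, data starting in host RAM.
        double bytes      = 2e9;
        double flops      = 2.0 * (bytes / 8.0); // one multiply-add per float pair
        double pcie_bw    = 12e9;   // ~PCIe 3.0 x16 effective, as in the article
        double vram_bw    = 300e9;  // placeholder device memory bandwidth
        double peak_flops = 2e12;   // placeholder FP32 throughput

        printf("PCIe %.3fs vs VRAM %.4fs vs compute %.5fs\n",
               bytes / pcie_bw, bytes / vram_bw, flops / peak_flops);
        printf("hierarchical roofline time: %.3f s (PCIe-bound)\n",
               std::max({flops / peak_flops, bytes / vram_bw, bytes / pcie_bw}));
        return 0;
    }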
This is amplified even more by the fact that only the trivial implementation of matmul is O(n^3), whereas efficient ones (e.g. BLAS) use things like the Strassen algorithm. You can also speed it up significantly by using cache-aware approaches when retrieving rows and columns. In practice there is a huge amount of theory behind this that is far beyond the average person's scope if they are not actual researchers.
Is there actually a BLAS implementation that uses Strassen?
I don't think it's accurate that only trivial implementations use the direct O(n^3) algorithm. AFAIK high-performance BLAS implementations just use highly optimized versions of it.
BLAS is just the library definition, not an implementation, so BLAS implementations could implement GEMM any way they want. But in practice the triple-loop O(n^3) method is the most common, despite Strassen's algorithm and the more numerically stable Winograd variant being well known and available for decades. As with most things involving real computing hardware, memory access patterns and locality tend to matter more for performance than operation counts.
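A small illustration of that last point (plain host code, placeholder sizes): both functions below do exactly the same 2n^3 operations, but the second walks B and C contiguously and is typically several times faster on a cache-based machine.

    #include <cstdio>
    #include <vector>

    // Same operation count, different memory access pattern.
    void matmul_ijk(const float* A, const float* B, float* C, int n) {
        for (int i = 0; i < n; ++i)
            for (int j = 0; j < n; ++j)
                for (int k = 0; k < n; ++k)      // strides through B column-wise
                    C[i * n + j] += A[i * n + k] * B[k * n + j];
    }

    void matmul_ikj(const float* A, const float* B, float* C, int n) {
        for (int i = 0; i < n; ++i)
            for (int k = 0; k < n; ++k) {
                float a = A[i * n + k];
                for (int j = 0; j < n; ++j)      // contiguous in B and C
                    C[i * n + j] += a * B[k * n + j];
            }
    }

    int main() {
        int n = 512; // placeholder
        std::vector<float> A(n * n, 1.0f), B(n * n, 1.0f), C1(n * n, 0.0f), C2(n * n, 0.0f);
        matmul_ijk(A.data(), B.data(), C1.data(), n);
        matmul_ikj(A.data(), B.data(), C2.data(), n);
        printf("results match: %s\n", C1 == C2 ? "yes" : "no");
        return 0;
    }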
I remember reading that it's too hard to get good memory bandwidth/L2 utilization in the fancy algorithms: you need to read contiguous blocks and be able to use them repeatedly. But I also haven't looked at the GPU BLAS implementations directly.
AIUI, Strassen gets used moderately commonly with non-floating-point datatypes, where numerical stability is less of a concern and multiplications are more useful to minimize than memory traffic. But from what I can tell, every floating-point BLAS library eschews Strassen, despite a steady trickle of papers saying "hey, there might be some small wins if we go to Strassen!"
The big issue with Strassen isn't performance - it's numerical stability.
Not that long ago, I tried using the FFT to do matrix multiplication since it was supposed to be asymptotically faster. It turns out that the constant factor is huge compared to the O(n^3) grade-school algorithm that BLAS optimizes via tiling and other tricks. Even if it looks expensive on paper, the cubic algorithm is fast.
I just wish I understood the tricks done to make it so fast so I could implement my own for variations for which there are no pre-existing BLAS implementations. The best BLAS implementations are all closed source sadly.
> The best BLAS implementations are all closed source sadly.
NVidia open-sourced CUTLASS [0] some years ago and it achieves pretty competitive performance compared to e.g. the closed-source cuBLAS.
Keen observers will notice that Strassen is not used in CUTLASS.
[0] https://github.com/NVIDIA/cutlass
Using the FFT to do matmul is much more memory intensive, IIRC.
cuDNN supports using the FFT for matmul as well as convolution/correlation, and can also be configured to automatically pick the best algorithm.
In some cases the FFT method has the incidental side-benefit of data reuse, like in the case of FIR filters, where the data allows for partitioned convolution.
Really? I thought that no practical linear algebra library used the Strassen algorithm. Can you provide a source?
The BLAS GEMM routines I have seen use normal blocked algorithms.
I don't know what the real convention is, but IMO BLAS GEMM "is" the O(n^3) algorithm (blocked is fine, of course) in the sense that something like Strassen has stability implications and isn't appropriate for lots of sizes. Just swapping it in would be nuts, haha.
Strassen works best on power of 2 sized matrices. That's not a restriction you'd usually see in a general purpose library.
OK, Strassen and others are better, but they are still O(n^w) where 2 < w < 3.
> Each version is severely memory bandwidth bottlenecked, the CUDA version suffers the most with its practical 11.8 GB/s device-to-host bandwidth due to its PCI-Express 3.0 x16 interface.
PCIe 3.0? What?
https://cowfreedom.de/#appendix/computer_specs/
> GeForce GTX 1050 Ti with Max-Q Design (PCIe 3.0 x16) (2016)
> Intel Core i5-8300H (2020)
This is a low-priced 8-year-old GPU and a 4-year-old CPU. And he seems to be including the time to load the data onto the GPU. Newer cards have wide PCIe 5.0 or some faster interconnect, like Nvidia Grace Hopper.
Also, he is comparing his own CUDA implementation. He should use one of the many available in cuBLAS/CUTLASS. Writing a good CUDA GEMM is a very difficult art and very hardware-specific.
> Newer cards have wide PCIe 5.0 or some faster interconnect, like Nvidia Grace-Hopper.
There aren't any (edit: consumer) GPUs with PCIe5 yet, though they probably aren't far off. Plenty already have PCIe4 though.
Consumer cards are PCIe 4.0 x16. H100 PCIe version is PCIe 5.0 https://en.wikipedia.org/wiki/Hopper_(microarchitecture). And it's been out 2 years already.
For consumer GPUs, PCIe 4.0 x16 has plenty of bandwidth headroom. The full-sized x16 slot is more for stability reasons. Some vendors even put a couple of M.2 slots on PCIe 4/5 GPU boards to recover the unused PCIe lanes.
That's mixing apples with oranges.
Some low-end or mid-range GPUs really only use 8 lanes, because that's how the chip is designed. Fewer lanes -> less silicon area for the PCIe logic -> cheaper.
The chips which use 16 lanes take advantage of it and can saturate the link.
Smaller GPU silicon is also often designed more around the laptop market, where it's common for the CPU to not have more than 8 lanes available for the GPU.
What does a laptop GPU have to do with a server-grade Xeon that the article is comparing against?
So everyone is supposed to do all of their testing on H100's?
The 4090 (2022) at PCIe 4.0 x16 is quite decent. The major limit is memory capacity, not bandwidth. And the 3090 (2020) is also PCIe 4.0 x16, and used cards are a bargain. You can hook them up with NVLink.
Nvidia is withholding new releases, but the current hardware has more legs with new matrix implementations, like FlashAttention delivering a significant improvement every six months.
Nvidia could make consumer chips with a combined CPU-GPU. I guess they are too busy making money with the big clouds. Maybe somebody else will pick up the idea. Apple is already doing something like it, even on laptops.
Get a GH100 on Lambda and behold: you have 900 GB/s between CPU memory and GPU, and can forget about PCIe.
That doesn't add up, because you're now exceeding the bandwidth of the CPU's memory controller. I.e. it would be faster to do everything, including the CPU-only algorithms, in far-away VRAM.
latency might be higher and latency usually dominates CPU side algorithms.
Where are you seeing 900 GB/s?
The 900 GB/s is a figure often cited for Hopper-based SXM boards; it's the aggregate of 18 NVLink connections. So it's more of a many-to-many, GPU-to-GPU bandwidth figure.
https://www.datacenterknowledge.com/data-center-hardware/nvi...
This is bidirectional BW, and it's the GH200, not GH100.
TIL, I missed Hopper already having it. I assume the RTX 5000 series will bring it to consumers.
Well, let me tell you about the pretty high-end and expensive L40 that shipped with PCIe 4.0, to my utter dismay and disgust. Only the H100 had 5.0, although I could already saturate 4.0 (and 5.0) with Mellanox NICs and GPUDirect/Storage Direct. Waiting for the next one to maybe get 5.0.
Supposedly the Intel B580 releasing Friday will use PCIe 5.0 x8.
PCIe 4.0 16x
https://www.intel.com/content/www/us/en/products/sku/227961/...
That's A580. Correct link https://www.intel.com/content/www/us/en/products/sku/241598/...
"PCI Express 4.0 x8"
So Intel is doing the same kind of artificial performance limitation. Nvidia 4060s are also PCIe 4.0 with half the lanes (x8). Argh.
Alternatively, at that performance level it doesn't make a difference, so better to save the $5....
That's not true; the H100 NVL is PCIe Gen 5 x16.
The CPU was launched in Q2 2018, so that is also 6 years. I wonder what the outcome will be with a CPU that supports AVX-512 and a more recent GPU.
GH100 can do 900GB/s HtoD.
And both the 3090 and 4090 can do 32 GB/s host-to-device. Not far from CPU-RAM bandwidth. You only load the matrix once. The bandwidth for the matmul itself is orders of magnitude larger and happens entirely on the device, mostly in cache.
No it can't. That's D2D.
no.
[root@gh200 nvbandwidth]# nvidia-smi | grep GH200
| 0 NVIDIA GH200 480GB On | 00000009:01:00.0 Off | 0 |
[root@gh200 nvbandwidth]# ./nvbandwidth | grep -E 'host_to_device_memcpy_sm|device_to_host_memcpy_sm' | grep ^SUM
SUM host_to_device_memcpy_sm 357.45
SUM device_to_host_memcpy_sm 352.05
[root@gh200 nvbandwidth]#
d2d is much higher:
[root@gh200 cuda-samples]# ./bin/sbsa/linux/release/bandwidthTest --dtod | grep -B1 32000
[root@gh200 cuda-samples]#
Well, device-to-device is technically doubled because you have a read and a write. But yes.
No. D2D is around 3 TB/s.
There is the compute vs communicate ratio.
For problems like matrix multiplication, it costs ~n^2 to communicate the problem but ~n^3 operations to calculate.
For problems like the dot product, it costs n to communicate but only n operations to calculate.
Compute must be substantially larger than communication cost if you hope to see any benefit. Asymptotic differences obviously help, but a large constant-factor gap can be enough too.
You'd never transfer n data points to perform a log(n) binary search, for example. At that point communication dominates.
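To put placeholder numbers on that ratio (the 12 GB/s link and 4 Tflop/s device below are round guesses, not measurements): the matmul crosses over once n is large enough, the dot product never does.

    #include <cstdio>

    int main() {
        double pcie_bw = 12e9;  // bytes/s over the host<->device link (placeholder)
        double gpu_fp  = 4e12;  // device flop/s (placeholder)

        // Square FP32 matmul: move 3*n^2*4 bytes, do 2*n^3 flops.
        double sizes[] = {256, 1024, 4096, 16384};
        for (double n : sizes) {
            double t_comm = (3.0 * n * n * 4.0) / pcie_bw;
            double t_comp = (2.0 * n * n * n) / gpu_fp;
            printf("matmul n=%6.0f  comm %.4fs  comp %.4fs  %s\n", n, t_comm, t_comp,
                   t_comp > t_comm ? "compute-dominated" : "transfer-dominated");
        }

        // Dot product: move 2*m*4 bytes, do 2*m flops -- a fixed 1 flop per
        // 4 bytes, so transfer dominates at every size.
        double m = 1e8;
        printf("dot    m=%6.0e  comm %.4fs  comp %.6fs\n",
               m, 8.0 * m / pcie_bw, 2.0 * m / gpu_fp);
        return 0;
    }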
For those skimming, and to add to the above: the article uses the GPU to work on data that lives in system memory, since that's where the initial data is and where they want the result, and compares that to a CPU doing the same. The entire bottleneck is the GPU-to-system-memory path.
If you're willing to work entirely out of GPU memory, the GPU will of course be faster even in this scenario.
Assuming your task is larger than kernel launch overhead, of course.
Sure, but once you've justified moving data onto the GPU you don't want to incur the cost of moving the operation output back to the CPU unless you have to. So, for example, you might justify moving data to the GPU for a neural net convolution, but then also execute the following activation function (& subsequent operators) there because that's now where the data is.
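A minimal CUDA sketch of that pattern (not the article's code; the kernels are trivial stand-ins for "convolution" and "activation"): pay for one copy in, chain the kernels on resident device memory, pay for one copy out.

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void scale(float* x, int n, float a) {   // stand-in for the conv
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] *= a;
    }

    __global__ void relu(float* x, int n) {             // the following activation
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] = x[i] > 0.0f ? x[i] : 0.0f;
    }

    int main() {
        const int n = 1 << 20;
        float* h = new float[n];
        for (int i = 0; i < n; ++i) h[i] = (i % 2) ? 1.0f : -1.0f;

        float* d;
        cudaMalloc(&d, n * sizeof(float));
        cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice); // pay once

        int threads = 256, blocks = (n + threads - 1) / threads;
        scale<<<blocks, threads>>>(d, n, 2.0f);  // keep chaining work where
        relu<<<blocks, threads>>>(d, n);         // the data already lives
        cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost); // copy back once

        printf("h[0]=%.1f h[1]=%.1f\n", h[0], h[1]); // expect 0.0 and 2.0
        cudaFree(d);
        delete[] h;
        return 0;
    }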
So a while back I was working on a chaotic renderer. This led me to a really weird set of situations (rough device-selection sketch after the list):
* If the GPU is a non-dedicated, older-style Intel GPU, use the CPU.
* If the GPU is non-dedicated anything else, do anything super-parallel on the GPU, but keep anything the CPU can BRRRRRT through on the CPU, because the memory is shared.
* If the GPU is dedicated, move everything to GPU memory, keep it there, and only pull back small statistics if at all possible.
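A rough sketch of that decision with the CUDA runtime (the policy is just this comment's heuristics restated, not a general rule; the "older Intel iGPU" case ends up in the no-CUDA-device branch here, since CUDA won't see it at all):

    #include <cstdio>
    #include <cuda_runtime.h>

    enum class Plan { CpuOnly, SplitSharedMemory, AllOnGpu };

    Plan pick_plan() {
        int count = 0;
        if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0)
            return Plan::CpuOnly;            // no usable GPU at all

        cudaDeviceProp p{};
        cudaGetDeviceProperties(&p, 0);

        if (p.integrated)
            // Shared memory: run the wide-parallel parts on the GPU, keep the
            // branchy/serial parts on the CPU, no copies needed.
            return Plan::SplitSharedMemory;

        // Dedicated card: push everything into VRAM, keep it resident, and
        // only read back small statistics.
        return Plan::AllOnGpu;
    }

    int main() {
        switch (pick_plan()) {
            case Plan::CpuOnly:           printf("CPU only\n"); break;
            case Plan::SplitSharedMemory: printf("split, shared memory\n"); break;
            case Plan::AllOnGpu:          printf("everything on the GPU\n"); break;
        }
        return 0;
    }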
Comparing multicore wide AVX to CUDA is a bit of an unnecessary nuance for most folks. These comparisons make sense, but miss the forest for the trees:
- Either way, you're writing 'CUDA-style' fine-grained data-parallel code that looks and works very differently from regular multithreaded code. You are now in a different software universe.
- You now also have to think about throughput, latency hiding, etc. Nvidia has been commoditizing throughput-oriented hardware a lot better than others, and while AMD is catching up on some workloads, Nvidia is already advancing. This is where we think about bandwidth between network/disk=>compute unit. My best analogy here, when looking at things like GPU Direct Storage/Network, is CPU systems feel like a long twisty straw, while GPU paths are fat pipes. Big compute typically needs both compute + IO, and hardware specs tell you the bandwidth ceiling.
To a large extent, ideas are cross-pollinating -- CPUs looking more like GPUs, and GPUs getting the flexibility of CPUs -- but either way, you're in a different universe of how code & hardware work than 1990s and early-2000s Intel.
Realistically you should use Numpy or Cupy (or whatever the appropriate/fashionable library is) anyway, because tuning this stuff is a big pain.
So, GPUs have the slight disadvantages that you have to think about data movement and the drivers are a little less convenient to install, but it isn't really a big deal.
I am a big fan of jax for numerical computations these days.
I’ve been seeing lots of posts about it lately. Haven’t had a chance to try it out, though.
Agreed! The bigger shift is switching to data parallel coding styles.
TBH, I'm finding that people underestimate the usefulness of CPUs in both inference and fine-tuning. PEFT with access to 64GB+ RAM and lots of cores can sometimes be cost-effective.
I think engineers learn this quickly in high-scale/performance production environments. Even without hardware backgrounds. SLAs/costs create constraints you need to optimize against after promising the business line these magical models can enable that cool new feature for a million users.
Traditional AI/ML models (including smaller transformers) can definitely be optimized for mass scale/performance on cpu-optimized infrastructure.
For simple operations like the dot product (which also map extremely well to SIMD), yes, the CPU is often better, as there's not much actual "computation" being done. More complex computations where the data does not need to move back and forth between host and device amortize that transfer cost across multiple operations, and the balance can quickly swing in favor of the GPU.
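Since the whole thread is about dot products, here's what "maps extremely well to SIMD" looks like in practice (a sketch with AVX/FMA intrinsics, host-only code, not the article's implementation; a tuned version would use several accumulators and multiple threads):

    #include <cstdio>
    #include <immintrin.h>
    #include <vector>

    // FP32 dot product, 8 lanes per FMA. Build with -mavx2 -mfma (or pass the
    // same flags to the host compiler behind nvcc).
    float dot_avx(const float* a, const float* b, size_t n) {
        __m256 acc = _mm256_setzero_ps();
        size_t i = 0;
        for (; i + 8 <= n; i += 8)
            acc = _mm256_fmadd_ps(_mm256_loadu_ps(a + i),
                                  _mm256_loadu_ps(b + i), acc);
        float lane[8];
        _mm256_storeu_ps(lane, acc);
        float sum = lane[0] + lane[1] + lane[2] + lane[3]
                  + lane[4] + lane[5] + lane[6] + lane[7];
        for (; i < n; ++i) sum += a[i] * b[i];   // scalar tail
        return sum;
    }

    int main() {
        size_t n = 1 << 20;
        std::vector<float> a(n, 0.5f), b(n, 2.0f);
        printf("dot = %.1f (expect %zu)\n", dot_avx(a.data(), b.data(), n), n);
        return 0;
    }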
It's the relative difference of transfer overhead vs degree of compute. For one single operation, sure, the transfer overhead dominates. Add multiple compute steps (operations) however, and experiments will show that the GPU is faster as the transfer cost is fixed.
It would be interesting to see what a unified-memory setup can do, say an Apple M-series chip, since this is the argument for unified memory: zero-copy memory access between CPU and GPU.
Unified memory is what makes using the GPU viable in my use case (mobile). The copy operation is almost always the slowest part. This is especially true for real time work.
Even the integrated Intel HD Graphics would be an interesting comparison.
Simpler research could've shown that there is a physical data transfer cost.
Yeah, classic GPU use cases like deep learning have you transfer the weights for the entire model to your GPU(s) once up front, and after that you only transfer your inputs over.
The use case of transferring ALL data over every time is obviously misusing the GPU.
If you've ever tried running a model that's too large for your GPU, you'll have experienced how slow this is: the model has to be pulled in piece by piece for a single inference run.
Taking a helicopter is not always faster than walking.
Is this surprising or obvious?
An otherwise valid point made using a terrible example.
Terrible post, really. Did they need 5 pages to say that a silly dot-product micro-benchmark that is PCIe-bound loses to a CPU implementation? Why are they even comparing computation vs computation + memory transfer?
Because going to the GPU adds the overhead of the memory transfer, and this is a simple way to demonstrate that. Did you know that already? Congrats, you're in the top 10% of software engineers.
You still don't need the unwieldy blog post to explain that. The SIMD section is unnecessary. Comparing the naive dot product on CPU vs GPU (+round-trip transfer) would have sufficed.
But how will people be impressed if you don't exhaust them with jargon?
Question from someone who doesn't know enough about GPUs: recently a friend mentioned his workstation has 384 cores across 4 processors. This is starting to approach the core counts of earlier GPUs.
Is there a possibility that in the not too distant future that GPUs and CPUs will just converge? Or are the tasks done by GPUs too specialized?
CPUs and GPUs are fundamentally aiming at different vertices of the performance polygon.
CPUs aim to minimize latency (how many cycles have to pass before you can use a result), and do so by way of high clock frequencies, caches, and fancy microarchitectural tricks. This is what you want in most general computation cases where you don't have other work to do whilst you wait.
GPUs instead just context switch to a different thread whilst waiting on a result. They hide their latency by making parallelism as cheap as possible. You can have many more cores running at a lower clock frequency and be more efficient as a result. But this only works if you have enough parallelism to keep everything busy whilst waiting for things to finish on other threads. As it happens that's pretty common in large matrix computations done in machine learning, so they're pretty popular there.
Will they converge? I don't think so - they're fundamentally different design points. But it may well be that they get integrated at a much closer level than current designs, pushing the heterogeneous/dark silicon/accelerator direction to an extreme.
Single GPU threads are up to 20x slower than CPU threads. GPUs get their speed from massive parallelization, SIMD-style "do N things with one instruction", and some specialized hardware (like texture samplers).
If you take a serial algorithm and put it on the GPU, it's easy to verify that a single GPU thread is much slower than a single thread on the CPU. For example, just do a bubble sort on the GPU with a single thread. I'm not even including the time to transfer data or read the result. You'll easily find the CPU is way faster.
The way you get GPU speed is by finding/designing algorithms that are massively parallel. There are lots of them. There are sorting solutions for example.
As an example, 100 cores * 32 execution units per core = 3200 lanes; divide by the ~20x per-thread penalty and you're still ~160x faster than the CPU, if you can figure out a parallel solution. But not every problem can be solved with a parallel solution, and where it can't, the CPU wins.
It seems unlikely GPU threads will be as fast as CPU threads. They get their massive parallelism by being simpler.
That said, who knows what the future holds.
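A quick CUDA sketch of the point being made here (deliberately contrived; absolute timings vary by card): the same reduction launched with one thread versus a whole grid. The single-thread launch is the "a GPU thread is slow" case; the grid is where the speed comes from.

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void sum_single_thread(const float* x, int n, float* out) {
        float s = 0.0f;                 // one thread crawls the whole array
        for (int i = 0; i < n; ++i) s += x[i];
        *out = s;
    }

    __global__ void sum_grid(const float* x, int n, float* out) {
        float s = 0.0f;                 // grid-stride loop, one atomic per thread
        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
             i += gridDim.x * blockDim.x)
            s += x[i];
        atomicAdd(out, s);
    }

    int main() {
        const int n = 1 << 22;
        float *d_x, *d_out;
        cudaMalloc(&d_x, n * sizeof(float));
        cudaMalloc(&d_out, sizeof(float));
        cudaMemset(d_x, 0, n * sizeof(float)); // zeros; we only care about time

        sum_grid<<<256, 256>>>(d_x, n, d_out); // warm-up, outside the timing
        cudaDeviceSynchronize();

        cudaEvent_t t0, t1;
        cudaEventCreate(&t0); cudaEventCreate(&t1);
        float ms;

        cudaMemset(d_out, 0, sizeof(float));
        cudaEventRecord(t0);
        sum_single_thread<<<1, 1>>>(d_x, n, d_out);
        cudaEventRecord(t1); cudaEventSynchronize(t1);
        cudaEventElapsedTime(&ms, t0, t1);
        printf("1 thread : %8.3f ms\n", ms);

        cudaMemset(d_out, 0, sizeof(float));
        cudaEventRecord(t0);
        sum_grid<<<256, 256>>>(d_x, n, d_out);
        cudaEventRecord(t1); cudaEventSynchronize(t1);
        cudaEventElapsedTime(&ms, t0, t1);
        printf("grid     : %8.3f ms\n", ms);

        cudaFree(d_x); cudaFree(d_out);
        return 0;
    }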
This is totally inaccurate. There is nothing 20x slower about a GPU core unless you're nitpicking, in which case you can show those same GPU cores are 20x faster than CPU for other things.
Do you have examples of a single GPU thread outperforming a single CPU thread?
Here's mine
https://jsfiddle.net/jw7a6to9/ bubblesort
https://jsfiddle.net/y1w6s9tj/ taylor series
> in which case you can show those same GPU cores are 20x faster than CPU for other things
Which things? Remember, I wrote single thread, no SIMD, no samplers. It's the parallelism that provides the speed.
Too specialized. You can't use GPUs as general purpose computers. The basic unit of operation is the warp, which is 32 threads operating in lockstep (simplified). If you're not using all 32 threads, then you may as well not be using a GPU.
They're really similar already. You can program a GPU much like you would a CPU (existence proof at https://news.ycombinator.com/item?id=42387267). There's a lot of obfuscation and hoarding of ancient knowledge from the "GPUs are special" enthusiasts, but the magic doesn't survive looking at the things. It's a masked vector ISA.
My dev GPU is a 6800XT. Cheapish gaming card from a little while ago, 16GB RAM on the card. 72 "compute units", which are independent blocks of hardware containing memory ports, floating point units, a register file, etc. Roughly "a core" from the x64 world. Each of those can have up to 64 tasks ready to go, roughly a "hyperthread". It's 300W or so.
There's some noise in the details, e.g. the size of the register file from the perspective of a hyperthread affects how many can be resident on the compute unit ready to run, and the memory hierarchy has extra layers in it. The vector unit is 256 bytes wide as opposed to 64 bytes wide on x64.
But if you wanted to run a web browser entirely on the GPU and were sufficiently bloody-minded, you'd get it done, with the CPU routing keyboard I/O to it and nothing else. If you want a process to sit on the GPU talking to the network and crunching numbers, you don't need the x64 or ARM host to do anything at all.
You'd also need extreme hyperthreading. A GPU can cycle between several warps in the same execution unit (barrel-processor-style), padding out the time per instruction to hide memory latency, while still getting the same throughput. That's counter to the fundamental design of CPUs.
Memory locality depends on your perspective.
The CPU would always be slower if the data originated in GPU memory.
Not necessarily; I'm sure one could construct scenarios where offloading to the CPU makes sense. Think of highly branchy double-precision stuff.
This article is a CPU benchmark and a PCI express bandwidth benchmark. It's masquerading as something else though.
To save some time for people: the title is absolutely incorrect. The GPU is significantly faster for this test, but the author is measuring PCIe bandwidth on a PCIe generation from 10 years ago. If instead they had used PCIe 5.0, the bandwidth would be quadrupled.
How can the multicore AVX implementation do a dot product (for arrays much larger than cache) at 340 GB/s on a system with RAM bandwidth < 50 GB/s?
Answer: it can’t.
The author has updated the post with corrected AVX measurements, with the original ~340 GB/s revised down to 31.7 GB/s. (Thanks CowFreedom)
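For anyone following along, the sanity check behind the question is just bytes over seconds (the vector length and the measured time below are hypothetical):

    #include <cstdio>

    int main() {
        double n       = 1e9;           // floats per vector (hypothetical)
        double bytes   = 2.0 * n * 4.0; // both FP32 arrays must stream from RAM
        double t_meas  = 0.0235;        // hypothetical measured seconds
        printf("implied bandwidth: %.0f GB/s\n", bytes / t_meas / 1e9);
        // If that number exceeds what DRAM can deliver (~50 GB/s on the
        // article's CPU), the benchmark is measuring something else: cached
        // data, a too-short timer, or an optimized-away loop.
        return 0;
    }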
I think the post is a bit disingenuous.
But about bandwidth, matrix multiplications happen mostly in cache and that has a lot more bandwidth than RAM. Blocks of the matrix are loaded to cache (explicitly in CUDA) and used multiple times there.
I'd exploit the better multi-level cache hierarchy in CPUs and make the code NUMA aware. But still I wouldn't bet against a recent GPU card.
> But about bandwidth, matrix multiplications happen mostly in cache and that has a lot more bandwidth than RAM. Blocks of the matrix are loaded to cache (explicitly in CUDA) and used multiple times there.
The post is about dot product, not matrix multiply. Dot product has no data reuse
The GPU is never faster. It's parallel.
The same thing applies to using a GPU to do inference with your weights in system memory. That is why nobody does that.
Even if the GPU took literally no time at all to compute the results there would be workflows where doing it on the CPU was faster.
The GPU is the gas station across town that’s five cents cheaper.
No, it’s the bulk order from China that’s 100x cheaper but has a minimum order quantity of 100000 units and takes 6-8 weeks to get here.
this is simply not true. the post just uses outdated hardware, as pointed out above.
A GH200 will run rings around any CPU.
The gist of the post is that optimizations and interpretations thereof must always be made with respect to the underlying hardware.
In this analogy, using better hardware affects the discount per gallon, but there are still situations where the closer gas station is the better choice.
Good analogy.
L2 cache is 1000x closer than the PCIe bus. Pretty much anything that has to respond in ~realtime to outside events will run better on the CPU. You can use the GPU to visualize the state of the system with some small delay (e.g., video games), but it is not so great at modifying state in an efficient manner - especially when serialization of events & causality are important.
> ~realtime to outside events
Well, to play devil's advocate: for the outside event to affect the CPU, a signal will have to go through the PCIe bus or equivalent.