The GPU is not always faster

(cowfreedom.de)

107 points | by CowFreedom 12 hours ago

60 comments

  • ssivark 10 hours ago

    A good mental model is to compare the number of floats being processed -vs- the number of primitive computations. Matrix multiplication has n^3 computation with n^2 data. Multiplication of large matrices is therefore special in the potential for "data re-use" (each float is used against all the columns or rows of the other matrix) -- so systems are designed to have a much higher flops throughput than memory bandwidth. A dot product is at the other extreme, where each float is used only once (loosely).

    Roofline plots [1] are a framework for visualizing system design from this perspective.

    [1] https://en.wikipedia.org/wiki/Roofline_model
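
    Back-of-the-envelope, with made-up machine numbers just to show the shape of it:

        #include <algorithm>
        #include <cstdio>

        int main() {
            // Hypothetical machine: 10 TFLOP/s peak compute, 500 GB/s memory bandwidth.
            const double peak_flops = 10e12;
            const double bandwidth  = 500e9;  // bytes/s

            // Dot product of n doubles: 2n flops over ~2n*8 bytes -> ~0.125 flop/byte.
            const double ai_dot = 2.0 / (2.0 * 8.0);

            // n x n matmul: 2n^3 flops over ~3n^2*8 bytes -> intensity grows like n/12.
            const double n = 4096.0;
            const double ai_mm = (2.0 * n * n * n) / (3.0 * n * n * 8.0);

            // Roofline: attainable flop/s = min(peak, bandwidth * arithmetic intensity).
            std::printf("dot:    %.2e flop/s (bandwidth bound)\n", std::min(peak_flops, bandwidth * ai_dot));
            std::printf("matmul: %.2e flop/s (compute bound)\n", std::min(peak_flops, bandwidth * ai_mm));
        }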

    • hansvm 8 hours ago

      That property is the same reason you don't incur substantial overhead doing large matrix multiplications sharded over disks or many machines. You apply the same chunking strategies used to optimally use L1/L2/L3 caches, just instead at the level of numa nodes, physical disks, machines, and clusters. So long as each "cache" is big enough for the N^3/N^2 term to dominate communication overhead (especially if that communication can happen concurrently), the networked result is about as fast as the individual machines running at their max FLOPs for some smaller problem.
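
      A minimal sketch of the chunking idea; the same loop nest applies whether a "block" lives in L2, on another disk, or on another machine:

          #include <algorithm>

          // c += a * b for n x n row-major matrices, processed in blk x blk blocks.
          // Each pair of blocks moves O(blk^2) data and is then reused for
          // O(blk^3) multiply-adds, which is what hides the communication cost.
          void blocked_matmul(const float* a, const float* b, float* c, int n, int blk) {
              for (int i0 = 0; i0 < n; i0 += blk)
                  for (int j0 = 0; j0 < n; j0 += blk)
                      for (int k0 = 0; k0 < n; k0 += blk)
                          for (int i = i0; i < std::min(i0 + blk, n); ++i)
                              for (int k = k0; k < std::min(k0 + blk, n); ++k)
                                  for (int j = j0; j < std::min(j0 + blk, n); ++j)
                                      c[i * n + j] += a[i * n + k] * b[k * n + j];
          }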

    • sigmoid10 10 hours ago

      This is amplified even more by the fact that only the trivial implementation of matmul is O(n^3), whereas efficient ones (e.g. BLAS) use things like the Strassen algorithm. You can also speed it up significantly with cache-aware approaches when retrieving rows and columns. In practice there is a huge amount of theory behind this, far beyond the scope of anyone who isn't an actual researcher.

      • rnrn 10 hours ago

        Is there actually a BLAS implementation that uses Strassen?

        I don’t think it’s accurate that only trivial implementations use the direct O(n^3) algorithm. AFAIK high-performance BLAS implementations just use highly optimized versions of it.

        • leecarraher 6 hours ago

          BLAS is just the library definition, not the implementation, so BLAS implementations could implement GEMM any way they want. In practice, though, the triple-loop (n^3) method is the most common, despite Strassen's and the more numerically stable Winograd method being well known and available for decades. As with most things involving real computing hardware, memory access patterns and locality tend to matter more for performance than operation counts.
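
          For concreteness, the triple loop in question, written with the loop order that keeps the accesses contiguous (a sketch, not any particular BLAS):

              // Naive O(n^3) GEMM, c += a * b, row-major. The i-k-j ordering streams
              // through rows of b and c, which already beats the textbook i-j-k order
              // before any blocking or vectorization, purely due to access patterns.
              void gemm_naive(const float* a, const float* b, float* c, int n) {
                  for (int i = 0; i < n; ++i)
                      for (int k = 0; k < n; ++k) {
                          const float aik = a[i * n + k];
                          for (int j = 0; j < n; ++j)
                              c[i * n + j] += aik * b[k * n + j];
                      }
              }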

        • jcranmer 8 hours ago

          AIUI, Strassen gets used moderately commonly with non-floating-point datatypes, where numerical stability is less of a concern and multiplications are more useful to minimize than memory traffic. But from what I can tell, every floating-point BLAS library eschews Strassen, despite a steady trickle of papers saying "hey, there might be some small wins if we go to Strassen!"

          • chillee 6 hours ago

            The big issue with Strassen isn't performance - it's numerical stability.

        • chessgecko 9 hours ago

          I remember reading that it’s too hard to get good memory bandwidth/L2 utilization with the fancy algorithms: you need to read contiguous blocks and be able to reuse them repeatedly. But I also haven’t looked at the GPU BLAS implementations directly.

      • ryao 3 hours ago

        Not that long ago, I tried using the FFT to do matrix multiplication since it was supposed to be asymptotically faster. It turns out that the constant factor is huge compared to the O(n^3) grade-school algorithm that BLAS optimizes via tiling and other tricks. Even if it looks expensive on paper, the cubic algorithm is fast.

        I just wish I understood the tricks done to make it so fast so I could implement my own for variants that have no pre-existing BLAS implementations. The best BLAS implementations are all closed source, sadly.

      • sestep 10 hours ago

        Really? I thought that no practical linear algebra library used the Strassen algorithm. Can you provide a source?

        • CowFreedom 10 hours ago

          The BLAS GEMM routines I have seen use normal blocked algorithms.

          • bee_rider 7 hours ago

            I don’t know what the real convention is, but IMO BLAS GEMM “is” the O(n^3) algorithm (blocked is fine, of course), in the sense that something like Strassen has stability implications and isn’t appropriate for lots of sizes. Just swapping it in would be nuts, haha.

  • alecco 9 hours ago

    > Each version is severely memory bandwidth bottlenecked, the CUDA version suffers the most with its practical 11.8 GB/s device-to-host bandwidth due to its PCI-Express 3.0 x16 interface.

    PCIe 3.0? What?

    https://cowfreedom.de/#appendix/computer_specs/

    > GeForce GTX 1050 Ti with Max-Q Design (PCIe 3.0 x16) (2016)

    > Intel Core i5-8300H (2020)

    This is a low-priced, 8-year-old GPU and a 4-year-old CPU. And he seems to be including the time to load the data onto the GPU. Newer cards have wide PCIe 5.0 or some faster interconnect, like Nvidia Grace-Hopper.

    Also, he is comparing his own CUDA implementation. He should use one of the many available in CUBLAS/CUTLASS. Making a good CUDA GEMM is a very difficult art, and very hardware-specific.
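
    For reference, calling into cuBLAS is only a handful of lines (rough sketch, no error checking; d_A, d_B, d_C are assumed to already be filled device buffers):

        #include <cublas_v2.h>

        // C = A * B on the device, single precision, column-major as cuBLAS expects.
        // d_A, d_B, d_C are device pointers to n*n floats, already populated.
        void gemm_cublas(const float* d_A, const float* d_B, float* d_C, int n) {
            cublasHandle_t handle;
            cublasCreate(&handle);
            const float alpha = 1.0f, beta = 0.0f;
            cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                        n, n, n,
                        &alpha, d_A, n, d_B, n,
                        &beta, d_C, n);
            cublasDestroy(handle);
        }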

    • jsheard 9 hours ago

      > Newer cards have wide PCIe 5.0 or some faster interconnect, like Nvidia Grace-Hopper.

      There aren't any (edit: consumer) GPUs with PCIe5 yet, though they probably aren't far off. Plenty already have PCIe4 though.

      • alecco 9 hours ago

        Consumer cards are PCIe 4.0 x16. H100 PCIe version is PCIe 5.0 https://en.wikipedia.org/wiki/Hopper_(microarchitecture). And it's been out 2 years already.

        • yvdriess 6 hours ago

          For consumer GPUs, PCIe 4.0 x16 has plenty of bandwidth headroom. The full-sized x16 slot is more for stability reasons. Some vendors even put a couple of M.2 slots on PCIe 4/5 GPU boards to recover the unused PCIe lanes.

        • jsheard 8 hours ago

          TIL, I missed Hopper already having it. I assume the RTX 5000 series will bring it to consumers.

        • throwway120385 8 hours ago

          So everyone is supposed to do all of their testing on H100's?

          • alecco 7 hours ago

            The 4090 (2022), with PCIe 4.0 x16, is quite decent. The major limit is memory capacity, not bandwidth. And the 3090 (2020) is also PCIe 4.0 x16, and used cards are a bargain. You can hook them up with NVLink.

            Nvidia is holding back new releases, but the current hardware still has legs as matrix implementations improve, like FlashAttention delivering a significant improvement every 6 months.

            Nvidia could make consumer chips with a combined CPU-GPU. I guess they are too busy making money from the big cloud providers. Maybe somebody else will pick up the idea. Apple is already doing something like it, even on laptops.

          • _zoltan_ 7 hours ago

            get a GH100 on lambda and behold you have 900GB/s between CPU memory and GPU, and forget PCIe.

        • touisteur 4 hours ago

          Well, let me tell you about the pretty high-end and expensive L40 that shipped with PCIe 4.0, to my utter dismay and disgust. Only the H100 had 5.0, although I could already saturate 4.0 (and 5.0) with Mellanox NICs and GPU/StorageDirect. Waiting for the next one to maybe get 5.0.

      • goosedragons 8 hours ago

        Supposedly the Intel B580, releasing Friday, will use PCIe 5.0 x8.

      • _zoltan_ 7 hours ago

        That’s not true: the H100 NVL is PCIe Gen5 x16.

    • _zoltan_ 7 hours ago

      GH100 can do 900GB/s HtoD.

      • alecco 7 hours ago

        And both the 3090 and 4090 can do 32 GB/s host-to-device. Not far from CPU-RAM bandwidth. You only load the matrix once. The bandwidth for the matmul itself is orders of magnitude larger, and it all happens on the device, mostly in cache.

      • jing 2 hours ago

        No it can’t. That’s device-to-device.

  • dragontamer 10 hours ago

    There is the compute vs communicate ratio.

    For problems like matrix multiplication, it costs O(n^2) to communicate the problem but O(n^3) operations to calculate.

    For problems like the dot product, it costs O(n) to communicate but only O(n) operations to calculate.

    Compute must be substantially larger than communication cost if you hope to see any benefit. Asymptotic differences obviously help, but even a linear gap can be enough.

    You'd never transfer n elements to perform an O(log n) binary search, for example. At that point communication dominates.
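
    A back-of-the-envelope version for the dot product (the ~12 GB/s matches the article's measured PCIe bandwidth; the GPU throughput figure is made up for illustration):

        #include <cstdio>

        int main() {
            const double pcie_bw   = 12e9;   // bytes/s, roughly the article's PCIe 3.0 x16 figure
            const double gpu_flops = 2e12;   // flop/s, illustrative sustained throughput

            const double n = 1e8;            // elements per vector

            // Dot product: move 2n doubles over the bus, do 2n flops on the device.
            const double transfer_s = (2.0 * n * 8.0) / pcie_bw;
            const double compute_s  = (2.0 * n) / gpu_flops;

            // Transfer exceeds compute by orders of magnitude, so the bus decides.
            std::printf("transfer %.3f s vs compute %.6f s\n", transfer_s, compute_s);
        }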

    • AnotherGoodName 10 hours ago

      For those skimming, and to add to the above: the article uses the GPU to work on data in system memory, since that's where the initial data lives and where they want the result, and compares that to a CPU doing the same. The entire bottleneck is the GPU-to-system-memory transfer.

      If you’re willing to work entirely in GPU memory, the GPU will of course be faster, even in this scenario.

    • HarHarVeryFunny 8 hours ago

      Sure, but once you've justified moving data onto the GPU you don't want to incur the cost of moving the operation output back to the CPU unless you have to. So, for example, you might justify moving data to the GPU for a neural net convolution, but then also execute the following activation function (& subsequent operators) there because that's now where the data is.
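
      Roughly this shape, in plain CUDA (the kernels are stand-ins, not real conv code; the point is the single round trip over the bus):

          // Placeholder kernels standing in for "convolution" and activation.
          __global__ void convish(float* x, int n) {
              int i = blockIdx.x * blockDim.x + threadIdx.x;
              if (i < n) x[i] = 0.5f * x[i] + 1.0f;
          }
          __global__ void relu(float* x, int n) {
              int i = blockIdx.x * blockDim.x + threadIdx.x;
              if (i < n) x[i] = fmaxf(x[i], 0.0f);
          }

          void forward(float* host, int n) {
              float* d;
              cudaMalloc(&d, n * sizeof(float));
              cudaMemcpy(d, host, n * sizeof(float), cudaMemcpyHostToDevice);  // pay the bus once
              int threads = 256, blocks = (n + threads - 1) / threads;
              convish<<<blocks, threads>>>(d, n);  // subsequent ops run where the data already is
              relu<<<blocks, threads>>>(d, n);
              cudaMemcpy(host, d, n * sizeof(float), cudaMemcpyDeviceToHost);  // and once on the way out
              cudaFree(d);
          }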

  • lmeyerov 10 hours ago

    Comparing multicore, wide-AVX code to CUDA is a bit of an unnecessary nuance for most folks. These comparisons make sense, but miss the forest for the trees:

    - Either way, you're writing 'CUDA-style' fine-grained data-parallel code that looks and works very differently from regular multithreaded code. You are now in a different software universe.

    - You now also have to think about throughput, latency hiding, etc. Nvidia has been commoditizing throughput-oriented hardware a lot better than others, and while AMD is catching up on some workloads, Nvidia is already advancing. This is where we think about bandwidth from network/disk to the compute unit. My best analogy here, when looking at things like GPU Direct Storage/Network, is that CPU systems feel like a long twisty straw, while GPU paths are fat pipes. Big compute typically needs both compute and IO, and hardware specs tell you the bandwidth ceiling.

    To a large extent, ideas are cross-pollinating -- CPUs looking more like GPUs, and GPUs getting the flexibility of CPUs -- but either way, you're in a different universe of how code & hardware work than 1990s and early-2000s Intel.

    • bee_rider 8 hours ago

      Realistically you should use NumPy or CuPy (or whatever the appropriate/fashionable library is) anyway, because tuning this stuff is a big pain.

      So, GPUs have the slight disadvantages that you have to think about data movement and that the drivers are a little less convenient to install, but it isn’t really a big deal.

      • lmeyerov 2 hours ago

        Agreed! The bigger shift is switching to data parallel coding styles.

      • PartiallyTyped an hour ago

        I am a big fan of jax for numerical computations these days.

  • glitchc 10 hours ago

    It's the relative difference between transfer overhead and degree of compute. For one single operation, sure, the transfer overhead dominates. Add multiple compute steps (operations), however, and experiments will show that the GPU is faster, since the transfer cost is fixed.

  • shihab 29 minutes ago

    An otherwise valid point made using a terrible example.

  • tylermw 10 hours ago

    For simple operations like the dot product (which also map extremely well to SIMD), yes, the CPU is often better, as there's not much actual "computation" being done. More complex computations, where the data does not need to transfer between host and device for each step, amortize that transfer cost across multiple operations, and the balance can quickly tip in favor of the GPU.

  • jsight 9 hours ago

    TBH, I'm finding that people underestimate the usefulness of CPU in both inference and fine tuning. PEFT with access to 64GB+ RAM and lots of cores can sometimes be cost effective.

    • ramoz 5 hours ago

      I think engineers learn this quickly in high-scale/performance production environments, even without hardware backgrounds. SLAs and costs create constraints you need to optimize against after promising the business line that these magical models can enable that cool new feature for a million users.

      Traditional AI/ML models (including smaller transformers) can definitely be optimized for mass scale/performance on cpu-optimized infrastructure.

  • hangonhn 9 hours ago

    Question from someone who doesn't know enough about GPUs: recently a friend mentioned his workstation has 384 cores across 4 processors. This is starting to approach the core counts of earlier GPUs.

    Is there a possibility that in the not too distant future that GPUs and CPUs will just converge? Or are the tasks done by GPUs too specialized?

    • JonChesterfield 8 hours ago

      They're really similar already. You can program a GPU much like you would a CPU (existence proof at https://news.ycombinator.com/item?id=42387267). There's a lot of obfuscation and hoarding of the ancient knowledge from the "GPUs are special" enthusiasts, but the magic doesn't survive looking at the things. It's a masked vector ISA.

      My dev GPU is a 6800XT, a cheapish gaming card from a little while ago with 16GB of RAM on the card. It has 72 "compute units", which are independent blocks of hardware containing memory ports, a floating point unit, a register file, etc. Roughly "a core" from the x64 world. Each of those can have up to 64 tasks ready to go, roughly a "hyperthread". It's 300W or so.

      There's some noise in the details, e.g. how much of the register file each hyperthread uses affects how many can be resident on the compute unit ready to run, and the memory hierarchy has extra layers in it. The vector unit is 256 bytes wide as opposed to 64 bytes wide on x64.

      But if you wanted to run a web browser entirely on the GPU and were sufficiently bloody-minded, you'd get it done, with the CPU routing keyboard I/O to it and nothing else. If you want a process to sit on the GPU talking to the network and crunching numbers, you don't need the x64 or ARM host to do anything at all.

    • krapht 8 hours ago

      Too specialized. You can't use GPUs as general purpose computers. The basic unit of operation is the warp, which is 32 threads operating in lockstep (simplified). If you're not using all 32 threads, then you may as well not be using a GPU.
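
      For a sense of what the lockstep buys you, here's the standard warp-level reduction where all 32 lanes cooperate through register shuffles (sketch using stock CUDA intrinsics):

          // Sum 32 values, one per lane, without touching shared memory: each
          // lane repeatedly adds the register of the lane 'offset' positions away.
          __device__ float warp_sum(float v) {
              for (int offset = 16; offset > 0; offset >>= 1)
                  v += __shfl_down_sync(0xffffffffu, v, offset);
              return v;  // lane 0 ends up holding the full warp's total
          }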

    • immibis 8 hours ago

      You'd also need extreme hyperthreading. A GPU can cycle between several warps in the same execution unit (barrel-processor-style), padding out the time per instruction to hide memory latency, while still getting the same throughput. That's counter to the fundamental design of CPUs.

  • ramoz 10 hours ago

    Simpler research could've shown that there is a physical data transfer cost.

    • hershey890 2 hours ago

      Yeah, classic GPU use cases like deep learning have you transfer the weights for the entire model to your GPU(s) at the start of inference, and after that you only transfer your input over.

      The use case of transferring ALL data over every time is obviously misusing the GPU.

      If you’ve ever tried running a model that’s too large for your GPU, you will have experienced how slow this is when you have to pull the model in piece by piece for a single inference run.

  • ryao 3 hours ago

    The same thing applies to using a GPU to do inference with your weights in system memory. That is why nobody does that.

  • rnrn 9 hours ago

    How can the multicore AVX implementation do a dot product (for arrays much larger than cache) at 340 GB/s on a system with RAM bandwidth < 50 GB/s?

    • alecco 9 hours ago

      I think the post is a bit disingenuous.

      But regarding bandwidth: matrix multiplications happen mostly in cache, which has a lot more bandwidth than RAM. Blocks of the matrix are loaded into cache (explicitly, in CUDA's case) and used multiple times there.

      I'd exploit the better multi-level cache hierarchy of CPUs and make the code NUMA-aware. But I still wouldn't bet against a recent GPU card.
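
      The "explicitly in CUDA" part looks roughly like this: each thread block stages tiles of A and B in shared memory and reuses every element TILE times before touching DRAM again (simplified sketch; assumes n is a multiple of TILE):

          #define TILE 16

          // C = A * B for n x n row-major matrices, n a multiple of TILE.
          __global__ void matmul_tiled(const float* A, const float* B, float* C, int n) {
              __shared__ float As[TILE][TILE];
              __shared__ float Bs[TILE][TILE];
              int row = blockIdx.y * TILE + threadIdx.y;
              int col = blockIdx.x * TILE + threadIdx.x;
              float acc = 0.0f;
              for (int t = 0; t < n / TILE; ++t) {
                  // One load per element into fast on-chip shared memory...
                  As[threadIdx.y][threadIdx.x] = A[row * n + t * TILE + threadIdx.x];
                  Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
                  __syncthreads();
                  // ...then each staged element is reused TILE times from there.
                  for (int k = 0; k < TILE; ++k)
                      acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
                  __syncthreads();
              }
              C[row * n + col] = acc;
          }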

  • ltbarcly3 7 hours ago

    Taking a helicopter is not always faster than walking.

    Is this surprising or obvious?

  • SigmundA 10 hours ago

    Would be interesting to see what a unified memory setup can do, say an Apple M-series, since this is the argument for unified memory: zero-copy memory access between CPU and GPU.

    • mirsadm 8 hours ago

      Unified memory is what makes using the GPU viable in my use case (mobile). The copy operation is almost always the slowest part. This is especially true for real time work.

    • CowFreedom 10 hours ago

      Even the integrated Intel HD Graphics would be an interesting comparison.

  • moomin 10 hours ago

    Even if the GPU took literally no time at all to compute the results there would be workflows where doing it on the CPU was faster.

    • bob1029 2 hours ago

      L2 cache is 1000x closer than the PCIe bus. Pretty much anything that has to respond in ~realtime to outside events will run better on the CPU. You can use the GPU to visualize the state of the system with some small delay (e.g., video games), but it is not so great at modifying state in an efficient manner - especially when serialization of events & causality are important.

    • Waterluvian 10 hours ago

      The GPU is the gas station across town that’s five cents cheaper.

      • Legend2440 3 hours ago

        No, it’s the bulk order from China that’s 100x cheaper but has a minimum order quantity of 100000 units and takes 6-8 weeks to get here.

      • _zoltan_ 7 hours ago

        This is simply not true; the post just uses outdated hardware, as pointed out above.

        A GH200 will run circles around any CPU.

        • CowFreedom 5 hours ago

          The gist of the post is that optimizations and interpretations thereof must always be made with respect to the underlying hardware.

        • Dylan16807 3 hours ago

          In this analogy, using better hardware affects the discount per gallon, but there are still situations where the closer gas station is the better choice.

      • adamc 10 hours ago

        Good analogy.