What Shapes Do Matrix Multiplications Like?

(thonking.ai)

164 points | by skidrow 8 months ago

10 comments

  • jabl 8 months ago

    I recall some optimization advice to choose a leading dimension that is NOT a power of two, in order to avoid cache set associativity conflicts. I guess it's highly hardware dependent (in particular, that advice was for CPUs, not GPUs).

    • adrian_b 8 months ago

      Many Intel CPUs have caches whose number of ways is not a power of two but contains a factor of 3 or 5, corresponding to cache sizes like 2.5 MB, 3 MB, 24 MB or 36 MB.

      Regardless of the cache associativity, the leading dimension must always correspond to a size that is a multiple of the cache line size (64 bytes for most CPUs) and the matrix must be aligned to the cache line size.

      With that condition fulfilled, the total size of a row or of a column (depending on whether the matrix uses C order or Fortran order) may or may not be best as a power of two; that is best determined experimentally on the target CPUs or GPUs.
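The two points above (pad the leading dimension to a whole number of cache lines, and be wary of power-of-two strides) can be illustrated with a rough sketch. It assumes a 64-byte line and a hypothetical 32 KiB, 8-way L1 cache (64 sets) indexed directly by address; the function names are made up for illustration, and real hardware may behave differently.

```python
CACHE_LINE_BYTES = 64  # assumption: line size cited in the thread for most CPUs
L1_SETS = 64           # assumption: 32 KiB, 8-way, 64-byte lines -> 64 sets


def padded_leading_dim(n_elems, elem_bytes, line_bytes=CACHE_LINE_BYTES):
    """Smallest leading dimension >= n_elems whose byte size is a multiple of line_bytes."""
    if line_bytes % elem_bytes != 0:
        raise ValueError("sketch assumes the element size divides the line size")
    elems_per_line = line_bytes // elem_bytes
    # Ceiling division, then scale back up to elements.
    return -(-n_elems // elems_per_line) * elems_per_line


def sets_touched(ld_elems, elem_bytes, rows,
                 num_sets=L1_SETS, line_bytes=CACHE_LINE_BYTES):
    """Count distinct cache sets hit by walking down one column of a row-major matrix."""
    stride_bytes = ld_elems * elem_bytes
    return len({(r * stride_bytes // line_bytes) % num_sets for r in range(rows)})
```

Under these assumptions, a leading dimension of 1024 floats maps 64 consecutive rows of one column to a single set, while padding by one cache line (1040 floats) spreads them over all 64 sets. Whether that matters in practice is exactly what should be measured on the target hardware.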

      • muziq 8 months ago

        DRAM page size, surely? All those cheap CASes?

    • dekhn 8 months ago

      That only makes sense if you know for sure your application is running on a specific architecture. Otherwise, it's a highly specific optimization that is bound to violate another architecture's design, and it would also be extremely challenging to debug.

  • stoniejohnson 8 months ago

    Great post! I appreciate the diagrams.

  • amelius 8 months ago

    TL;DR: make sure your matrix dimensions are divisible by 2 often.

    • chillee 8 months ago

      Well, that'll help with a lot :) But dealing with wave quantization requires dimensions that aren't necessarily a multiple of 2, and often are a multiple of the number of SMs on a GPU (e.g. 132 on an H100)
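A back-of-the-envelope sketch of the wave quantization effect mentioned above. It assumes a hypothetical kernel that computes one 128x128 output tile per SM at a time; the tile size and the `wave_stats` name are illustrative assumptions, and only the 132-SM figure comes from the comment.

```python
import math


def wave_stats(m, n, tile_m=128, tile_n=128, num_sms=132):
    """Return (tiles, waves, utilization) for an m x n output matrix.

    Assumes one output tile per SM per wave, which is a simplification.
    """
    tiles = math.ceil(m / tile_m) * math.ceil(n / tile_n)
    waves = math.ceil(tiles / num_sms)
    # Fraction of SM slots doing useful work across all waves.
    utilization = tiles / (waves * num_sms)
    return tiles, waves, utilization


# 11 x 12 = 132 tiles: exactly one full wave.
full = wave_stats(1408, 1536)
# One extra column pushes this to 11 x 13 = 143 tiles and a mostly idle second wave.
spill = wave_stats(1408, 1537)
```

In this model, growing one dimension from 1536 to 1537 doubles the number of waves while the second wave keeps only 11 of 132 SMs busy, which is why dimensions tuned to the SM count can matter more than powers of two.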

    • ykonstant 8 months ago

      I have always done that instinctively, even when there is no formal demand by the system. Every time I have implemented matmuls, at any level, with any optimization requirement, partitioning into dyadic blocks has always sped things up. So I try to feed the system the nicest numbers I can muster!

      • zusammen 8 months ago

        D&C approaches are applicable to lots of problems and, as you’ve noticed, tend to do well with “round” (in binary) numbers.

    • carlmr 8 months ago

      And if you can't make sure all of them are divisible by 2 often, at least pick the inner dimensions of your matmuls (if possible).