I recall some optimization advice to choose a leading dimension that is NOT a power of two, in order to avoid cache set-associativity conflicts. I guess it's highly hardware-dependent (in particular, that advice was for CPUs, not GPUs).
Many Intel CPUs have caches whose number of ways is not a power of two but contains a factor of 3 or 5, corresponding to cache sizes like 2.5 MB, 3 MB, 24 MB, or 36 MB.
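For a concrete (hypothetical) geometry: a 2.5 MB cache with 64-byte lines and 2048 sets works out to

    2.5 * 1024 * 1024 / (2048 * 64) = 20 ways

and 20 carries exactly that factor of 5.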
Regardless of the cache associativity, the leading dimension should always be a multiple of the cache line size (64 bytes for most CPUs), and the matrix should be aligned to the cache line size.
With that condition fulfilled, the total size of a row or of a column (depending on whether the matrix uses C order or Fortran order) may or may not be optimal as a power of two; that is best tested experimentally on the target CPUs or GPUs.
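A minimal C sketch of that rule (the function names and the double element type are my assumptions, not anything prescribed above): round the leading dimension up to a whole number of 64-byte lines and hand out a cache-line-aligned buffer.

    #include <stdio.h>
    #include <stdlib.h>

    #define CACHE_LINE 64  /* bytes; typical for x86 CPUs */

    /* Round a row of `cols` doubles up to a whole number of cache lines. */
    static size_t padded_ld(size_t cols)
    {
        size_t per_line = CACHE_LINE / sizeof(double);  /* 8 doubles */
        return (cols + per_line - 1) / per_line * per_line;
    }

    /* Allocate a rows x cols matrix with a padded leading dimension and
     * cache-line alignment; caller frees with free(). */
    static double *alloc_matrix(size_t rows, size_t cols, size_t *ld)
    {
        *ld = padded_ld(cols);
        /* aligned_alloc (C11) needs size to be a multiple of the alignment,
         * which the padded leading dimension guarantees. */
        return aligned_alloc(CACHE_LINE, rows * *ld * sizeof(double));
    }

    int main(void)
    {
        size_t ld;
        double *m = alloc_matrix(100, 100, &ld);
        if (!m) return 1;
        printf("cols=100 -> ld=%zu doubles (%zu bytes)\n", ld, ld * sizeof(double));
        free(m);  /* prints: cols=100 -> ld=104 doubles (832 bytes) */
        return 0;
    }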
The DRAM page size, surely? All those cheap CAS accesses?
That only makes sense if you know for sure your application is running on a specific architecture. Otherwise, it's a highly specific optimization that is bound to violate another architecture's design, and it would be extremely challenging to debug.
Great post! I appreciate the diagrams.
TL;DR: make sure your matrix dimensions are divisible by a high power of 2.
Well, that'll help a lot :) But dealing with wave quantization requires dimensions that aren't necessarily a multiple of 2, and are often a multiple of the number of SMs on a GPU (e.g. 132 on an H100).
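Roughly, in C (the tile shape and the one-block-per-SM occupancy are illustrative assumptions, not H100 facts):

    #include <stdio.h>

    int main(void)
    {
        const int sms = 132;            /* SM count, e.g. H100 */
        const int TM = 128, TN = 128;   /* hypothetical output-tile shape */
        const int M = 4096, N = 4096;

        /* One thread block per output tile, one resident block per SM. */
        int tiles = ((M + TM - 1) / TM) * ((N + TN - 1) / TN);
        int full_waves = tiles / sms;
        int tail = tiles % sms;         /* blocks in the final, partial wave */

        printf("%d tiles -> %d full waves + a %d-block tail wave\n",
               tiles, full_waves, tail);
        /* 1024 tiles = 7*132 + 100: the last wave occupies only 100 of 132
         * SMs, even though every dimension here is a power of two. */
        return 0;
    }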
I have always done that instinctively, even when there is no formal demand from the system. Every time I have implemented matmuls, at any level, with any optimization requirement, partitioning into dyadic blocks has always sped things up. So I try to feed the system the nicest numbers I can muster!
D&C approaches are applicable to lots of problems and, as you’ve noticed, tend to do well with “round” (in binary) numbers.
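A toy C sketch of that dyadic partitioning (purely illustrative; it assumes a square power-of-two size and uses a deliberately tiny base case so the recursion is visible):

    #include <stdio.h>
    #include <stddef.h>

    /* Recursive C += A*B on n x n blocks, row-major, leading dimension ld;
     * n is assumed to be a power of two (the dyadic case). */
    static void matmul_rec(const double *A, const double *B, double *C,
                           size_t n, size_t ld)
    {
        if (n <= 2) {  /* tiny base case for the demo; use a real kernel here */
            for (size_t i = 0; i < n; i++)
                for (size_t k = 0; k < n; k++)
                    for (size_t j = 0; j < n; j++)
                        C[i * ld + j] += A[i * ld + k] * B[k * ld + j];
            return;
        }
        size_t h = n / 2;  /* split into four half-size quadrants */
        const double *A11 = A,          *A12 = A + h,
                     *A21 = A + h * ld, *A22 = A + h * ld + h;
        const double *B11 = B,          *B12 = B + h,
                     *B21 = B + h * ld, *B22 = B + h * ld + h;
        double       *C11 = C,          *C12 = C + h,
                     *C21 = C + h * ld, *C22 = C + h * ld + h;

        matmul_rec(A11, B11, C11, h, ld); matmul_rec(A12, B21, C11, h, ld);
        matmul_rec(A11, B12, C12, h, ld); matmul_rec(A12, B22, C12, h, ld);
        matmul_rec(A21, B11, C21, h, ld); matmul_rec(A22, B21, C21, h, ld);
        matmul_rec(A21, B12, C22, h, ld); matmul_rec(A22, B22, C22, h, ld);
    }

    int main(void)
    {
        double A[16], B[16], C[16] = {0};
        for (int i = 0; i < 16; i++) { A[i] = 1.0; B[i] = 1.0; }
        matmul_rec(A, B, C, 4, 4);
        printf("%g\n", C[0]);  /* 4: all-ones 4x4 product has 4 everywhere */
        return 0;
    }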
And if you can't make all of them divisible by a high power of 2, at least make the inner dimensions of your matmuls nice (if possible).
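If it's useful, the usual trick is just rounding up and zero-padding; a hypothetical helper in C (zero-padding the inner dimension leaves the product unchanged, since the extra columns of A and rows of B contribute nothing):

    #include <stdio.h>

    /* Round dim up to the next multiple of grain (e.g. 64 or 128). */
    static int round_up(int dim, int grain)
    {
        return (dim + grain - 1) / grain * grain;
    }

    int main(void)
    {
        /* An awkward inner dimension of 1000, padded for the kernel. */
        printf("%d -> %d\n", 1000, round_up(1000, 64));  /* 1000 -> 1024 */
        return 0;
    }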