I recall some optimization advice to choose a leading dimension that is NOT a power of two, in order to avoid cache set-associativity conflicts. I guess it's highly hardware-dependent (in particular, that advice was for CPUs, not GPUs).
Many Intel CPUs have caches whose number of ways is not a power of two but contains a factor of 3 or 5, corresponding to cache sizes like 2.5 MB, 3 MB, 24 MB, or 36 MB.
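For a concrete (hypothetical) geometry: a 2.5 MB cache with 64-byte lines and 2048 sets works out to

    2.5 * 1024 * 1024 / (2048 * 64) = 20 ways

and 20 carries exactly that factor of 5.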
Regardless of the cache associativity, the leading dimension should always be a multiple of the cache line size (64 bytes for most CPUs), and the matrix should be aligned to the cache line size.
With that condition fulfilled, the total size of a row or of a column (depending on whether the matrix uses C order or Fortran order) may or may not be optimal as a power of two; that is best tested experimentally on the target CPUs or GPUs.
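A minimal C sketch of that rule (the function names and the double element type are my assumptions, not anything prescribed above): round the leading dimension up to a whole number of 64-byte lines and hand out a cache-line-aligned buffer.

    #include <stdio.h>
    #include <stdlib.h>

    #define CACHE_LINE 64  /* bytes; typical for x86 CPUs */

    /* Round a row of `cols` doubles up to a whole number of cache lines. */
    static size_t padded_ld(size_t cols)
    {
        size_t per_line = CACHE_LINE / sizeof(double);  /* 8 doubles */
        return (cols + per_line - 1) / per_line * per_line;
    }

    /* Allocate a rows x cols matrix with a padded leading dimension and
     * cache-line alignment; caller frees with free(). */
    static double *alloc_matrix(size_t rows, size_t cols, size_t *ld)
    {
        *ld = padded_ld(cols);
        /* aligned_alloc (C11) needs size to be a multiple of the alignment,
         * which the padded leading dimension guarantees. */
        return aligned_alloc(CACHE_LINE, rows * *ld * sizeof(double));
    }

    int main(void)
    {
        size_t ld;
        double *m = alloc_matrix(100, 100, &ld);
        if (!m) return 1;
        printf("cols=100 -> ld=%zu doubles (%zu bytes)\n", ld, ld * sizeof(double));
        free(m);  /* prints: cols=100 -> ld=104 doubles (832 bytes) */
        return 0;
    }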
The DRAM page size, surely? All those cheap CAS accesses?
That only makes sense if you know for sure your application is running on a specific architecture. Otherwise, it's a highly specific optimization that is bound to violate another architecture's design, and it would be extremely challenging to debug.
Great post! I appreciate the diagrams.
TL;DR: make sure your matrix dimensions are divisible by a high power of 2.
Well, that'll help a lot :) But dealing with wave quantization requires dimensions that aren't necessarily a multiple of 2, and are often a multiple of the number of SMs on a GPU (e.g. 132 on an H100).
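Roughly, in C (the tile shape and the one-block-per-SM occupancy are illustrative assumptions, not H100 facts):

    #include <stdio.h>

    int main(void)
    {
        const int sms = 132;            /* SM count, e.g. H100 */
        const int TM = 128, TN = 128;   /* hypothetical output-tile shape */
        const int M = 4096, N = 4096;

        /* One thread block per output tile, one resident block per SM. */
        int tiles = ((M + TM - 1) / TM) * ((N + TN - 1) / TN);
        int full_waves = tiles / sms;
        int tail = tiles % sms;         /* blocks in the final, partial wave */

        printf("%d tiles -> %d full waves + a %d-block tail wave\n",
               tiles, full_waves, tail);
        /* 1024 tiles = 7*132 + 100: the last wave occupies only 100 of 132
         * SMs, even though every dimension here is a power of two. */
        return 0;
    }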
I have always done that instinctively, even when there is no formal demand from the system. Every time I have implemented matmuls, at any level, with any optimization requirement, partitioning into dyadic blocks has always sped things up. So I try to feed the system the nicest numbers I can muster!
D&C approaches are applicable to lots of problems and, as you’ve noticed, tend to do well with “round” (in binary) numbers.
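A toy C sketch of that dyadic partitioning (purely illustrative; it assumes a square power-of-two size and uses a deliberately tiny base case so the recursion is visible):

    #include <stdio.h>
    #include <stddef.h>

    /* Recursive C += A*B on n x n blocks, row-major, leading dimension ld;
     * n is assumed to be a power of two (the dyadic case). */
    static void matmul_rec(const double *A, const double *B, double *C,
                           size_t n, size_t ld)
    {
        if (n <= 2) {  /* tiny base case for the demo; use a real kernel here */
            for (size_t i = 0; i < n; i++)
                for (size_t k = 0; k < n; k++)
                    for (size_t j = 0; j < n; j++)
                        C[i * ld + j] += A[i * ld + k] * B[k * ld + j];
            return;
        }
        size_t h = n / 2;  /* split into four half-size quadrants */
        const double *A11 = A,          *A12 = A + h,
                     *A21 = A + h * ld, *A22 = A + h * ld + h;
        const double *B11 = B,          *B12 = B + h,
                     *B21 = B + h * ld, *B22 = B + h * ld + h;
        double       *C11 = C,          *C12 = C + h,
                     *C21 = C + h * ld, *C22 = C + h * ld + h;

        matmul_rec(A11, B11, C11, h, ld); matmul_rec(A12, B21, C11, h, ld);
        matmul_rec(A11, B12, C12, h, ld); matmul_rec(A12, B22, C12, h, ld);
        matmul_rec(A21, B11, C21, h, ld); matmul_rec(A22, B21, C21, h, ld);
        matmul_rec(A21, B12, C22, h, ld); matmul_rec(A22, B22, C22, h, ld);
    }

    int main(void)
    {
        double A[16], B[16], C[16] = {0};
        for (int i = 0; i < 16; i++) { A[i] = 1.0; B[i] = 1.0; }
        matmul_rec(A, B, C, 4, 4);
        printf("%g\n", C[0]);  /* 4: all-ones 4x4 product has 4 everywhere */
        return 0;
    }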
And if you can't make all of them divisible by a high power of 2, at least make the inner dimensions of your matmuls nice (if possible).
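If it's useful, the usual trick is just rounding up and zero-padding; a hypothetical helper in C (zero-padding the inner dimension leaves the product unchanged, since the extra columns of A and rows of B contribute nothing):

    #include <stdio.h>

    /* Round dim up to the next multiple of grain (e.g. 64 or 128). */
    static int round_up(int dim, int grain)
    {
        return (dim + grain - 1) / grain * grain;
    }

    int main(void)
    {
        /* An awkward inner dimension of 1000, padded for the kernel. */
        printf("%d -> %d\n", 1000, round_up(1000, 64));  /* 1000 -> 1024 */
        return 0;
    }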