GPU Prefix Sums: A nearly complete collection

(github.com)

63 points | by coffeeaddict1 8 hours ago ago

13 comments

This looks amazing, I've been shopping for an implementation of this I could play around with for a while now

They mention promising results on Apple Silicon GPUs and even cite the contributions from Vello, but I don't see a Metal implementation in there and the benchmark only shows results from an RTX 2080. Is it safe to assume that they're referring to the WGPU version when talking about M-series chips?

coffeeaddict1 6 hours ago

Related paper by the authors: https://dl.acm.org/doi/10.1145/3694906.3743326

[-]

dang an hour ago

We'll put that link in the top text too. Thanks!

m-schuetz 4 hours ago

That and https://github.com/b0nes164/GPUSorting have been a tremendous help for me, since CUB does not nicely work with the Cuda Driver Api. The author is doing amazing work.

genpfault 7 hours ago

https://en.wikipedia.org/wiki/Prefix_sum#Applications

[-]

almostgotcaught 7 hours ago

this is missing the most important one (in today's world): extracting non-zero elements from a sparse vector/matrix

https://developer.nvidia.com/gpugems/gpugems3/part-vi-gpu-co...

[-]

merope14 6 hours ago

Not even close. The most important application (in today's world) is radix sort.

[-]

WJW 4 hours ago

What specific application do you have in mind that radix sort is more important than matrix multiplication?

[-]

m-schuetz 3 hours ago

Is that relevant for 4x4 multiplications? Because at least for me, radix sort is way more important than multiplying matrices beyond 4x4. E.g. for Gaussian Splatting.

otherjason 3 hours ago

I think they were trying to say “radix sort is a more important application of prefix sum than extraction of values from a sparse matrix/vector is.”

[-]

WJW 2 hours ago

I understand what GP meant, but extraction of values from a sparse matrix is an essential operation of multiplying two sparse matrices. Sparse matmult in turn is an absolutely fundamental operation in everything from weather forecasting to logistics planning to electric grid control to training LLMs. Radix sort on the other hand is very nice but (as far as I know) not nearly used as widely. Matrix multiplication is just super fundamental to the modern world.

I would love to be enlightened about some real-world applications of radix sort I may have missed though, since it's a cool algorithm. Hence my question above.

woadwarrior01 3 hours ago

Top K sampling comes to mind, although it's nowhere nearly as important as matmult.

[-]

almostgotcaught 3 hours ago

ranking models benefit from gpu impls of sort but yup they're not nearly as common/important as spmm/spmv