Show HN: Continuous Nvidia CUDA PC Sampling Profiler

(polarsignals.com)

12 points | by gnurizen 4 days ago

4 comments

Honest question: I feel like kernels are usually short enough that you can fully understand their performance during the development cycle, before you even deploy them. If you get different results in production, that suggests to me that you didn’t spend enough time understanding what’s going on earlier. Are there things you genuinely can’t get from this workflow?

SyzygyRhythm 16 minutes ago

Sometimes you have to optimize other people's code. Also, sometimes code behaves unexpectedly depending on the data, say over a certain size threshold. And sometimes it behaves differently on different hardware. You don't always find these things out until production.

killamdiaz 4 days ago

Very cool project.

Curious whether the biggest value has been performance debugging itself or helping developers understand system behavior they otherwise wouldn't have visibility into.

Sometimes the observability layer ends up being more valuable than the optimization layer.

gnurizen 4 days ago

Thanks! I think most performance debugging happens during development. What we're bringing to the table is exposure of system behavior in production, which often diverges from dev because dev workloads tend to be simplistic and synthetic while production workloads have a different shape. So I'd say it's the combination of late-stage performance debugging and production observability that makes this useful.

Stay tuned for a follow-on post where we show how we used this to optimize an FSST decompression kernel for vortex (https://github.com/vortex-data/vortex).