Decoupling Compute and Memory for Async GPUs

7 points | by yiyingzhang 14 hours ago

2 comments

bobbyzhu2008 13 hours ago

67% less kernel code is the more interesting number here — Hopper's async capabilities have been underutilized largely because the programming model is painful. Curious how it handles cases where compute and memory phases aren't cleanly separable.

jhap 11 hours ago

This seems like a better version of CUDA, for Hopper GPUs?