ThunderKittens: Simple, Fast, and Adorable AI Kernels

(hazyresearch.stanford.edu)

40 points | by lnyan 5 hours ago

5 comments

  • pama 29 minutes ago

    I don't want to use the Platform Formerly Known as Twitter, but does anyone have a way to get the link to their livestream tomorrow?

  • danielhanchen an hour ago

    This is super cool! Especially matrix mult getting similar or better perf than cuBLAS! If anyone is interested in other kernels like swiglu, geglu, and RMS layernorm, I coded some at https://github.com/unslothai/unsloth/tree/main/unsloth/kerne...
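
    For reference, here is roughly the math those kernels compute, as a plain PyTorch sketch (the fused Triton versions in the repo are the fast path; the function names here are illustrative, not Unsloth's actual API):

        import torch
        import torch.nn.functional as F

        def swiglu(gate, up):
            # SwiGLU: silu(gate) * up, the elementwise core of a LLaMA-style MLP
            return F.silu(gate) * up

        def geglu(gate, up):
            # GeGLU: the same gating pattern, with GELU in place of SiLU
            return F.gelu(gate) * up

        def rms_layernorm(x, weight, eps=1e-6):
            # RMS layernorm: scale by the reciprocal root-mean-square of the last dim
            variance = x.pow(2).mean(-1, keepdim=True)
            return weight * x * torch.rsqrt(variance + eps)

    The fused kernels compute the same values; the win is doing it in one pass over memory instead of several.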

  • mynameismon 2 hours ago

    How easy is it to run on older GPUs (think 1080 Tis)? I ask because torch.compile refuses to support them, and that alone makes things much slower.
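
    For context: torch.compile's default Inductor backend emits Triton kernels, and Triton needs compute capability >= 7.0, while a 1080 Ti is Pascal (sm_61). A minimal gate for this, as a sketch:

        import torch

        def can_use_torch_compile():
            # Inductor's generated Triton kernels need a Volta-or-newer GPU
            # (compute capability >= 7.0); a 1080 Ti reports (6, 1).
            return torch.cuda.is_available() and \
                torch.cuda.get_device_capability() >= (7, 0)

        model = torch.nn.Linear(8, 8).cuda()
        if can_use_torch_compile():
            model = torch.compile(model)
        # otherwise stay in eager mode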

    • danielhanchen an hour ago

      The other issue is Pascal cards don't have tensor cores, so they're much slower than cards that do. You could try Unsloth for 2x faster Llama fine-tuning - someone got P40s and P100s working. Although I would suggest upgrading to at least an RTX 20-series card.

    • almostgotcaught 2 hours ago

      > torch.compile

      torch.compile is a PyTorch 2.0 feature and has nothing to do with handwritten CUDA kernels

      > How easy is it to run on older GPUs

      ThunderKittens is a torch C++ extension:

      https://github.com/HazyResearch/ThunderKittens/blob/8daffc9c...

      so you're going to hit the exact same issue (whatever issue you're having)
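
      To make "torch C++ extension" concrete, here is a minimal toy sketch of the mechanism (a hypothetical extension, not ThunderKittens' actual source). The C++ gets compiled on your machine against your local toolchain and GPU, which is why GPU support comes down to the kernel code itself rather than torch.compile:

          import torch
          from torch.utils.cpp_extension import load_inline

          # Hypothetical toy extension, compiled locally at import time, so
          # architecture limits (e.g. tensor-core-only kernels on Pascal)
          # surface here, at build or run time.
          cpp_source = """
          #include <torch/extension.h>
          torch::Tensor double_it(torch::Tensor x) { return x * 2; }
          """

          ext = load_inline(name="toy_ext",
                            cpp_sources=cpp_source,
                            functions=["double_it"])
          print(ext.double_it(torch.ones(3)))  # tensor([2., 2., 2.])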