Forcing Flash Attention onto a TPU and Learning the Hard Way

(archerzhang.me)

46 points | by azhng 5 days ago

11 comments

  • ColonelPhantom an hour ago

    Interesting read! One remark though: I'm not too familiar with the architecture of a Google TPU, but comparing the TPU's VMEM with Nvidia's shared memory feels wrong to me.

    Looking at its size and its shared nature, it feels far more natural to compare it with the L2 cache, which is also shared across the entire GPU and is in the same order of size (40 MB on the listed A100).
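    The capacity argument above can be made concrete. Only the A100's 40 MB L2 figure comes from this thread; the other numbers are illustrative approximations (per-SM shared memory is roughly 164 KB usable on an A100, and the exact VMEM size varies by TPU generation):

    ```python
    # Rough on-chip capacity comparison. The 40 MB L2 figure is from the
    # comment above; shared-memory and VMEM figures are approximate and
    # only meant to show orders of magnitude.
    MIB = 1024 * 1024

    a100_shared_mem_per_sm = 164 * 1024  # per-SM scratchpad, ~164 KB usable (approx.)
    a100_l2_cache = 40 * MIB             # shared across the whole GPU (from the thread)
    tpu_vmem = 32 * MIB                  # illustrative: same order as the L2, varies by chip

    # VMEM is hundreds of times larger than a single SM's shared memory,
    # but within an order of magnitude of the L2, so by capacity alone
    # the L2 is the closer analogue.
    assert a100_l2_cache / a100_shared_mem_per_sm > 100
    assert 0.1 < tpu_vmem / a100_l2_cache < 10
    ```

    By access pattern the analogy is less clean (VMEM is software-managed like a scratchpad, while L2 is a hardware cache), but on size and chip-wide sharing the L2 comparison holds up.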

  • FL33TW00D 3 hours ago

    Why ruin good work by letting Claude write it all? Full of em dashes, riddled with Claudisms.

    • gdiamos 2 hours ago

      I personally don't mind letting Claude write about work.

      You could spend 80% doing the work and 20% writing about it, or 99% doing the work and 1% copy-pasting Claude's writeup about it into a blog.

      There is nothing wrong with writing if you are into it, and yes, you can probably do better than Claude, but I can relate to engineers who just want to build.

      • spzb 2 hours ago

        If you can’t be bothered to write it, why should I bother to read it?

        • cannonpr 24 minutes ago

          Because it contains information of value to you? If it doesn't, just don't read it.

      • selfhoster11 2 hours ago

        I could spend 100% doing the work with my own Claude, and 0% reading yours. That's a negative-sum outcome. I do think that the 80%/20% split is better (though anything that is mostly human voice is fine for me).

      • Groxx an hour ago

        Because the failures are so frequent, and so often load-bearing, that even attempting to read stuff that appears generated is negative-sum.

    • skybrian an hour ago

      Why let an obsession with writing style prevent you from learning from a reasonably decent writeup?

  • gdiamos 4 hours ago

    One of my lessons from using different accelerators, whether different NVIDIA generations or a GPU->TPU move, is that someone needs to do this work of indexing, partitioning, mapping, scheduling, and benchmarking. That work is labor-intensive.

    In this case, Google has already done it, and that will generally be true when a well-resourced accelerator company like Google is behind the most popular operations, like attention.

    As long as you use those operations, you are okay. But if you do something different, you need to be prepared to do all of this yourself.

  • refulgentis 4 hours ago

    It broke my heart to have a visceral "I'm being slop'd" reaction reading this: it's such good work, and AI's barely used AFAICT, but there are enough odd transitions and copy-pasta'd markdown that you get the subconscious "this is AI" reaction regardless.

    Many sentences are 3x as long as they normally would be, in subtle ways (to wit: "My flash attention is 35x slower than the fused standard at n=4096. Not a little worse. Catastrophically worse."), and it really wears on attention (pun intended). It brings literary voice to a technical blog post, and a very difficult, process-oriented one at that. I have to reallocate my unfortunately-limited brain cells from "maintaining state of where we are in the process" to "is this cutesy fluff or important?", and I've never had to do that in 37 years with technical blog posts.

    The Markdown gets bad. Bolding is used for important phrases (like a human would), then, all of a sudden, after the "Inside a TPU chip" header, it's being used every other sentence, on anything that is a proper noun or would have a Wikipedia article. It got so weird that at some point I thought "a human definitely didn't let this through... they must be links?" and tried clicking them.

    It's doubly bad at that point, because markdown tables start coming in hot and heavy too. So you're left with "it's pretty apparent the LLM did it from here, and I can't keep the state of the process in my head while also figuring out if the bolding matters." Reflexive close tab.

    • jacquesm 2 hours ago

      You got a lot further than I did.