Calculating GPT-2's Inference Speedups

(njkumar.com)

2 points | by njkumarr 14 hours ago

2 comments

  • p1esk 13 hours ago

    Good post, thank you!

    > On an A100 80GB we get 312 teraflops per second of float16 compute and 1.5 TB/s of memory bandwidth, and this ratio comes out to roughly 208 tokens.
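    That ratio is just peak FLOPs divided by peak bandwidth; as a quick sketch using the quoted peak specs:

        peak_flops = 312e12  # A100 fp16 tensor-core peak, FLOPs/s
        mem_bw = 1.5e12      # A100 80GB HBM bandwidth, bytes/s

        # ops:byte ratio, i.e. FLOPs available per byte moved
        print(peak_flops / mem_bw)  # 208.0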

    A few thoughts:

    1. One token != one byte

    2. Your prompt ("Edgar Allan Poe is a") is short (<<300 tokens)

    3. Both the FLOPS and memory bandwidth figures for the A100 are theoretical maximums. Achieved performance is usually very different and workload-dependent (a quick check is sketched below).
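    For example, one rough way to see the gap on point 3 is to time a large fp16 matmul and compare against the 312 TFLOPS peak (a sketch only; the matrix size is arbitrary):

        import time
        import torch

        n = 8192
        a = torch.randn(n, n, dtype=torch.float16, device="cuda")
        b = torch.randn(n, n, dtype=torch.float16, device="cuda")

        for _ in range(3):  # warmup
            a @ b
        torch.cuda.synchronize()

        t0 = time.time()
        for _ in range(10):
            a @ b
        torch.cuda.synchronize()
        elapsed = time.time() - t0

        achieved = 10 * 2 * n**3 / elapsed  # 2*n^3 FLOPs per n x n matmul
        print(f"{achieved / 1e12:.0f} TFLOPS achieved vs 312 peak")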

    • njkumarr 7 hours ago

      Thank you for taking the time to read my article!

      For your 2nd point, to clarify: I actually generate 300 new tokens on top of that initial prompt, rather than just running the short prompt, so with precomputation of the prompt + token generation it comes out to about 306 tokens.
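      (The prompt's token count can be double-checked with a GPT-2 tokenizer; tiktoken here is just one convenient option, not necessarily what the post used:)

          import tiktoken

          enc = tiktoken.get_encoding("gpt2")
          # ~306 total = 300 generated + the prompt's token count
          print(len(enc.encode("Edgar Allan Poe is a")))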

      For your 1st and 3rd points you are definitely correct. Looking back, I probably should have used the torch profiler to track the point at which my CPU overhead started to decrease, to better assess the compute-bound regions in my workflow, rather than relying on napkin math from A100 specs.
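      Something along these lines is what I had in mind (a minimal torch.profiler sketch; the HF-style GPT-2 setup is an assumption, the post's actual code may differ):

          import torch
          from torch.profiler import profile, ProfilerActivity
          from transformers import GPT2LMHeadModel, GPT2Tokenizer

          # assumed setup: HF GPT-2 plus the prompt from the post
          model = GPT2LMHeadModel.from_pretrained("gpt2").cuda().eval()
          tok = GPT2Tokenizer.from_pretrained("gpt2")
          inputs = tok("Edgar Allan Poe is a", return_tensors="pt").input_ids.cuda()

          with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
              with torch.no_grad():
                  model.generate(inputs, max_new_tokens=300)

          # a large CPU share relative to CUDA time suggests launch overhead,
          # i.e. the GPU is underutilized rather than compute-bound
          print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))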