Good post, thank you!
On an A100 80GB we get 312 TFLOPS of float16 compute and 1.5 TB/s of memory bandwidth, and this ratio comes out to roughly 208 tokens.
Few thoughts:
1. One token != one byte: the 312 TFLOPS / 1.5 TB/s ratio has units of FLOPs per byte, not tokens (rough sketch of the arithmetic after this list).
2. Your prompt ("Edgar Allan Poe is a") is short (<<300 tokens)
3. Both the FLOPS and memory bandwidth figures for the A100 are theoretical maximums. Reality is usually very different and workload dependent.
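To make point 1 concrete, here is a rough sketch of the arithmetic. The per-parameter FLOP and byte counts below are the usual decode-step rules of thumb (2 FLOPs per parameter per token, 2 bytes per fp16 weight), not anything measured from your workload:

```python
# Back-of-envelope roofline numbers for an A100, using the published peak specs
# (not measured values).
peak_flops = 312e12   # float16 tensor-core FLOP/s
peak_bw = 1.5e12      # HBM bandwidth in bytes/s

ridge = peak_flops / peak_bw
print(f"{ridge:.0f} FLOPs per byte")   # ~208 -- units are FLOPs/byte, not tokens

# Turning that into a token count needs extra assumptions. For a weight-bound
# decode step: ~2 FLOPs per parameter per token, 2 bytes per fp16 weight, so
# each in-flight token adds roughly 1 FLOP per byte of weight traffic.
flops_per_param_per_token = 2
bytes_per_param = 2
tokens_to_saturate = ridge * bytes_per_param / flops_per_param_per_token
print(f"~{tokens_to_saturate:.0f} tokens in flight before compute becomes the bottleneck")
```

The ~208 only becomes a token count once you commit to assumptions like these, and per point 3 real kernels land well below the peak numbers anyway.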
Thank you for taking the time to read my article!
For your 2nd point, to clarify: I actually generate 300 new tokens on top of that initial prompt rather than just running the short prompt itself, so with the prompt prefill plus the generated tokens it comes out to about 306 tokens.
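Roughly, the setup looks like this (a simplified sketch; gpt2 and the sampling flag are stand-ins, not necessarily what the article actually uses):

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Illustrative stand-in checkpoint; the article's exact model may differ.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "Edgar Allan Poe is a"
inputs = tok(prompt, return_tensors="pt")
print(inputs.input_ids.shape[1])   # a handful of prompt tokens (~6)

out = model.generate(**inputs, max_new_tokens=300, do_sample=True)
print(out.shape[1])                # prompt tokens + 300 generated, ~306 total
```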
For your 1st and 3rd points you are definitely correct. Looking back, I probably should have used the torch profiler to track the point at which my CPU overhead started to decrease, in order to better assess the compute-bound regions of my workflow, rather than doing napkin math on A100 specs.
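Something along these lines is what I have in mind, as a minimal torch.profiler sketch (same stand-in model as above; the chosen activities and sort key are just one reasonable way to slice it):

```python
import torch
from torch.profiler import profile, ProfilerActivity
from transformers import AutoTokenizer, AutoModelForCausalLM

# Same stand-in model and prompt as in the sketch above.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").cuda()
inputs = tok("Edgar Allan Poe is a", return_tensors="pt").to("cuda")

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    with torch.no_grad():
        model.generate(**inputs, max_new_tokens=300)

# Compare CPU-side time (Python dispatch / kernel launch overhead) against
# CUDA time per op; where CPU time stops dominating is where the workload
# actually becomes GPU-bound.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))
```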