Smart call on the tiered lookup, hitting SQLite first and falling back to FLOPs/TFLOPS estimation. One thing I'm wondering about the 20% overhead in Tier 2: does that factor in framework overhead or just raw model weights? That margin can vary a lot depending on whether you're running PyTorch vs ONNX.
The 20% is a safety margin on the memory fit check only. It sits on top of the raw weights-only figure (params × bytes-per-precision) to account for KV cache and activation tensors, not framework differences specifically.
Your point is valid, but I think it applies to a different layer. PyTorch vs ONNX overhead is real, but it's implicitly captured in the throughput path: Tier 2 scales from real-world benchmarks that already reflect whatever framework ran them. The 20% is intentionally conservative. It'll occasionally say a model won't fit when it technically could, but it won't tell you something fits and then OOM you.
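To make the fit check concrete, here's a minimal sketch of the weights-plus-margin logic described above. The names (`SAFETY_MARGIN`, `fits_in_vram`) and the precision table are illustrative assumptions, not the actual implementation.

```python
# Hypothetical sketch of the Tier 2 memory fit check.
# Weights-only estimate (params x bytes-per-precision), then a 20%
# safety margin for KV cache and activation tensors.

BYTES_PER_PRECISION = {"fp32": 4, "fp16": 2, "int8": 1, "int4": 0.5}
SAFETY_MARGIN = 1.20  # 20% headroom, intentionally conservative

def fits_in_vram(params_billions: float, precision: str, vram_gb: float) -> bool:
    """Return True only if weights plus margin fit in available VRAM."""
    weights_gb = params_billions * BYTES_PER_PRECISION[precision]
    return weights_gb * SAFETY_MARGIN <= vram_gb

# A 7B model at fp16 needs ~14 GB of weights, ~16.8 GB with margin:
print(fits_in_vram(7, "fp16", 16))  # False: refuses the tight fit rather than risk OOM
print(fits_in_vram(7, "fp16", 24))  # True
```

The second call shows the conservative bias in action: 14 GB of weights would technically load on a 16 GB card, but the margin rejects it.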