High-bandwidth flash progress and future

(blocksandfiles.com)

13 points | by tanelpoder 4 days ago

3 comments

  • jauntywundrkind an hour ago

    The potential here with High-Bandwidth Flash is super cool. Effectively trying to go from 8 or a dozen flash channels to having a hundred or hundreds of channels would be amazing:

    > The KAIST professor discussed an HBF unit having a capacity of 512 GB and a 1.638 TBps bandwidth.
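
    Back-of-envelope on the channel count, assuming ~2 GB/s per flash channel (an ONFI-class guess on my part; the article doesn't give per-channel figures):

      unit_bw = 1.638e12        # bytes/s, from the quoted KAIST figure
      chan_bw = 2.0e9           # assumed ~2 GB/s per flash channel
      print(unit_bw / chan_bw)  # ~819 channels' worth of parallelism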

    One weird thing about this would be that it's still NAND flash, and NAND flash has limited write endurance, typically a few thousand program/erase cycles (vendors usually quote it as Drive Writes Per Day over a 5-year warranty). If you can load a model & just keep querying it, that's not a problem, since reads don't wear the cells the way writes do. Maybe the write volume is small enough to not be so bad, but my gut is that writing context here too might present difficulty.
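
    A quick wear-out sketch in Python (the ~3,000 P/E-cycle endurance is my assumption; real HBF endurance figures aren't public):

      capacity = 512e9       # bytes, the 512 GB unit from the article
      pe_cycles = 3_000      # assumed NAND program/erase endurance
      write_bw = 1.638e12    # bytes/s, if writes ran at full bandwidth

      budget = capacity * pe_cycles    # ~1.5 PB lifetime write budget
      print(budget / write_bw / 60)    # ~15.6 minutes of sustained writes

    Nobody would write at the full rate continuously, of course, but it shows why a read-mostly workload (serve the model, don't rewrite it) is the comfortable fit.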

    • digiown an hour ago

      I assume the use case is that you're an inference provider, and you park a bunch of models you might want to serve in the HBF so you can quickly swap them in and out on demand.
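
      Rough math on how fast a swap could be at the quoted bandwidth (the model size is illustrative):

        hbf_bw = 1.638e12      # bytes/s, from the article
        model = 70e9 * 2       # a 70B-parameter model at FP16, illustrative
        print(model / hbf_bw)  # ~0.085 s to stream the whole model in
        print(512e9 / hbf_bw)  # ~0.31 s to refill the entire 512 GB unit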

      • jauntywundrkind an hour ago

        I think the hope is to run directly off of HBF, eventually replacing RAM with it entirely. 1.5 TB/s is a pretty solid number! It's not going to be easy; it doesn't just drop in as a replacement (vastly bigger latency), but HBF standing in for HBM to get gobs of bandwidth is the intent, I believe.
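
        To put rough numbers on that latency gap (both figures are typical published ballparks, not HBF specs):

          hbm_latency_ns = 100     # HBM random access, order of magnitude
          nand_read_ns = 25_000    # typical NAND page read, ~25 us
          print(nand_read_ns / hbm_latency_ns)  # ~250x worse per access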

        Kioxia & Nvidia are already talking about 100M IOPS SSDs directly attached to GPUs. This is less about running the model & more about offloading context for future use, but Nvidia is pushing KV cache to SSD, and using BlueField-4, which has PCIe on it, to attach SSDs and process data there. https://blocksandfiles.com/2025/09/15/kioxia-100-million-iop... https://blocksandfiles.com/2026/01/06/nvidia-standardizes-gp... https://developer.nvidia.com/blog/introducing-nvidia-bluefie...
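
        As a sketch of the KV-cache-to-SSD idea (not Nvidia's actual API; the file path and tensor shapes here are made up):

          import numpy as np

          # Spill a transformer KV cache to an NVMe-backed file, then fault it
          # back in later instead of recomputing prefill for a returning user.
          layers, heads, seq, dim = 32, 8, 4096, 128
          kv = np.memmap("/mnt/nvme/kv_cache.bin", dtype=np.float16, mode="w+",
                         shape=(layers, 2, heads, seq, dim))  # 2 = keys, values

          kv[0, 0, :, :1024, :] = 0.5   # stand-in for real key activations
          kv.flush()                    # persist to the SSD

          restored = np.array(kv[0, 0, :, :1024, :])  # copy back in on demand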

        We've already got DeepSeek running straight off NVMe, with the weights living there. Slowly, but this maybe could scale. https://www.reddit.com/r/LocalLLaMA/comments/1idseqb/deepsee...
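
        The NVMe-weights trick is basically memory-mapping; a minimal sketch (path and shapes are hypothetical; llama.cpp does the real version):

          import numpy as np

          # Map the weight file instead of loading it; the OS pages in only
          # the layers/experts a forward pass actually touches.
          w = np.memmap("/mnt/nvme/weights.bin", dtype=np.float16, mode="r")
          layer0 = w[:4096 * 4096].reshape(4096, 4096)  # ~32 MB on first touch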

        Kioxia for example has AiSAQ, which works in a couple of places such as Milvus; it's not 100% clear to me exactly what's going on there, but it's trying to push work down to the NVMe device. And with NVMe 2.1 adding computational storage, I expect we'll see more work pushed to the SSD.
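
        Toy sketch of the disk-resident search pattern AiSAQ descends from (DiskANN-style; not Kioxia's code, just the shape of the idea):

          import numpy as np

          dim, n = 128, 1_000_000
          vecs = np.memmap("/mnt/nvme/vectors.bin", dtype=np.float32,
                           mode="r", shape=(n, dim))

          def nearest(query, candidate_ids):
              # Each vecs[i] access is roughly one small NVMe read -- the
              # high-IOPS pattern computational storage could absorb on-drive.
              return min((float(((vecs[i] - query) ** 2).sum()), i)
                         for i in candidate_ids)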

        These aren't directly the same thing as HBF. A lot of it is caching, but I also tend to think there's an aspiration to move some work out of RAM, not merely to load into RAM faster.