The KV cache is order-dependent: each entry depends on all the tokens that come before it in the context.
There are some transformation approaches to reuse the KV cache across inferences, but none are in wide use because of accuracy concerns after the transformation.
The paper has a section on "Reusing precomputed KV across queries" that covers how other papers have tried to address this problem, but yeah, the paper adds nothing of its own besides a catchy title.
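To make the order-dependence point concrete, here is a toy sketch, not any real model: one causal attention layer with a residual connection, whose outputs stand in for what a second layer would cache as keys. The embeddings, the identity key projection, and the names `EMBED`, `causal_attention`, and `second_layer_keys` are all made up for illustration.

```python
import math

# Hypothetical 2-dimensional token embeddings, chosen only for illustration.
EMBED = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def causal_attention(states):
    """Each position attends to itself and to earlier positions only."""
    out = []
    for i, query in enumerate(states):
        scores = [dot(query, states[j]) for j in range(i + 1)]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        mixed = [
            sum(w * states[j][d] for j, w in enumerate(weights)) / total
            for d in range(len(query))
        ]
        # Residual connection: the input plus the attended context.
        out.append([x + m for x, m in zip(query, mixed)])
    return out


def second_layer_keys(tokens):
    """Vectors a second layer would cache as keys.

    Assumption: the key projection is the identity, to keep the toy small.
    """
    return causal_attention([EMBED[t] for t in tokens])


# Same token "c", different preceding context: the cached vector differs.
key_after_a = second_layer_keys(["a", "c"])[-1]
key_after_b = second_layer_keys(["b", "c"])[-1]
print(key_after_a != key_after_b)

# Same prefix, longer continuation: the prefix entries are unchanged.
print(second_layer_keys(["a", "b", "c"])[:2] == second_layer_keys(["a", "b"]))
```

Both checks print `True` in this toy: the entry for "c" changes with what precedes it, while entries for a shared prefix stay identical, which is why reusing a prefix is safe and reusing an arbitrary segment is not.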
Just curious, do you have links to read more about transformations or other techniques for KV cache reuse?
All major model providers offer prefix caching, which is what this is.
No, reusing segments of the KV cache for different purposes in an order-independent manner is an active research area.
Any keywords or papers I can search for?
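For anyone following the distinction, a minimal sketch of what prefix caching buys you, assuming a cache keyed on token IDs. This is not how any provider or llama.cpp actually implements it; `PrefixCache`, `shared_prefix_len`, and the token IDs are hypothetical.

```python
def shared_prefix_len(cached, request):
    """Count the leading tokens that two sequences have in common."""
    n = 0
    for x, y in zip(cached, request):
        if x != y:
            break
        n += 1
    return n


class PrefixCache:
    """Tracks token sequences whose KV entries have already been computed."""

    def __init__(self):
        self._entries = []

    def store(self, tokens):
        self._entries.append(list(tokens))

    def reusable_tokens(self, tokens):
        """How many leading tokens of the request can skip prefill."""
        return max(
            (shared_prefix_len(entry, tokens) for entry in self._entries),
            default=0,
        )


system_prompt = [1, 2, 3, 4]           # shared by every agent
agent_a = system_prompt + [10, 11]
agent_b = system_prompt + [20]
reordered = [10, 11] + system_prompt   # same segments, different order

cache = PrefixCache()
cache.store(agent_a)
print(cache.reusable_tokens(agent_b))    # 4: the shared system prompt
print(cache.reusable_tokens(reordered))  # 0: nothing matches from the start
```

Agents that share a system prompt get the shared part for free; the moment the same segment shows up at a different position, a prefix cache has nothing to offer, and that gap is where the order-independent reuse work lives.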
> Then the part that matters: where the KV lives
When your abstract was clearly generated by an LLM and not curated to at least make it sound human, it does not make me want to read your paper.
Seems Cloudflare is now doing this for scraping, so makes sense to continue down the pipeline!
This paper doesn't make any sense. For background, I've maintained an AI client that's cross-platform, cross-provider, and integrates llama.cpp since 2022. I don't know why they think "agents" don't share prefill work: paid providers cache on the prefill text, llama.cpp does the same, and I specifically hooked up llama.cpp so it can handle subsets as well, i.e. all the agents would reuse the cache.
It reads like it started from an underspecification of "agents" crossed with a strain of pop wisdom about "KV cache" that I've seen enter mainstream discourse over the past 3 months and that is Not Even Wrong; then they solved a non-existent problem.
EDIT: flagging, based on the rest of the comments either requesting a primer on terms or pointing out that it makes errors in even more obvious ways.
I don't think Luoyuan Zhang is necessarily doing this, but I'm pretty sure lots of people are using arXiv as a glorified blog and hoping no one notices.
Does anyone have a good recommendation for an explainer or primer on the KV cache?
convert this question to KV cache and give it to your agent
A truly global singleton
Lambda computing for prompts?