Hybrid Attention

35 points | by JohannaAlmeida 7 hours ago

11 comments

  • bigbadfeline 2 hours ago

    I've been interested in faster attention and smaller models for some time, but I haven't had the time to do serious research, so I can't answer your questions.

    However, everything you do sounds very interesting, useful, and well thought out. Please keep doing it; I'd encourage others to work in the same direction too.

    I hope more of us can find the time for more than best wishes in the near future.

  • hackerman70000 3 hours ago

    For the evaluation question: for small code models, try-to-compile rate on generated functions is the simplest metric that actually correlates with usefulness. Perplexity tells you the model learned the distribution; compilation rate tells you it learned the structure. Beyond that, exact match on function-body completion given a signature is more informative than open-ended generation benchmarks.
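A try-to-compile metric like the one described above fits in a few lines. This is a minimal sketch, not the commenter's actual harness; the helper names `compiles` and `compile_rate` are my own, and it uses Python's built-in `compile()` as the pass/fail check:

```python
def compiles(source: str) -> bool:
    """Return True if the generated source at least compiles as Python."""
    try:
        compile(source, "<generated>", "exec")
        return True
    except SyntaxError:
        return False

def compile_rate(samples: list[str]) -> float:
    """Fraction of generated function bodies that compile."""
    if not samples:
        return 0.0
    return sum(compiles(s) for s in samples) / len(samples)

# Example: one valid function, one with a syntax error
rate = compile_rate(["def f(x): return x + 1", "def g(: pass"])  # 0.5
```

For compiled languages the same idea works by shelling out to the compiler and counting zero exit codes.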

  • JohannaAlmeida 6 hours ago

    Full attention O(n²): 17.96s / 5.6 tok/s

    HybridAttention O(n·W + n·D): 0.35s / 286.6 tok/s
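For readers curious what those asymptotics imply, here is a back-of-envelope count of attention score computations. The sequence length, window size W, and compressed-context size D below are illustrative values I picked, since the thread doesn't state the actual config:

```python
def full_attention_ops(n: int) -> int:
    """O(n^2): every query attends to every key."""
    return n * n

def hybrid_attention_ops(n: int, w: int, d: int) -> int:
    """O(n*W + n*D): local window of W keys plus D compressed/global slots."""
    return n * w + n * d

# Assumed example config (not from the thread):
n, w, d = 4096, 128, 64
speedup = full_attention_ops(n) / hybrid_attention_ops(n, w, d)  # ~21.3x fewer scores
```

The measured ~50x wall-clock speedup in the comment above would also reflect KV cache effects and memory bandwidth, not just the raw score count.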

  • empath75 6 hours ago

    Is this just for autocomplete? Because you are not going to get anything very useful out of a code-only training set.

    • JohannaAlmeida 5 hours ago

      Yeah, autocomplete is an amazing use case. I needed a small model that used transformers and could fit on my weak consumer GPU.

      So I needed to make fundamental architecture changes and do some KV cache tricks.

      And then prove the new architecture was faster with benchmarks and that perplexity was acceptable.

    • bigbadfeline 2 hours ago

      Well, coding is a kind of extended autocomplete. I prefer that way of working because I don't like the mess created by LLMs when you let them work on their own. Smaller models, specialized in a single language, make a lot of sense.

    • altruios 5 hours ago

      I think it's more a proof of concept: locally trained. It would take lots of resources/time to train something non-trivial.

  • woodson 5 hours ago

    Look into RWKV.

    • JohannaAlmeida 5 hours ago

      Yeah, RWKV is definitely related in spirit (recurrent state for long context). Here I'm combining local windowed attention with a gated recurrent path plus KV cache compression, so it's more of a hybrid than a full replacement of attention.
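The hybrid idea described in that reply can be sketched roughly as follows. This is a toy single-step decoder, not the poster's implementation: the window size, the EMA-style recurrent update, and the scalar gate are all assumptions I've made for illustration:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def hybrid_attention_step(q, keys, values, state, window=4, gate=0.9):
    """One decode step: attend over only the last `window` KV pairs,
    then mix in `state`, a recurrent summary of everything older."""
    k = keys[-window:]               # (w, d) local keys
    v = values[-window:]             # (w, d) local values
    scores = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # (w,) attention weights
    local = scores @ v               # windowed attention output, (d,)
    out = gate * local + (1 - gate) * state           # gated mix of the two paths
    # Update the recurrent state with an exponential moving average of values;
    # a real implementation would use a learned gate and update rule.
    new_state = gate * state + (1 - gate) * v[-1]
    return out, new_state
```

The point of the sketch: per-token cost depends on `window` (plus the fixed-size state), not on the full sequence length, which is where the O(n·W + n·D) scaling comes from.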

