27 comments

  • v9v 24 minutes ago

    Somewhat relevant is a blog-post that likens attention to kernel smoothing: https://bactra.org/notebooks/nn-attention-and-transformers.h... (as discussed before in https://news.ycombinator.com/item?id=38756888)
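
    A minimal sketch of that kernel-smoothing reading (not taken from the linked post): assuming scalar values and an exponential kernel exp(q·k/sqrt(d)), single-query softmax attention and a Nadaraya-Watson weighted average compute the same number. The function names are hypothetical.

```python
import math

def dot(a, b):
    """Plain dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))

def softmax_attention(query, keys, values):
    """Single-query scaled dot-product attention over scalar values."""
    d = len(query)
    scores = [dot(query, key) / math.sqrt(d) for key in keys]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return sum(w / total * v for w, v in zip(weights, values))

def kernel_smoother(query, keys, values):
    """Nadaraya-Watson estimate with kernel K(q, k) = exp(q.k / sqrt(d))."""
    d = len(query)
    kern = [math.exp(dot(query, key) / math.sqrt(d)) for key in keys]
    return sum(kw * v for kw, v in zip(kern, values)) / sum(kern)

# Toy inputs, chosen only for illustration.
query = [1.0, 0.5]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [1.0, 2.0, 3.0]
print(softmax_attention(query, keys, values), kernel_smoother(query, keys, values))
```

    The two results agree up to floating-point rounding, because the softmax normalization is the kernel smoother's denominator.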

  • amluto 7 hours ago

    Hint for authors: when discussing linear algebra (or really most other kinds of math), follow normal conventions. In this case, the convention would be that - (the minus sign) means subtraction. It does not mean "and also", especially when you sandwich it between two variables that represent matrices.

    I read the paper with much head scratching all the way through sections 1 and 2 and part of 3 before I figured out that, no, really, the description "Q-K=V" does not mean "Q minus K equals V" (the head scratching was because a bunch of their descriptions and symmetry comments really make little sense if you think "Q minus K equals V"). If you want to say that "K equals V", please spell it "K=V" :)

    I am curious whether it makes any sense at all to enforce a more general linear constraint on the query, key and value attention matrices along the line of Q-K=V.

    It is an entertaining paper. I admit I'm surprised that K=V appears to work as well as it does -- it seems like it's almost enforcing a sort of model where the query is a guess as to what the value is and the attention head returns a (softmaxed) value that is closest to the query's guess. Maybe it works because the sequences are short and the dimension is high and there's plenty of room for interesting results to fit in the merged key/value space.
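
    A rough sketch of that "query is a guess at the value" reading, under loud assumptions: the keys are taken to be the values themselves with no learned projections (the paper presumably ties projection weights instead), the values are unit-norm so "closest" by dot product matches closest by distance, and `tied_kv_attention` and `temperature` are hypothetical names.

```python
import math

def tied_kv_attention(query, values, temperature=1.0):
    """Attention in which each key is the value itself (assumed K=V reading)."""
    scores = [sum(q * v for q, v in zip(query, val)) / temperature for val in values]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    weights = [w / total for w in weights]
    dim = len(values[0])
    # Weighted average of the values, one coordinate at a time.
    return [sum(w * val[i] for w, val in zip(weights, values)) for i in range(dim)]

# Unit-norm toy values; the query is a noisy guess at the first one.
values = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
guess = [0.9, 0.1]
print(tied_kv_attention(guess, values, temperature=0.05))
```

    With a low temperature the softmax is sharp and the output lands almost exactly on the value nearest the guess; at higher temperatures it blends the candidates.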

    • amemi 6 hours ago

      > Maybe it works because the sequences are short and the dimension is high and there's plenty of room for interesting results to fit in the merged key/value space.

      In fact, on the second-to-last page of the paper, they discuss this very problem. There is a clear correlation between performance and increasing sequence lengths for the Q-K=V model. While limited to a tight n=3 sample of lengths 512, 1024, and 2048, the degradation decreases from 5.4% to 2.2% as context is increased, suggesting that it is unlikely shorter sequences are the reason K=V performs acceptably.

    • xiaoyu2006 7 hours ago

      Yeah, the weird notation confused me too. Their own Limitations section also says their experiments are too small. I am quite curious how it will play out at scale, but unironically I cannot afford the hardware lol.

    • kanbankaren 4 hours ago

      It confused me too.

      An n-tuple notation would have been more readable and mathematically accurate, like (Q=K, V), (Q, K=V), and (Q=K=V).

    • ssivark an hour ago

      Would it have killed them to use a comma instead?!

    • canjobear 2 hours ago

      It’s not typeset in math mode so you can’t expect the hyphen to correspond to minus.

    • sfink 2 hours ago

      Wha? Why didn't they use Q=K=V for that?

    • semiinfinitely 4 hours ago

      It's not a math paper

      • volemo 2 hours ago

        Does it not being an English philology paper mean they are free to spell “fish” as “ghoti”?

      • srean an hour ago

        Definitely an applied maths paper given that it has been published under CS/ML and been accepted at ICML.

  • Lerc 6 hours ago

    I can see why QKV gets used, but I can't help but think that there's got to be a better mechanism for turning a pair of vectors into a new vector and a significance field.

    Geometrically I imagine the process of attention like picking up a bunch of vectors and spinning and squishing them in many-D until you can find a crack where you can see all the way through, then leveraging that crack to separate what you want.

    I doubt that's strictly accurate, but it might be close enough that it makes me think that if you were doing that with a bunch of bananas, it would be much easier to find the way through if you could also bend the bunch so they were all straight.

    It's always the trade off of a smart complex operation against an absolute crapload of dumb ones.

    • nullpoint420 an hour ago

      It kinda reminds me of general relativity and gravity bending space-time. I'm sure I sound nuts right now, but the model fits in my head.

  • foldl2022 4 hours ago

    Gemma-4 E2B/E4B models reuse the K-V cache from other layers, which does things in a "transposed" way: rather than reusing Q/K/V matrices within a single layer, they reuse them across different layers.

  • hollosi an hour ago

    I would not be surprised if it turned out the exact attention mechanism does not really matter, similarly to the sigmoid, ReLU, GELU movement, only the speed of calculation - and QKV is pretty good at that on GPUs.

  • in-silico 7 hours ago

    These types of ablation studies are always good. However, I'm not sure how generalizable the language model findings here are.

    Their 1.2B model was trained on only 10B tokens, which is less than half of the chinchilla compute optimal number. Modern overtrained 1B LLMs are trained on the order of 10T tokens (1000x more).
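
    The arithmetic behind those figures can be sketched as follows, assuming the commonly cited Chinchilla rule of thumb of roughly 20 training tokens per parameter (an approximation, not a number from the paper under discussion); the helper name is hypothetical.

```python
TOKENS_PER_PARAM = 20  # assumption: approximate Chinchilla rule of thumb

def chinchilla_optimal_tokens(n_params, tokens_per_param=TOKENS_PER_PARAM):
    """Rough compute-optimal token count for a model with n_params parameters."""
    return n_params * tokens_per_param

params = 1.2e9          # 1.2B parameters, as stated above
trained_tokens = 10e9   # 10B tokens, as stated above
modern_tokens = 10e12   # ~10T tokens, as stated above

optimal = chinchilla_optimal_tokens(params)      # about 24B tokens
fraction = trained_tokens / optimal              # about 0.42, i.e. under half
overtrain_ratio = modern_tokens / trained_tokens # about 1000x

print(f"optimal ~ {optimal:.3g} tokens, trained fraction ~ {fraction:.2f}, "
      f"modern/paper ~ {overtrain_ratio:.0f}x")
```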

    This is important because, from my own experience, simplifications and alternatives to standard attention can look fine in the under-trained regime but lag after over-training. This happens because attention has very little out-of-the-gate inductive bias, so it takes a lot of training for the expressiveness to really shine through.

    I can't fault the authors since longer training runs cost money, but it warrants pointing out.

    I'm also disappointed that they didn't report reasoning benchmark results for the Q=K-V case, since that is by far the most theoretically interesting case (in my eyes).

    • janalsncm 4 hours ago

      It’s a data point. I could imagine in a hardware constrained setting we might not care about training on enormous token counts, and on smaller devices it’s great if we can simplify the architecture.

      I agree that this isn’t proof that it scales to trillions of tokens, but this does show a scaled up experiment would be worth a shot.

      • Philpax 3 hours ago

        The Chinchilla scaling laws give you a minimum for the number of tokens you should be using for a given size: if you can't meet what they suggest for that size, you should shrink the size, as, otherwise, the capacity of the model is going to waste.

        I do agree that it is a datapoint, but GP's point is that this model was undertrained, so it's hard to draw the same conclusions from it that we would from other research.

    • ACCount37 4 hours ago

      I wonder if some of those synthetics that specifically burn in attention inductive bias could help there - i.e. by getting attention to converge faster than it normally would?

  • semessier 3 hours ago

    V being collinear is obvious, the question is/was also which additional orthogonal projections such as camera position for vision would improve the transformer.

  • xiaoyu2006 7 hours ago

    It would be great and amusing if it actually turns out that we have been making transformers overly complex. The code repo is missing tho...

    • ares623 7 hours ago

      Gets the juices flowing though..

  • dnnddidiej 3 hours ago

    No one got fired for choosing QKV I guess

  • jephs 6 hours ago

    I'm terribly sorry, but scaling curves or GTFO. Any random pile of linear algebra works fine-ish at small scales. Very few random piles of linear algebra push the Pareto envelope at large scales.

    • ketchup32613 5 hours ago

      Do you want to see scaling curves wrt data and param size? I agree that 1.2B and 10B tokens is not representative, but what scale of parameters and dataset sizes would be convincing?

      • zxexz 5 hours ago

        Not to sound facetious, but perhaps enough runs at different param/token sizings to define a curve?

  • 7e 5 hours ago

    More evidence that the original Transformer authors didn't really know what they were doing, but they did have access to more cheap compute than anyone else.