Hacking Diffusion into Qwen3 for the Arc Challenge

(matthewnewton.com)

107 points | by mattnewton 4 days ago

10 comments

  • radarsat1 6 hours ago

    Regarding the typewriter approach, I've wondered for a while whether anyone has explored simple backtracking with LLMs. For example, let the LLM generate a backspace/delete token that "undoes" previously generated tokens while the underlying token stream stays append-only. Not sure how this would work with teacher forcing, but it seems feasible with RL.
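
    A rough sketch of the idea, purely illustrative and not from the article: the raw token trace stays append-only, but a backspace token pops the last token from the effective output. Here `next_token` stands in for a one-token sampling call; in an RL setup the reward would be scored on the visible output rather than on the trace.

    ```python
    # Hypothetical decoding loop with a backspace token (sketch only):
    # the trace the model conditions on is append-only, but a special
    # <BKSP> token removes the last token from the effective output.
    BKSP = "<BKSP>"
    EOS = "<EOS>"

    def decode_with_backspace(next_token, prompt_tokens, max_steps=256):
        trace = list(prompt_tokens)   # append-only history fed to the model
        visible = []                  # effective output after applying backspaces
        for _ in range(max_steps):
            tok = next_token(trace)   # assumed interface: sample one token
            trace.append(tok)
            if tok == EOS:
                break
            if tok == BKSP:
                if visible:
                    visible.pop()     # "undo" the most recent visible token
            else:
                visible.append(tok)
        return visible                # an RL reward would score this, not the trace
    ```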

    • dev_hugepages 22 minutes ago

      Research on the backspace token: https://arxiv.org/abs/2306.05426

      > [...] The IL framework also allows us to incorporate backtracking by introducing a backspace action into the generation process [...]

    • _diyar an hour ago

      With current LLMs, this is meaningless because the current state is stored in the "context" (system prompt, user prompt, chat output so far). So if you apply a backspace token, you just end up where you started a second ago.

      I.e., at state A you decided to append token i to move to state B. Removing token i just sets you back to state A, where you would again pick token i. (Note that this ignores the small probabilistic component in next-token selection.)

      In the RL/reasoning world of LLMs, you can instead just reward correct final output without policing the reasoning steps, and a strong model should learn to backtrack on its "thoughts" as appropriate (without removing them from the context).

      Edit: wording.
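
      A toy sketch of this determinism point, assuming greedy decoding; `greedy_next` is just a stand-in for taking the argmax over the model's logits, and any deterministic function of the context behaves the same way.

      ```python
      # Toy illustration: with greedy decoding, deleting the last token puts
      # the model back in an identical state, so it picks the same token again.
      def greedy_next(context):
          # stand-in for argmax over model logits; only determinism matters here
          return "tok_%d" % (hash(tuple(context)) % 1000)

      context = ["<prompt>"]
      first = greedy_next(context)
      context.append(first)      # state A -> append token i -> state B

      context.pop()              # "backspace": back to state A
      second = greedy_next(context)
      assert first == second     # the same token is chosen again
      ```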

  • mNovak 9 hours ago

    Really interesting to see the diffusion model solve the puzzles in an iterative way, which feels more similar to how I (and probably most humans) solve them.

    Outwardly, it seems to be limited by unmasking too few tokens per round, even when the heatmap shows many more high-confidence guesses available. On some of the larger puzzles it looks like it's wasting many rounds filling in the 'obvious' shapes, and then gets the interesting bit in the last round. It also doesn't seem to have learned the idea of "the background is blue with shapes drawn on top," where the background is often 50% of the solution in these puzzles.
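
    An illustrative sketch of the schedule being described, not the article's code: unmasking a fixed number of positions per round versus committing every still-masked position above a confidence threshold. The function, names, and numbers here are made up for illustration.

    ```python
    import numpy as np

    # Sketch of one unmasking round in a masked-diffusion decoder.
    # "confidence" holds the model's max softmax probability at each
    # still-masked position. A fixed per-round budget (k) can leave many
    # high-confidence positions masked, whereas a threshold rule would
    # commit them all at once.
    def positions_to_unmask(confidence, masked, k=8, threshold=None):
        masked_idx = np.where(masked)[0]
        conf = confidence[masked_idx]
        if threshold is not None:
            # commit every masked position the model is already confident about
            return masked_idx[conf >= threshold]
        # fixed budget: only the k most confident masked positions this round
        order = np.argsort(-conf)
        return masked_idx[order[:k]]

    rng = np.random.default_rng(0)
    confidence = rng.uniform(0.5, 1.0, size=100)
    masked = np.ones(100, dtype=bool)
    print(len(positions_to_unmask(confidence, masked, k=8)))              # 8
    print(len(positions_to_unmask(confidence, masked, threshold=0.95)))   # typically more than 8
    ```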

  • twotwotwo 8 hours ago

    It is kind of wild that most coding tasks are editing tasks, and we humans care a lot about code editing tools, but automated tools do editing via code generation, where a valid block must be generated top-to-bottom in one go.

    Fixing a mistake requires re-generating the file or block of code. Or, if something generated later has implications for earlier code--a new import, a newly required function parameter, something like that--the only option is to go back and re-generate a big chunk. That'd be inefficient for humans; it's not implausible that it's the wrong approach for other code generators too.

    I don't know if diffusion specifically will be the approach. (Maybe there's something to generating edit sequences?) This post's note that diffusion kills KV caching is something I hadn't even considered. It does seem right to experiment with things other than strict start-to-end generation.

    • namibj 5 hours ago

      You can still cache prompts; this only affects the cache for tokens produced during generation, and that's relatively harmless.

      • yorwba 4 hours ago

        If you completely do away with autoregression, prompt tokens can pay attention to generated tokens, so even the prompt tokens' KV vectors change at every step and you cannot cache anything.

        For this reason, models that generate text using diffusion typically generate blocks of tokens at a time: tokens within a block freely attend to each other, but across blocks there's causal masking, so each block only depends on the preceding ones and we're back to autoregression again. That makes caching possible, but it also means you still can't have diffusion change the beginning of a long text to match the end.
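
        A small illustrative sketch of that block-causal mask (block size chosen arbitrarily): positions attend bidirectionally within their own block but only causally across blocks, which is what lets completed blocks keep a valid KV cache.

        ```python
        import numpy as np

        # Block-causal attention mask: positions attend bidirectionally inside
        # their own block, but only causally across blocks, so the KV cache of
        # finished blocks stays valid.
        def block_causal_mask(seq_len, block_size):
            block_id = np.arange(seq_len) // block_size
            # mask[i, j] is True when position i may attend to position j
            return block_id[:, None] >= block_id[None, :]

        mask = block_causal_mask(seq_len=8, block_size=4)
        # Positions 0-3 see only block 0; positions 4-7 see blocks 0 and 1,
        # including "future" positions inside their own block.
        print(mask.astype(int))
        ```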

    • imtringued 6 hours ago

      A massive problem with current-generation LLMs is that they have a single, globally ordered context, and the model is only allowed to append to it.

      This is like having a single-tape Turing machine. It can simulate a multi-tape machine, but at O(n^2) complexity.

      The computation budget of an LLM is finite, so this has a massive practical impact.

      • cubefox an hour ago

        The article explains that this is not a problem but an advantage.

  • gen3 13 hours ago

    Incredibly cool work, and a great primer on diffusion