BERT Is Just a Single Text Diffusion Step

(nathan.rs)

235 points | by nathan-barry 4 hours ago ago

49 comments

  • kibwen 3 hours ago

    To me, the diffusion-based approach "feels" more akin to whats going on in an animal brain than the token-at-a-time approach of the in-vogue LLMs. Speaking for myself, I don't generate words one a time based on previously spoken words; I start by having some fuzzy idea in my head and the challenge is in serializing it into language coherently.

    • sailingparrot 2 hours ago

      > the token-at-a-time approach of the in-vogue LLMs. Speaking for myself, I don't generate words one a time based on previously spoken words

      Autoregressive LLMs don't do that either actually. Sure with one forward pass you only get one token at a time, but looking at what is happening in the latent space there are clear signs of long term planning and reasoning that go beyond just the next token.

      So I don't think it's necessarily more or less similar to us than diffusion, we do say one word at a time sequentially, even if we have the bigger picture in mind.

      • wizzwizz4 an hour ago

        If a process is necessary for performing a task, (sufficiently-large) neural networks trained on that task will approximate that process. That doesn't mean they're doing it anything resembling efficiently, or that a different architecture / algorithm wouldn't produce a better result.

        • jama211 31 minutes ago

          It also doesn’t mean they’re doing it inefficiently.

          • pinkmuffinere 17 minutes ago

            I read this to mean “just because the process doesn’t match the problem, that doesn’t mean it’s inefficient”. But I think it does mean that. I expect we intuitively know that data structures which match the structure of a problem are more efficient than those that don’t. I think the same thing applies here.

            I realize my argument is hand wavey, i haven’t defined “efficient“ (in space? Time? Energy?), and there are other shortcomings, but I feel this is “good enough” to be convincing

    • tripplyons 2 hours ago

      Here's a blog post I liked that explains a connection: https://sander.ai/2024/09/02/spectral-autoregression.html

      They call diffusion a form of "spectral autoregression", because it tends to first predict lower frequency features, and later predict higher frequency features.

    • cube2222 2 hours ago

      I will very often write a message on slack, only to then edit it 5 times… Now I always feel like a diffusion model when I do that.

    • crubier 2 hours ago

      You 100% do pronounce or write words one at a time sequentially.

      But before starting your sentence, you internally formulate the gist of the sentence you're going to say.

      Which is exactly what happens in LLMs latent space too before they start outputting the first token.

      • taeric 2 hours ago

        I'm curious what makes you so confident on this? I confess I expect that people are often far more cognizant of the last thing that the they want to say when they start?

        I don't think you do a random walk through the words of a sentence as you conceive it. But it is hard not to think people don't center themes and moods in their mind as they compose their thoughts into sentences.

        Similarly, have you ever looked into how actors learn their lines? It is often in a way that is a lot closer to a diffusion than token at a time.

        • bee_rider 21 minutes ago

          It must be the case that some smart people have studied how we think, right?

          The first person experience of having a thought, to me, feels like I have the whole thought in my head, and then I imagine expressing it to somebody one word at a time. But it really feels like I’m reading out the existing thought.

          Then, if I’m thinking hard, I go around a bit and argue against the thought that was expressed in my head (either because it is not a perfect representation of the actual underlying thought, or maybe because it turns out that thought was incorrect once I expressed it sequentially).

          At least that’s what I think thinking feels like. But, I am just a guy thinking about my brain. Surely philosophers of the mind or something have queried this stuff with more rigor.

        • CaptainOfCoit 2 hours ago

          I think there is a wide range of ways to "turn something in the head into words", and sometimes you use the "this is the final point, work towards it" approach and sometimes you use the "not sure what will happen, lets just start talking and go wherever". Different approaches have different tradeoffs, and of course different people have different defaults.

          I can confess to not always knowing where I'll end up when I start talking. Similarly, not every time I open my mouth it's just to start but sometimes I do have a goal and conclusion.

        • btown 2 hours ago

          > far more cognizant of the last thing that the they want to say when they start

          This can be captured by generating reasoning tokens (outputting some representation the desired conclusion in token form, then using it as context for the actual tokens), or even by an intermediate layer of a model not using reasoning.

          If a certain set of nodes are strong contributors to generate the concluding sentence, and they remain strong throughout all generated tokens, who's to say if those nodes weren't capturing a latent representation of the "crux" of the answer before any tokens were generated?

          (This is also in the context of the LLM being able to use long-range attention to not need to encode in full detail what it "wants to say" - just the parts of the original input text that it is focusing on over time.)

          Of course, this doesn't mean that this is the optimal way to build coherent and well-reasoned answers, nor have we found an architecture that allows us to reliably understand what is going on! But the mechanics for what you describe certainly can arise in non-diffusion LLM architectures.

        • Workaccount2 an hour ago

          People don't come up with things their brain does.

          Words rise from an abyss and are served to you, you have zero insight into their formation. If I tell you to think of an animal, one just appears in your "context", how it got there is unknown.

          So really there is no argument to be made, because we still don't mechanistically understand how the brain works.

          • aeonik 34 minutes ago

            We don't know exactly how consciousness works in the human brain, but we know way more than "comes from the abyss".

            When I read that text, something like this happens:

            Visual perception of text (V1, VWFA) → Linguistic comprehension (Angular & Temporal Language Areas) → Semantic activation (Temporal + Hippocampal Network) → Competitive attractor stabilization (Prefrontal & Cingulate) → Top-down visual reactivation (Occipital & Fusiform) → Conscious imagery (Prefrontal–Parietal–Thalamic Loop).

            and you can find experts in each of those areas who understand the specifics a lot more.

            • giardini 8 minutes ago

              aeonik says >"We don't know exactly how consciousness works in the human brain, but we know way more than "comes from the abyss"."<

              You are undoubtedly technically correct, but I prefer the simplicity, purity and ease-of-use of the abysmal model, especially in comparison with other similar competing models, such as the below-discussed "tarpit" model.

        • jrowen an hour ago

          They're speaking literally. When talking to someone (or writing), you ultimately say the words in order (edits or corrections notwithstanding). If you look at the gifs of how the text is generated - I don't know of anyone that has ever written like that. Literally writing disconnected individual words of the actual draft ("during," "and," "the") in the middle of a sentence and then coming back and filling in the rest. Even speaking like that would be incredibly difficult.

          Which is not to say that it's wrong or a bad approach. And I get why people are feeling a connection to the "diffusive" style. But, at the end of the day, all of these methods do build as their ultimate goal a coherent sequence of words that follow one after the other. It's just a difference of how much insight you have into the process.

        • refulgentis 2 hours ago

          It's just too far of an analogy, it starts in the familiar SWE tarpit of human brain = lim(n matmuls) as n => infinity.

          Then, glorifies wrestling in said tarpit: how do people actually compose sentences? Is an LLM thinking or writing? Can you look into how actors memorize lines before responding?

          Error beyond the tarpit is, these are all ineffable questions that assume a singular answer to an underspecified question across many bags of sentient meat.

          Taking a step back to the start, we're wondering:

          Do LLMs plan for token N + X, while purely working to output token N?

          TL;DR: yes.

          via https://www.anthropic.com/research/tracing-thoughts-language....

          Clear quick example they have is, ask it to write a poem, get state at end of line 1, scramble the feature that looks ahead to end of line 2's rhyme.

          • jsrozner an hour ago

            Let's just not call it planning.

            In order to model poetry autoregressively, you're going to need a variable that captures rhyme scheme. At the point where you've ended the first line, the model needs to keep track of the rhyme that was used, just like it does for something like coreference resolution.

            I don't think that the mentioned paper shows that the model engages in a preplanning phase in which it plans the rhyme that will come. In fact such would be impossible. Model state is present only in so-far-generated text. It is only after the model has found itself in a poetry generating context and has also selected the first line-ending word, that a rhyme scheme "emerges" as a variable. (Now yes, as you increase the posterior probability of 'being in a poem' given context so far, you would expect that you also increase the probability of the rhyme-scheme variable's existing.)

            • refulgentis an hour ago

              I’m confused: the blog shows they A) predict the end of line 2 using the state at the end of line 1 and B) can choose the end of line 2 by altering state at end of line 1.

              Might I trouble you for help getting from there to “such would be impossible”, where such is “the model…plans the rhyme to come”

              Edit: I’m surprised to be at -2 for this. I am representing the contents of the post accurately. Its unintuitive for sure, but, it’s the case.

      • smokel 2 hours ago

        For most serious texts I start with a tree outline, before I engage my literary skills.

      • froobius 2 hours ago

        (Just to expand on that, it's true not just the for the first token. There's a lot of computation, including potentially planning ahead, before each token outputted.)

        That's why saying "it's just predicting the next word", is a misguided take.

      • pessimizer 2 hours ago

        Like most people I jump back and forth when I speak, disclaiming, correcting, and appending to previous utterances. I do this even more when I write, eradicating entire sentences and even the ideas they contain, within paragraphs that which by the time they were finished the sentence seemed unnecessary or inconsistent.

        I did it multiple times while writing this comment, and it is only four sentences. The previous sentence once said "two sentences," and after I added this statement it was changed to "four sentences."

      • NoMoreNicksLeft an hour ago

        >You 100% do pronounce or write words one at a time sequentially.

        It's statements like these that make me wonder if I am the same species as everyone else. Quite often, I've picked adjectives and idioms first, and then fill in around them to form sentences. Often because there is some pun or wordplay, or just something that has a nice ring to it, and I want to lead my words in that direction. If you're only choosing them one at a time and sequentially, have you ever considered that you might just be a dimwit?

        It's not like you don't see this happening all around you in others. Sure you can't read minds, but have you never once watched someone copyedit something they've written, where they move phrases and sentences around, where they switch out words for synonyms, and so on? There are at least dozens of fictional scenes in popular media, you must have seen one. You have to have noticed hints at some point in your life that this occurs. Please. Just tell me that you spoke hastily to score internet argument points, and that you don't believe this thing you've said.

    • ma2rten 2 hours ago

      Interpretability research has found that Autoregressive LLMs also plan ahead what they are going to say.

      • thamer 2 hours ago

        The March 2025 blog post by Anthropic titled "Tracing the thoughts of a large language model"[1] is a great introduction to this research, showing how their language model activates features representing concepts that will eventually get connected at some later point as the output tokens are produced.

        The associated paper[2] goes into a lot more detail, and includes interactive features that help illustrate how the model "thinks" ahead of time.

        [1] https://www.anthropic.com/research/tracing-thoughts-language...

        [2] https://transformer-circuits.pub/2025/attribution-graphs/bio...

      • aidenn0 2 hours ago

        This seems likely just from the simple fact that they can reliably generate contextually correct sentences in e.g. German Imperfekt.

    • aabhay 2 hours ago

      The fact that you’re cognitively aware is evidence that this is nowhere near diffusion. More like rumination or thinking tokens, if we absolutely had to find a present day LLM metaphor

    • silveraxe93 2 hours ago

      That's why I'm very excited by Gemini diffusion[1].

      - [1] https://deepmind.google/models/gemini-diffusion/

    • HPsquared 2 hours ago

      Maybe it's two different modes of thinking. I can have thoughts that coalesce from the ether, but also sometimes string a thought together linearly. Brains might be able to do both.

    • dudu24 2 hours ago

      That is not contrary to token-at-a-time approach.

    • EGreg 2 hours ago

      I feel completely the opposite way.

      When you speak or do anything, you focus on what you’re going do next. Your next action. And at that moment you are relying on your recent memory, and things you have put in place while doing the overall activity (context).

      In fact what’s actually missing from AI currently is simultaneous collaboration, like a group of people interacting — it is very 1 on 1 for now. Like human conversations.

      Diffusion is like looking at a cloud and trying to find a pattern.

  • thatguysaguy 2 hours ago

    Back when BERT came out, everyone was trying to get it to generate text. These attempts generally didn't work, here's one for reference though: https://arxiv.org/abs/1902.04094

    This doesn't have an explicit diffusion tie in, but Savinov et al. at DeepMind figured out that doing two steps at training time and randomizing the masking probability is enough to get it to work reasonably well.

  • jaaustin 3 hours ago

    To my knowledge this connection was first noted in 2021 in https://arxiv.org/abs/2107.03006 (page 5). We wanted to do text diffusion where you’d corrupt words to semantically similar words (like “quick brown fox” -> “speedy black dog”) but kept finding that masking was easier for the model to uncover. Historically this goes back even further to https://arxiv.org/abs/1904.09324, which made a generative MLM without framing it in diffusion math.

  • blurbleblurble 23 minutes ago

    I'm more excited about approaches like this one:

    https://openreview.net/forum?id=c05qIG1Z2B

    They're doing continuous latent diffusion combined with autoregressive transformer-based text generation. The autoencoder and transformer are (or can be) trained in tandem.

  • briandw 2 hours ago

    I love seeing these simple experiments. Easy to read through quickly and understand a bit more of the principles.

    One of my stumbling blocks with text diffusers is that ideally you wouldn’t treat the tokens as discrete but rather probably fields. Image diffusers have the natural property that a pixel is a continuous value. You can smoothly transition from one color to another. Not so with tokens. In this case they just do a full replacement. You can’t add noise to a token, you have to work in the embedding space. But how can you train embeddings directly? I found a bunch of different approaches that have been tried but they are all much more complicated than the image based diffusion process.

  • notsylver an hour ago

    I've really wanted to fine tune an inline code completion model to see if I could get at all close to cursor (I can't, but it would be fun), but as far as I know there are no open diffusion models to use as a base, and especially not any that would be good as a base. Hopefully something comes out soon that is viable for it

  • alansaber 3 hours ago

    Interested in how this compares to electra

    • breadislove 3 hours ago

      or deberta but nevertheless super interesting!

  • zaptrem 3 hours ago

    When text diffusion models started popping up I thought the same thing as this guy (“wait, this is just MLM”) though I was thinking more MaskGIT. The only thing I could think of that would make it “diffusion” is if the model had to learn to replace incorrect tokens with correct ones (since continuous diffusion’s big thing is noise resistance). I don’t think anyone has done this because it’s hard to come up with good incorrect tokens.

  • BoiledCabbage 2 hours ago

    To me part of the appeal of image diffusion models was starting with random noise to produce an image. Why do text diffudion models start with a blank slate (ie all "masked" tokens), instead of with random tokens?

    • ttul an hour ago

      It depends on what you want the model to do for you. If you want the model to complete text, then you would provide the input text unmasked followed by a number of masked tokens that it's the model's job to fill in. Perhaps your goal is to have the model simply make edits to a bit of code. In that case, you'd mask out the part that it's supposed to edit and the model would iteratively fill in those masked tokens with generated tokens.

      One of the powerful abilities of text diffusion models is supposedly in coding. Auto-regressive LLMs don't inherently come with the ability to edit. They can generate instructions that another system interprets as editing commands. Being able to literally unmask the parts you want to edit is a pretty powerful paradigm that could improve if not just speed up many coding tasks.

      I suspect that elements of text diffusion will be baked into coding models like GPT Codex (if they aren't already). There's no reason you could not train a diffusion output head specifically designed for code editing and the same model is able to make use of that head when it makes the most sense to do so.

    • didibus 2 hours ago

      They don't all do that. There's many approaches being experimented on.

      Some start with random tokens, or with masks, others even start with random vector embeddings.

  • schopra909 3 hours ago

    Very cool parallel. Never thought about it this way — but makes complete sense

  • nodja an hour ago

    I think another easy improvement to this diffusion model would be for the logprobs to also affect the chance of a token being turned into a mask. So higher confidence tokens should have less of a chance to be pruned, should converge faster. I wonder if backprop would be able exploit that. (I'm not an ML engineer).

  • rafaelero 2 hours ago

    The problem with this approach to text generation is that it's still not flexible enough. If during inference the model changes its mind and wants to output something considerably different it can't because there are too many tokens already in place.

    • nodja an hour ago

      That's not true, you could just have looked at the first gif animation in the OP and seen that tokens disappear, the only part that stays untouched is the prompt, adding noise is part of the diffusion process and the code that does it is even posted in the article (ctrl+f "def diffusion_collator").

    • didibus 2 hours ago

      Could maybe be solved by reintroducing noise steps in between denoising step?

  • skeptrune 3 hours ago

    Fun writeup! It's amazing how flexible an architecture can be to different objectives.