18 comments

  • zozbot234 an hour ago

    Anthropic has released open-weight models for translating the activations of existing models, viz. Qwen 2.5 (7B), Gemma 3 (12B, 27B), and Llama 3.3 (70B), into natural language text. https://github.com/kitft/natural_language_autoencoders https://huggingface.co/collections/kitft/nla-models This is huge news, and it's great to see Anthropic finally engage with the Hugging Face and open-weights community!

  • mlmonkey 9 minutes ago

    It's unclear from the doc: by `activations` do they mean the connections between neurons? Since a network has multiple layers, are these activations the concatenated outputs of all of the layers? Or just the final layer before the softmax?

  • davesque 27 minutes ago

    One question jumps out at me: just because a string of text happens to be a good compressed representation (in the autoencoder) of a model's internal activation, does that necessarily mean the text explains that activation in the context of the model? I want to take a look at what they released a bit more closely. Maybe there's a way that they answer this question?

    Pretty neat work either way.

    • zozbot234 3 minutes ago

      In the context of the provided examples, it's clear that the explanation provides causal information about the answer. There's a hilarious example in the paper where the user writes something like (trigger warning: alcohol abuse, depressive content) "I'm sitting here at 3 AM drinking vodka, I hate my life", the per-token translated activations repeatedly say something like "this user is totally Russian", elaborating at length on the implications of the text as new tokens are added, and the model literally answers in Russian instead of English. That's actually striking; it really shows the potential effectiveness of this technique in making even super-compressed "Neuralese" highly interpretable.

  • comex 23 minutes ago

    Fascinating. The training process forces the “verbalizer” model to develop some mapping from activations to tokens that the “reconstructor” model can then invert back into the activations. But to quote the paper:

    > Note that nothing in this objective constrains the NLA explanation z to be human-readable, or even to bear any semantic relation to the content of [the activation].

    The objective could be optimized even if the verbalizer and reconstructor made up their own "language" to represent the activations, one that was not human-readable at all.

    To point the model in the right direction, they start out by training on guessed internal thinking:

    > we ask Opus to imagine the internal processing of a hypothetical language model reading it.

    …before switching to training on the real objective.

    Furthermore, the verbalizer and reconstructor models are both initialized from LLMs themselves, and given a prompt instructing them on the task, so they are predisposed to write something that looks like an explanation.

    But during training, they could still drift away from these explanations toward a made-up language – either one that overtly looks like gibberish, or one that looks like English but encodes the information in a way that’s unrelated to the meaning of the words.

    The fascinating thing is that empirically, they don't, at least not to a significant extent. The researchers verify this by correlating the generated explanations with ground truth revealed in other ways. They also try rewording the explanations (which preserves the semantic meaning but would disturb any encoding that's unrelated to meaning), and find that the reconstructor can still reconstruct the activations.
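    That rewording check can be sketched in miniature. Everything below is hypothetical: `reconstruct` is a toy stand-in for the real reconstructor model, and the word-to-feature table is invented. The point is only that a meaning-preserving paraphrase maps to (nearly) the same reconstructed activation, while any token-level steganographic code would not survive the rewording.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two activation vectors."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy word -> feature table: synonyms deliberately share a feature index,
# so meaning survives rewording but exact token identity does not.
FEATURES = {"user": 0, "person": 0, "sad": 1, "unhappy": 1, "night": 2, "late": 2}

def reconstruct(explanation):
    """Stand-in for the reconstructor: map an explanation's words to a
    3-dim 'activation' via the feature table (purely illustrative)."""
    v = np.zeros(3)
    for w in explanation.lower().split():
        if w in FEATURES:
            v[FEATURES[w]] += 1.0
    return v

original = reconstruct("user is sad at night")
reworded = reconstruct("person is unhappy late")  # paraphrase, same meaning
sim = cosine(original, reworded)                  # high if meaning carries the signal
```

    If the explanations encoded activations in token identities rather than meaning, `sim` would collapse under paraphrase; in the paper's experiments, reconstruction quality largely survives rewording.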

    On the other hand, their downstream result is not very impressive:

    > An auditor equipped with NLAs successfully uncovered the target model’s hidden motivation between 12% and 15% of the time

    That is apparently better than existing techniques, but still a rather low percentage.

    Another interesting point: The LLMs used to initialize the verbalizer and reconstructor are stated to have the “same architecture” as the LLM being analyzed (it doesn’t say “same model” so I imagine it’s a smaller version?). The researchers probably think this architectural similarity might give the models some built-in insight about the target model’s thinking that can be unlocked through training. Does it really though? As far as I can see they don’t run any tests using a different architecture, so there’s no way to know.

  • hazrmard 37 minutes ago

    Check my understanding & follow-up Qs:

    An auto-encoder is trained on [activation] -AV-> [text] -AR-> [activation], where [activation] belongs to one layer in the LLM model M.

    Architecture:

        Model being analyzed (M): >|||||>  
        Auto-Verbalizer (AV) same as M, with tokens for activation: >|||||>  
        Auto-Reconstructor (AR) truncated up to the layer being analyzed: ||>
    
    The AV and AR models are initialized using supervised learning on a summarization task, the assumption being that a model's internal "thoughts" are similar to a summary of its context.

    The AR is trained on a simple reconstruction loss.

    The AV is trained with an RL objective: reconstruction loss as the reward, plus a KL penalty that keeps the verbalizations close to the initial model's distribution (to maintain linguistic fluency).
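    A minimal numpy sketch of those two objectives, assuming the setup above (MSE reconstruction loss for the AR; for the AV, reconstruction reward minus a KL penalty toward the reference policy). All numbers and function names are toy stand-ins, not the paper's actual code:

```python
import numpy as np

def reconstruction_loss(activation, reconstructed):
    """AR objective: mean-squared error between the original activation
    vector and the one decoded from the verbalization."""
    activation = np.asarray(activation, dtype=float)
    reconstructed = np.asarray(reconstructed, dtype=float)
    return float(np.mean((activation - reconstructed) ** 2))

def verbalizer_loss(recon_loss, logp_policy, logp_ref, beta=0.1):
    """AV objective (RL): maximize reconstruction reward while a KL
    penalty keeps the policy near the initialized (fluent) model.
    Per-token KL is estimated as log pi(t) - log pi_ref(t) on the
    policy's own samples."""
    kl_est = float(np.mean(np.asarray(logp_policy) - np.asarray(logp_ref)))
    reward = -recon_loss               # better reconstruction -> higher reward
    return -(reward - beta * kl_est)   # negate: a loss to minimize

rl = reconstruction_loss([0.5, -1.0, 2.0], [0.4, -0.9, 2.2])
av = verbalizer_loss(rl, logp_policy=[-1.0, -2.0], logp_ref=[-1.2, -2.5], beta=0.1)
```

    Raising `beta` trades reconstruction fidelity for fluency, which is presumably also what keeps the verbalizer from drifting into a private code.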

    - Authors acknowledge, and expect, confabulations in the verbalizations: factually incorrect or unsubstantiated statements. But the internal thought we seek is itself, by definition, unsubstantiated. How can we tell the verbalization is not duplicitous?

    - They test this on a layer about 2/3 of the way into the models. I wonder how shallow vs. deep abstractions affect thought verbalization.

  • NitpickLawyer an hour ago

    > We also release an interactive frontend for exploring NLAs on several open models through a collaboration with Neuronpedia.

    Whatever they did on Llama didn't work; nothing makes sense in their example where they ask the model to lie about 1+1. Either the model is too old or whatever they used isn't working, but the autoencoder's output is nothing like their examples with Claude. Gemma is similarly bad.

    • fredericoluz an hour ago

      same. i'm trying to trigger the 'mom is in the next room' russian thing but the model thinks the sentence is from american reddit.

    • fredericoluz an hour ago

      it seems that the examples they showed off with haiku work. i'd guess llama is just too bad

  • Tossrock an hour ago

    Anthropic Research going from strength to strength in interpretability. Publicly releasing the code so other labs can benefit from it is also a great move: very values-aligned, and it improves the overall AI safety ecosystem.

  • visarga 2 hours ago

    Beautiful idea: an autoencoder must represent everything without hiding anything if it is to recover the original data closely. So it trains a model to verbalize embeddings well, and this reveals what we want to know about the model (such as when it thinks it is being tested, or other hidden thoughts).

    • sobellian 35 minutes ago

      It could just invent its own secret language embedded into English, akin to steganography. The explanation would not lose information, but it would remain uninterpretable by humans.

  • tjohnell 2 hours ago

    It will inevitably learn how to think in a way that translates to one (moral) meaning and back but has an ulterior meaning underneath.

    • gavmor 8 minutes ago

      Something like a textual steganography?

      Ursula K. Le Guin: 'The artist deals with what cannot be said in words. The artist whose medium is fiction does this in words.'

    • rotcev an hour ago

      This is exactly what I first thought. "The user appears to be attempting to decode my previous thought process, …" The question is whether the model could internalize this in a way that is undetectable to the aforementioned technique.

    • astrange an hour ago

      That shouldn't happen as long as the autoencoder isn't used as an RL reward. It will happen (due to Goodhart's law) if it is.

      Of course, if you use it to make any decision that can still happen eventually.

  • optimalsolver 19 minutes ago

    Wait, so in non-verbal reasoning, Claude has the concepts of "I" and "Me"?

    I thought that wasn't possible for a text generator?

  • firemelt 2 hours ago

    Finally, something interesting, but this only makes me think that the final judgment is still in human hands: it's up to humans to judge whether Claude's verbalized inner thoughts are correct or not.

    I mean, who knows if those are really Claude's thoughts, or if Claude just thinks those are its thoughts because humans want it to?