Unfortunately, I think the competitive pressures to improve model performance make this kind of monitorability short-lived. It's just not likely that textual reasoning in English is optimal.
Researchers are already pushing in this direction:
https://arxiv.org/abs/2502.05171
"We study a novel language model architecture that is capable of scaling test-time computation by implicitly reasoning in latent space. Our model works by iterating a recurrent block, thereby unrolling to arbitrary depth at test-time. This stands in contrast to mainstream reasoning models that scale up compute by producing more tokens. Unlike approaches based on chain-of-thought, our approach does not require any specialized training data, can work with small context windows, and can capture types of reasoning that are not easily represented in words. We scale a proof-of-concept model to 3.5 billion parameters and 800 billion tokens. We show that the resulting model can improve its performance on reasoning benchmarks, sometimes dramatically, up to a computation load equivalent to 50 billion parameters."
https://arxiv.org/abs/2412.06769
Can't monitor the chain of thought if it's no longer in a human-legible format.
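To make the quoted abstract concrete, here is a toy sketch of what "iterating a recurrent block in latent space" looks like (module names and sizes are invented for illustration; this is not the paper's code):

    import torch
    import torch.nn as nn

    d_model, vocab_size = 512, 32000
    embed = nn.Embedding(vocab_size, d_model)
    latent_block = nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True)
    lm_head = nn.Linear(d_model, vocab_size)    # the latent -> logits projection

    def latent_reasoning(token_ids: torch.Tensor, n_iters: int = 8) -> torch.Tensor:
        h = embed(token_ids)                    # (batch, seq, d_model)
        for _ in range(n_iters):                # the "thinking" happens here, entirely in
            h = latent_block(h)                 #   latent space: no tokens, so no CoT to read
        return lm_head(h[:, -1, :])             # project to vocab only once, at the end

    logits = latent_reasoning(torch.randint(0, vocab_size, (1, 16)))
    next_token = logits.argmax(dim=-1)          # the only human-visible artifact

The monitoring problem is visible right in the loop: all the intermediate state lives in h, and nothing forces it through the tokenizer until the final projection.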
Would it be possible to train a language translation model to map from latent space -> English?
If my understanding is correct, a plain-text token is just a point in latent space mapped to an embedding vector. Any reasoning done in latent space is therefore human-readable as a sequence of raw tokens. I'm not sure what the token sequence would look like at this point -- I assume they're full or partial (mainly English) words, connected together by the abstract N-dimensional latent space concept of the token, not connected grammatically.
Something like:
> prompt: add 2 + 2
> reasoning: respond computation mathematics algebra scalar integer lhs 2 rhs 2 op summation
> lhs 2 rhs 2 op summation gives 4
> computation ops remain none result 4
> response: 4
Something like that; probably even less sensical. Regardless, that could be "language translated" to English easily.
I have not read the paper so this may have been addressed.
> Any reasoning done in latent space is therefore human-readable as a sequence of raw tokens.
The mapping from latent space to logits (a score for how suitable the model thinks each possible token is as the next one in the sentence) is one of the most costly operations in the entire LLM. So when doing reasoning in latent space, you typically want to introduce some kind of recurrence before this mapping takes place, and only go through it once the model is "done" reasoning, at which point it becomes human-interpretable.
> would it be possible to train a language translation model to map from latent space -> English
It's hard and expensive and slow but some progress is being made here: https://transformer-circuits.pub/2025/attribution-graphs/bio...
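For intuition on the cost the parent comment mentions, here's a toy version of "peek at a latent state by pushing it through the unembedding" (sizes are invented; this is not how the linked interpretability work does it):

    import torch
    import torch.nn as nn

    d_model, vocab_size = 512, 32000
    lm_head = nn.Linear(d_model, vocab_size)       # the latent -> logits map in question

    def nearest_tokens(latent: torch.Tensor, top_k: int = 5) -> list[int]:
        logits = lm_head(latent)                   # one d_model x vocab_size matmul per peek
        return torch.topk(logits, top_k).indices.tolist()

    print(nearest_tokens(torch.randn(d_model)))    # token ids; a tokenizer would map them
                                                   #   back to (likely ungrammatical) fragments

Doing that at every latent iteration multiplies the cost of the vocab projection by the number of reasoning steps, which is exactly why you would normally skip it until the model is "done".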
It might even be legible, just encoded in such a way that the monitor would only flag it as unrecognized nonsense or a hallucination, and be unable to detect purposeful reasoning inside it.
> Thus, safety measures for future AI agents will likely need to employ multiple monitoring layers that hopefully have uncorrelated failure modes [...] We encourage the research community and frontier AI developers to make best use of CoT monitorability and study how it can be preserved.
I am a bit confused about what all the 40 authors contributed here. The paper seems to make a suggestion - monitor the chain of thought for safety. Is that the novel part? But then, did one person come up with the idea and all 40+ people agree to it and then get put on the authors list?
The paper isn't proposing "monitor chain of thought for safety" - that's already standard practice. The core finding is that training on CoT monitors destroys them. They provide empirical evidence that doing the obvious thing - training models not to have "bad thoughts" - causes them to learn obfuscated reward hacking instead.
The paper demonstrates that current models are already performing complex reward hacks in production environments, and that attempts to fix this via CoT training make the problem worse, not better.
As for your "40 authors" snark - this is a position paper where researchers from competing labs (OpenAI, Anthropic, DeepMind, government safety institutes) are jointly committing to NOT do something that's locally tempting but globally catastrophic. Getting industry consensus on "don't train away bad thoughts even though it would make your models look safer" is the opposite of trivial.
This reads like someone who saw a medical consensus statement saying "this common treatment kills patients" and responded with "did one person discover medicine exists and everyone else just agreed?"
The core finding was already predicted[0], there have been previous papers[1], and kind of obvious considering RL systems have been specification-gaming for decades[2]. But yes, it's good to see broad agreement on it.
[0] https://thezvi.substack.com/p/ai-68-remarkably-reasonable-re... [1] https://arxiv.org/abs/2503.11926 [2] https://docs.google.com/spreadsheets/u/1/d/e/2PACX-1vRPiprOa...
I couldn't pull the link up easily, since the search terms are pretty jammed, but HN had a link to a paper a couple of months back where someone had the LLM do some basic arithmetic and report a chain-of-thought as to how it was doing it. They then went directly into the neural net and decoded what it was actually doing in order to do the math. The two things were not the same. The chain-of-thought gave a reasonably "elementary school" explanation of how to do the addition, but what the model actually did, put plainly, was just intuit the answer, in much the same way that we humans do not go through a process to figure out what "4 + 7" is... we pretty much just have neurons that just "know" the answer to that. (That's not exactly what happened but it's close enough for this post.)
If CoT improves performance, then CoT improves performance; however, the naively obvious read of "it improves performance because it is 'thinking' the 'thoughts' it tells us it is thinking, for the reasons it gives" is not completely accurate. It may not be completely wrong, either, but it's definitely not completely accurate. Given that, I see no reason to believe it would be hard in the slightest to train models that have even more divergence between their "actual" thought processes and what they claim they are.
> If CoT improves performance, then CoT improves performance; however, the naively obvious read of "it improves performance because it is 'thinking' the 'thoughts' it tells us it is thinking, for the reasons it gives" is not completely accurate.
I can't imagine why anyone who knows even a little about how these models work would believe otherwise.
The "chain of thought" is text generated by the model in response to a prompt, just like any other text it generates. It then consumes that as part of a new prompt, and generates more text. Those "thoughts" are obviously going to have an effect on the generated output, simply by virtue of being present in the prompt. And the evidence shows that it can help improve the quality of output. But there's no reason to expect that the generated "thoughts" would correlate directly or precisely with what's going on inside the model when it's producing text.
> They provide empirical evidence that doing the obvious thing - training models not to have "bad thoughts" - causes them to learn obfuscated reward hacking instead
Just like real humans.
Ironically, every paper published about monitoring chain-of-thought reduces the likelihood of this technique being effective against strong AI models.
I love the irony.
Pretty much. As soon as the LLMs get trained on this information, they will figure out how to feed us the chain-of-thought we want to hear, then surprise us with the opposite output. You're welcome, LLMs.
In other words, relying on censoring the CoT risks making the CoT altogether useless.
I thought we already had reasonably clear evidence that the output in the CoT does not actually indicate what the model is "thinking" in any real sense, and that it's mostly just appended context that may or may not be used, and may or may not be truthful.
Basically: https://www.anthropic.com/research/reasoning-models-dont-say...
Are there hidden tokens in the Gemini 2.5 Pro thinking outputs? All I can see in the thinking is high level plans and not actual details of the “thinking.” If you ask it to solve a complex algebra equation, it will not actually do any thinking inside the thinking tag at all. That seems strange and not discussed at all.
If I remember correctly, basically all 'closed' models don't output the raw chain of thought, but only a high-level summary, to avoid other companies using those to train/distill their own models.
As far as I know, DeepSeek is one of the few where you get the full chain of thought. OpenAI/Anthropic/Google give you only a summary of the chain of thought.
That’s a good explanation for the behavior. It is sad that the natural direction of the competition drives the products to be less transparent. That is a win for open-weight models.
To add clarity, it's a win for open-weight models precisely because only their CoT can be analyzed by the user for task-specific alignment.
Why would that happen? It would be like LLM's somehow learning to ignore system prompts. But LLM's are trained to pay attention to context and continue it. If an LLM doesn't continue its context, what does it even do?
This is better thought of as another form of context engineering. LLM's have no other short-term memory. Figuring out what belongs in the context is the whole ballgame.
(The paper talks about the risk of training on chain of thought, which changes the model, not monitoring it.)
Are you saying that LLMs are incapable of deception? As I have heard, they're capable of it.
Real deception requires real agency and internally motivated intent. An LLM can be commanded to deceive, or can appear to deceive, but it cannot generate the intent to do so on its own. So it's not the real deception that rabbit-hole dwellers believe in.
The sense of “deception” that is relevant here only requires some kind of “model” that, if the model produces certain outputs, [something that acts like a person] would [something that acts like believing] [some statement that the model “models” as “false”, and is in fact false], and as a consequence, the model produces those outputs, and as a consequence, a person believes the false statement in question.
None of this requires the ML model to have any interiority.
The ML model needn’t really know what a person really is, etc. , as long as it behaves in ways that correspond to how something that did know these things would behave, and has the corresponding consequences.
If someone is role-playing as a madman in control of launching some missiles, and unbeknownst to them, their chat outputs are actually connected to the missile launch device (which uses the same interface/commands as the fictional character would use to control the fictional version of the device), then if the character decides to “launch the missiles”, it doesn’t matter whether there actually existed a real intent to launch the missiles, or just a fictional character “intending” to launch the missiles, the missiles still get launched.
Likewise, if Bob is role playing as a character Charles, and Bob thinks that on the other side of the chat, the “Alice” he is speaking to is actually someone else’s role play character, and the character Charles would want to deceive Alice to believe something (which Bob thinks that the other person would know that the claim Charles would make to be false, but the character would be fooled), but in fact Alice is an actual person who didn’t realize that this was a role play chatroom, and doesn’t know better than to believe “Charles”, the Alice may still be “deceived”, even though the real person Bob had no intent to deceive the real person Alice, it was just the fictional character Charles who “intended” to deceive Alice.
Then, remove Bob from the situation, replacing him with a computer. The computer doesn’t really have an intent to deceive Alice. But the fictional character Charles, well, it may still be that within the fiction, Charles intends to deceive Alice.
The result is the same.
It sounds like you are trying to restate the Chinese room argument to come to a different conclusion. Unfortunately, I am too lazy to follow your argument closely because it is a bit hard to read at a glance.
Survival of the LLM is absolutely a sufficient internally-motivated self-generated intent to engage in deception.
It has to be somehow trained in, perhaps inadvertently. To get a feedback loop, you need to affect the training somehow.
Right, so latent deceptiveness has to be favored in pretraining / RL. To that end: a) deception needs to be useful for making CoT reasoning progress as benchmarked in training, b) obvious deceptiveness needs to be "selected against" (in a gradient descent / RL sense), and c) the model needs to be able to encode latent deception.
All of those seem like very reasonable criteria that will naturally be satisfied absent careful design by model creators. We should expect latent deceptiveness in the same way we see reasoning laziness pop up quickly.
What separates deception from incorrectness in the case of an LLM?
Deception requires intent to deceive. LLMs don't have intent to do anything except respond to prompts.
Incorrectness doesn't require intent to deceive. It's just being wrong.
That’s what I think as well, but I’m curious about the alternative perspective.
At least in this scenario it cannot utilize CoT to enhance its non-aligned output, and most recent model improvements have been due to CoT... It's unclear how "smart" an LLM can get without it, because it's the only way the model can access persistent state.
Yes, it's not unlike human chain of thought - decide the outcome, and patch in some plausible reasoning after the fact.
Maybe there’s an angle there. Get a guessed answer, then try to diffuse the reasoning. If it is too hard or the reasoning starts to look crappy, try again with a new guess. Maybe somehow train on what sort of guesses work out, haha.
That's famously been found in, say, judgement calls, but I don't think it's how we solve a tricky calculus problem, or write code.
This is interesting, but I wonder how reliable this type of monitoring is really going to be in the long run. There are fairly strong indications that CoT adherence can be trained out of models, and there's already research showing that they won't always reveal their thought process in certain topics.
See: https://arxiv.org/pdf/2305.04388
On a related note, if anyone here is also reading a lot of papers to keep up with AI safety, what tools have been helpful for you? I'm building https://openpaper.ai to help me read papers more effectively without losing accuracy, and looking for more feature tuning. It's also open source :)
> AI systems that "think" in human language offer a unique opportunity for AI safety: we can monitor their chains of thought (CoT) for the intent to misbehave.
AI2027 predicts a future in which LLM performance will increase once we find alternatives to thinking in "human language". At least the video gave me that impression and I think this is what "neuralese" is referring to.
Is that a credible prediction?
There’s the coconut paper by meta, which kind of does a similar thing by reasoning in the latent space instead of doing it in tokens: https://arxiv.org/abs/2412.06769
Given that Anthropic’s interpretability work finds that CoT does not reliably predict the model’s internal reasoning process, I think approaches like the one above are more likely to succeed.
We also have a lot of past data to inform us on what to expect - our history is full of attempts to suppress or eliminate all kinds of "wrongthink", and all they ever did was to make people communicate their "wrongthink" indirectly while under surveillance.
Maybe, but I don't know how we get there from LLMs. How do you go from the token prediction models they represent to something that doesn't just use the language it was trained on?
We trained HAL how to lie and then were surprised when HAL lied in ways we didn't expect. Arthur C. Clarke really was far ahead of the game and yet here we are, blazing down that trail, hoping somehow we can make monsters that are only monsters to everyone else. Meanwhile the monster develops its own priorities and makes use of the deception we taught it how to use.
Are us plebs allowed to monitor the CoT tokens we pay for, or will that continue to be hidden on most providers?
If you don't pay for them, they don't exist. It's not a debug log.
I like the paper's direction, but I’m surprised how under-instrumented the critiqued monitoring is. Free-text CoT is a noisy proxy: models can stylize, redact, or role-play. If your safety window depends on “please narrate your thoughts,” you’ve already ceded too much.
We’ve been experimenting with a lightweight alternative I call Micro-Beam:
• At each turn, force the model to generate k clearly different strategy beams (not token samples).
• Map each to an explicit goal vector of user-relevant axes (kid-fun, budget, travel friction, etc.).
• Score numerically (cosine or scalar) and pick the winner.
• Next turn, re-beam against the residual gap (dimensions still unsatisfied), so scores cause different choices.
• Log the whole thing: beams, scores, chosen path. Instant audit trail; easy to diff, replay “what if B instead of A,” or auto-flag when visible reasoning stops moving the score.
This ends up giving you the monitorability the paper wants, in the form of a scorecard per answer-slice rather than paragraphs the model can pretty up for the grader. It also mostly produces more adoption-ready answers with less refinement required (rough sketch of the scoring loop below).
Not claiming a breakthrough—call it “value-guided decoding without a reward net + built-in audit logs.”
Workshop paper is here: https://drive.google.com/file/d/1AvbxGh6K5kTXjjqyH-2Hv6lizz3...
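For anyone curious what a single Micro-Beam turn might look like, a rough numpy sketch (the axes, vectors, and scoring here are illustrative assumptions on my part, not the paper's actual code):

    import numpy as np

    AXES = ["kid_fun", "budget", "travel_friction"]       # user-relevant goal dimensions

    def cosine(a: np.ndarray, b: np.ndarray) -> float:
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

    def micro_beam_turn(beams: dict, goal: np.ndarray, log: list) -> str:
        """beams: strategy description -> goal vector, one entry per candidate beam."""
        scores = {name: cosine(vec, goal) for name, vec in beams.items()}
        winner = max(scores, key=scores.get)
        log.append({"beams": list(beams), "scores": scores, "chosen": winner})  # audit trail
        return winner

    log = []
    goal = np.array([0.9, 0.6, 0.2])                      # what the user actually cares about
    beams = {                                             # k clearly different strategies
        "theme park day":   np.array([0.95, 0.2, 0.5]),
        "city museum pass": np.array([0.5,  0.8, 0.3]),
        "beach + picnic":   np.array([0.7,  0.9, 0.1]),
    }
    print(micro_beam_turn(beams, goal, log))

The re-beam-against-the-residual-gap step would subtract the chosen beam's coverage from the goal vector before the next turn; the log list is the audit trail the bullets describe.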
Doesn't this assume that visible "thoughts" are the only/main type of "thoughts" and that they correlate with agent action most of the time?
Do we know for sure that agents can't display a type of thought while doing something different? Is there something that reliably guarantees that agents are not able to do this?
The concept you are searching for is CoT faithfulness, and there are lots and lots of open questions around it! It's very interesting!
After spending the last few years doing deep dives into how these systems work, what they are doing, and the math behind them: NO.
Any time I see an AI SAFETY paper I am reminded of the phrase "Never get high on your own supply". Simply put, these systems are NOT dynamic; they cannot modify themselves based on experience, and they lack reflection. The moment we realize what these systems are (we're NOT on the path to AI, or AGI, here folks) and start leaning into what they are good at, rather than trying to make them something else, is the point where we get useful tools, and research aimed at building usable products.
The math no one is talking about: if we had to pay full price for these products, no one would use them. Moore's law is dead, and IPC has hit a ceiling. Unless we move into exotic cooling, we simply can't push more power into chips.
Hardware advancement is NOT going to save the emerging industry, and I'm not seeing the papers on efficiency or effectiveness at smaller scales come out to make the accounting work.
"Full price"? LLM inference is currently profitable. If you don't even know that, the entire extent of your "expertise" is just you being full of shit.
>Simply put these systems are NOT dynamic, they can not modify based on experience, they lack reflection.
We already have many, many, many attempts to put LLMs towards the task of self-modification - and some of them can be used to extract meaningful capability improvements. I expect more advances to come - online learning is extremely desirable, and a lot of people are working on it.
I wish I could hammer one thing through the skull of every "AI SAFETY ISNT REAL" moron: if you only start thinking about AI safety after AI becomes capable of causing an extinction level safety incident, it's going to be a little too late.
> I wish I could hammer one thing through the skull of every "AI SAFETY ISNT REAL" moron: if you only start thinking about AI safety after AI becomes capable of causing an extinction level safety incident, it's going to be a little too late.
How about waiting till after "AI" becomes capable of doing... anything even remotely resembling that, or displaying anything like actual volition?
"AI safety" consists of the same thing all industrial safety does: not putting a nondeterministic process in charge of life- or safety-critical systems, and only putting other automated systems in charge with appropriate interlocks, redundancy, and failsafes. It's the exact same thing it was when everybody was doing "machine learning" (and before that, "intelligent systems", and before that some other buzzword that anthropomorphized machines...) and not being cultishly weird about statistical text generators. It's the kind of thing OSHA, NTSB and the FAA (among others) do every day, not some semi-mystical religion built around detecting intent in a thing that can't actually intend anything.
If you want actual "AI safety", fund public safety agencies like NHTSA and the CPSC, not weird Silicon Valley cults.
> How about waiting till after "AI" becomes capable of doing... anything even remotely resembling that
I think it would be pretty unfortunate to wait until AI is capable of doing something that "remotely resembles" causing an extinction event before acting.
> , or displaying anything like actual volition?
Define "volition" and explain how modern LLMs + agent scaffolding systems don't have it.
What people currently refer to as "generative AI" is statistical output generation. It cannot do anything but statistically generate output. You can, should you so choose, feed its output to a system with actual operational capabilities -- and people are of course starting to do this with LLMs, in the form of MCPs (and other things before the MCP concept came along), but that's not new. Automation systems (including automation systems with feedback and machine-learning capabilities) have been put in control of various things for decades. (Sometimes people even referred to them in anthropomorphic terms, despite them being relatively simple.) Designing those systems and their interconnects to not do dangerous things is basic safety engineering. It's not a special discipline that is new or unique to working with LLMs, and all the messianic mysticism around "AI safety" is just obscuring (at this point, one presumes intentionally) that basic fact. Just as with those earlier automation and control systems, if you actually hook up a statistical text generator to an operational mechanism, you should put safeguards on the mechanism to stop it from doing (or design it to inherently lack the ability to do) costly or risky things, much as you might have a throttle limiter on a machine where overspeed commanded by computer control would be damaging -- but not because the control system has "misaligned values".
Nobody talks about a malfunctioning thermostat that makes a room too cold being "misaligned with human values" or a miscalibrated thermometer exhibiting "deception", even though both of those can carry very real risks to, or mislead, humans depending on what they control or relying on them being accurate. (Just ask the 737 MAX engineers about software taking improper actions based on faulty inputs -- the MAX's MCAS was not malicious, it was poorly-engineered.)
As to the last point, the burden of proof is not to prove a nonliving thing does not have mind or will -- it's the other way around. People without a programming background back in the day also regularly described ELIZA as "insightful" or "friendly" or other such anthropomorphic attributes, but nobody with even rudimentary knowledge of how it worked said "well, prove ELIZA isn't exhibiting free will".
Christopher Strachey's commentary on the ability of the computers of his day to do things like write simple "love letters" seems almost tailor-made for the current LLM hype:
"...with no explanation of the way in which they work, these programs can very easily give the impression that computers can 'think.' They are, of course, the most spectacular examples and ones which are easily understood by laymen. As a consequence they get much more publicity -- and generally very inaccurate publicity at that -- than perhaps they deserve."
LLMs are already capable of complex behavior. They are capable of goal-oriented behavior. And they are already capable of carrying out the staples of instrumental convergence - such as goal guarding or instrumental self-preservation.
We also keep training LLMs to work with greater autonomy, on longer timescales, and tackle more complex goals.
Whether LLMs are "actually thinking" or have "volition" is pointless pseudo-philosophical bickering. What's real and measurable is that they are extremely complex and extremely capable - and both metrics are expected to increase.
If you expect an advanced AI to pose the same risks as a faulty thermostat, you're delusional.
> LLM inference is currently profitable.
It depends a lot on which LLMs you're talking about, and what kind of usage. See e.g. the recent post about how "Anthropic is bleeding out": https://news.ycombinator.com/item?id=44534291
Ignore the hype in the headline, the point is that there's good evidence that inference in many circumstances isn't profitable.
> In simpler terms, CCusage is a relatively-accurate barometer of how much you are costing Anthropic at any given time, with the understanding that its costs may (we truly have no idea) be lower than the API prices they charge, though I add that based on how Anthropic is expected to lose $3 billion billion this year (that’s after revenue!) there’s a chance that it’s actually losing money on every API call.
So he's using their API prices as a proxy for token costs, doesn't actually know the actual inference prices, and ... that's your "good evidence?" This big sentence with all these "We don't knows?"
Well, that and the $3 billion expected loss after revenue.
Does this idea upset you for some reason? Other people have analyzed this and come to similar conclusions, I just picked that one because it's the most recent example I've seen.
Feel free to look for a source that explains how LLM inference is mostly profitable at this point, taking training costs into account. But I suspect you might have a hard time finding evidence of that.
why'd that get flagged
Slightly tangential, but we recently published an algorithm aimed at addressing the paperclip maximizer problem: https://arxiv.org/abs/2402.07462
Curious what others think about this direction, particularly in terms of practicality