How does misalignment scale with model intelligence and task complexity?

(alignment.anthropic.com)

207 points | by salkahfi 13 hours ago ago

57 comments

The comments so far seem focused on taking a cheap shot, but as somebody working on using AI to help people with hard, long-term tasks, it's a valuable piece of writing.

- It's short and to the point

- It's actionable in the short term (make sure the tasks per session aren't too difficult) and useful for researchers in the long term

- It's informative on how these models work, informed by some of the best in the business

- It gives us a specific vector to look at, clearly defined ("coherence", or, more fun, "hot mess")

[-]

kernc 11 hours ago

Other actionable insights are:

- Merge amendments up into the initial prompt.

- Evaluate prompts multiple times (ensemble).

[-]

sandos 2 hours ago

Sometimes when I was stressed, I have used several models to verify each others´ work. They usually find problems, too!

This is very useful for things that take time to verify, we have CI stuff that takes 2-3 hours to run and I hate when those fails because of a syntax error.

gopalv 12 hours ago

> Making models larger improves overall accuracy but doesn't reliably reduce incoherence on hard problems.

Coherence requires 2 opposing forces to hold coherence in one dimension and at least 3 of them in higher dimensions of quality.

My team wrote up a paper titled "If You Want Coherence, Orchestrate a Team of Rivals"[1] because we kept finding that upping the reasoning threshold resulted in less coherence - more experimentation before we hit a dead-end to turn around.

So we had a better result from using Haiku (we fail over to Sonnet) over Opus and using a higher reasoning model to decompose tasks rather than perform each one of them.

Once a plan is made, the cheaper models do better as they do not double-think their approaches - they fail or they succeed, they are not as tenacious as the higher cost models.

We can escalate to higher authority and get out of that mess faster if we fail hard and early.

The knowledge of how exactly failure happened seems to be less useful to the higher reasoning model over the action biased models.

Splitting up the tactical and strategic sides of the problem, seems to work similarly to how Generals don't hold guns in a war.

[1] - https://arxiv.org/abs/2601.14351

[-]

Nevermark 8 hours ago

> Coherence requires 2 opposing forces

This seems very basic to any kind of information processing beyond straight shot predictable transforms.

Expansion and reduction of possibilities, branches, scope, etc.

Biological and artificial neural networks converging into multiple signals, that are reduced by competition between them.

Scientific theorizing, followed by experimental testing.

Evolutionary genetic recombination and mutation, winnowed back by resource competition.

Generation, reduction, repeat.

In a continually coordinated sense too. Many of our systems work best by encouraging simultaneous cooperation and competition.

Control systems command signal proportional to demand, vs. continually reverse-acting error feedback.

[-]

gopalv 6 hours ago

> This seems very basic

Yes, this is not some sort of hard-fought wisdom.

It should be common sense, but I still see a lot of experiments which measure the sound of one hand clapping.

In some sense, it is a product of laziness to automate human supervision with more agents, but on the other hand I can't argue with the results.

If you don't really want the experiments and data from the academic paper, we have a white paper which is completely obvious to anyone who's read High Output Management, Mythical Man Month and Philosophy of Software Design recently.

Nothing in there is new, except the field it is applied to has no humans left.

[-]

Nevermark 4 hours ago

> Yes, this is not some sort of hard-fought wisdom.

By basic I didn't mean uninteresting.

In fact, despite the pervasiveness and obviousness of the control and efficiency benefits of push-pull, generating-reducing, cooperation-competition, etc., I don't think I have ever seen any kind of general treatment or characterization that pulled all these similar dynamics together. Or a hierarchy of such.

> In some sense, it is a product of laziness to automate human supervision with more agents, but on the other hand I can't argue with the results.

I think it is the fact that the agents are operating coherently with the respective complementary goals. Whereas, asking one agent to both solve and judge creates conflicting constraints before a solution has begun.

Creative friction.

I am reminded of brainstorming sessions, where it is so important to note ideas, but not start judging them, since who knows what crazy ideas will fit or spark together. Later they can be selected down.

So we institutionalize this separation/staging with human teams too, even if it is just one of us (within our context limits, over two inference sessions :).

maxkfranz 9 hours ago

More or less, delegation and peer review.

CuriouslyC 12 hours ago

This is a good line: "It found that smarter entities are subjectively judged to behave less coherently"

I think this is twofold:

1. Advanced intelligence requires the ability to traverse between domain valleys in the cognitive manifold. Be it via temperature or some fancy tunneling technique, it's going to be higher error (less coherent) in the valleys of the manifold than naive gradient following to the local minima.

2. It's hard to "punch up" when evaluating intelligence. When someone is a certain amount smarter than you, distinguishing their plausible bullshit from their deep insights is really, really hard.

[-]

energy123 12 hours ago

Incoherence is not error.

You can have a vanishingly small error and an incoherence at its max.

That would be evidence of perfect alignment (zero bias) and very low variance.

xanderlewis 12 hours ago

What do 'domain valleys' and 'tunneling' mean in this context?

[-]

FuckButtons 8 hours ago

So, the hidden mental model that the OP is expressing and failed to elucidate on is that llm’s can be thought of as compressing related concepts into approximately orthogonal subspaces of the vector space that is upper bounded by the superposition of all of their weights. Since training has the effect of compressing knowledge into subspaces, a necessary corollary of that fact is that there are now regions within the vector space that contain nothing very much. Those are the valleys that need to be tunneled through, ie the model needs to activate disparate regions of its knowledge manifold simultaneously, which, seems like it might be difficult to do. I’m not sure if this is a good way of looking at things though, because inference isn’t topology and I’m not sure that abstract reasoning can be reduced down to finding ways to connect concepts that have been learned in isolation.

esafak 10 hours ago

A hallmark of intelligence is the ability to find connections between the seemingly disparate.

[-]

Earw0rm 2 minutes ago

That's also a hallmark of some mental/psychological illnesses (paranoid schizophrenia family) and use of certain drugs, particularly hallucinogens.

The hallmark of intelligence in this scenario is not just being able to make the connections, but being able to pick the right ones.

ithkuil an hour ago

The word "seemingly" is doing a lot of work here.

Sometimes things that look very different actually are represented with similar vectors in latent space.

When that happens to us it "feels like" intuition; something you can't really put a finger on and might require work to put into a form that can be transferred to another human that has a different mental model

w10-1 5 hours ago

Actually, a hallmark could be to prune illusory connections, right? That would decrease complexity rather than amplifying it.

TonyStr 6 hours ago

Does this make conspiracy theorists highly intelligent?

[-]

gylterud 5 hours ago

No, but they emulate intelligence by making up connections between seemingly disparate things, where there are none.

esyir 12 hours ago

Not the OP, but my interpretation here is that if you model the replies as some point in a vector space, assuming points from a given domain cluster close to each other, replies that span two domains need to "tunnel" between these two spaces.

booleandilemma 6 hours ago

> the ability to traverse between domain valleys in the cognitive manifold.

Couldn't you have just said "know about a lot of different fields"? Was your comment sarcastic or do you actually talk like that?

[-]

reverius42 3 hours ago

I think they mean both "know about a lot of different fields" and also "be able to connect them together to draw inferences", the latter perhaps being tricky?

p-e-w 11 hours ago

> When someone is a certain amount smarter than you, distinguishing their plausible bullshit from their deep insights is really, really hard.

Insights are “deep” not on their own merit, but because they reveal something profound about reality. Such a revelation is either testable or not. If it’s testable, distinguishing it from bullshit is relatively easy, and if it’s not testable even in principle, a good heuristic is to put it in the bullshit category by default.

[-]

CuriouslyC 10 hours ago

This was not my experience studying philosophy. After Kant there was a period where philosophers were basically engaged in a centuries long obfuscated writing competition. The pendulum didn't start to swing back until Neitchze. It reminded me of legal jargon but more pretentious and less concrete.

[-]

root_axis 10 hours ago

It seems to me that your anecdote exemplifies the their point.

skydhash 11 hours ago

The issue is the revelation. It's always individual at some level. And don't forget our senses are crude. The best way is to store "insights" as information until we collect enough data that we can test it again (hopefully without a lot of bias). But that can be more than a lifetime work, so sometimes you have to take some insights at face value based on heuristics (parents, teachers, elder, authority,...)

smy20011 12 hours ago

I think It's not because AI working on "misaligned" goals. The user never specify the goal clearly enough for AI system to work.

However, I think producing detailed enough specification requires same or even larger amount of work than writing code. We write rough specification and clarify these during the process of coding. I think there are minimal effort required to produce these specification, AI will not help you speed up these effort.

[-]

cobblestone32 an hour ago

> The user never specify the goal clearly enough for AI system to work.

This is sort of a fundamental problem with all AI. If you tell a robot assistant to "make a cup of tea", how's it supposed to know that that implies "don't break the priceless vase in the kitchen" and "don't step on the cat's tail", et cetera. You're never going to align it well enough with "human values" to be safe. Even just defining in human-understandable terms what those values are is a deep existential question of philosophy, let alone specifying it for a machine that's capable of acting in the world independently.

crabmusket 12 hours ago

That makes me wonder about the "higher and higher-level language" escalator. When you're writing in assembly, is it more work to write the code than the spec? And the reverse is true if you can code up your system in Ruby? If so, does that imply anything about the "spec driven" workflow people are using with AIs? Are we right on the cusp where writing natural language specs and writing high level code are comparably productive?

[-]

jaggederest 10 hours ago

I believe that the issue right now is that we're using languages designed for human creation in an AI context. I think we probably want languages that are optimized for AI written but human read code, so the surface texture is a lot different.

My particular hypothesis on this is something that feels a little bit like python and ruby, but has an absolutely insane overkill type system to help guide the AI. I also threw in a little lispiness on my draft: https://github.com/jaggederest/locque/

[-]

gf000 5 hours ago

I don't know, LLMs strive on human text, so I would wager that a language designed for humans would quite closely match an ideal one for LLMs. Probably the only difference is that LLMs are not "lazy", they better tolerate boilerplate, and lower complexity structures likely fit them better. (E.g. they can't really one-shot understand some imported custom operator that is not very common in its training data)

Also, they rely surprisingly closely on "good" code patterns, like comments and naming conventions.

So if anything, a managed language [1] with a decent type system and not a lot of features would be the best, especially if it has a lot of code in its training data. So I would rather vote on Java, or something close.

[1] reasoning about life times, even if aided by the compiler is a global property, and LLMs are not particularly good at that

charcircuit 11 hours ago

If you are on the same wave length as someone you don't need to produce a full spec. You can trust that the other person has the same vision as you and will pick reasonable ways to implement things. This is one reason why personalized AI agents are important.

skydhash 11 hours ago

Programming languages can be a thinking tool for a lot of tasks. Very much like a lot of notation, like music sheet and map drawing. A condensed and somewhat formal manner of describing ideas can increase communication speed. It may lack nuance, but in some case, nuance is harmful.

The nice thing about code compared to other notation is that it's useful on its. You describe an algorithm and the machine can then solve the problem ad infinitum. It's one step instead of the two step of writing a spec and having an LLM translate it, then having to verify the output and alter it.

Assembly and high level languages are equivalent in terms of semantics. The latter helps in managing complexity, by reducing harmful possibilities (managing memory, off-by-one errors) and presenting common patterns (iterators/collections, struct and other data structures, ....) so that categories of problems are easily solved. There's no higher level of computing model unlocked. Just faster level of productivity unlocked by following proven patterns.

Spec driven workflow is a mirage, because even the best specs will leave a lot of unspecified details. Which are crucial as most of programming is making the computer not do the various things it can do.

[-]

crabmusket 11 hours ago

> most of programming is making the computer not do the various things it can do

This is a very stimulating way of putting it!

dmix 10 hours ago

> I think producing detailed enough specification requires same or even larger amount of work than writing code

Our team has started dedicating much more time writing documentation for our SaaS app, no one seems to want to do it naturally, but there is very large potential for opening your system to machine automation. Not just for coding but customer facing tooling. I saw a preview of that possible future using NewRelic where they have an AI chat use their existing SQL-like query language to build tables and charts from natural language queries right in the web app. Theirs kinda sucks but there's so much potential there that it is very likely going to change how we build UIs and software interfaces.

Plus it also helps sales, support, and SEO having lots of documentation on how stuff works.

hogehoge51 10 hours ago

My thought too. To extend this coding agents will make code cheap, specifications cheaper, but may also invert the relative opportunity cost of not writing a good spec.

bob1029 4 hours ago

You simply can't have a single shot context with so many simultaneous constraints and expect to make forward progress. This cannot be solved with additional silicon, power or data.

Smaller prompts and fewer tools tends to be more stable. I try to stay within 1000 tokens and 10 tools for a single inference pass. I become visibly amused when I read many of the system prompts out there. Anthropomorphism is the biggest anti pattern with these models. It's a very easy and comfortable trap to fall into.

The core issue I see with coding agents is that the moment you read a file, you've polluted the context in terms of token coherence. It's probably not critical in most cases, but it's safer to pretend like it is. Recursive/iterative decomposition of the problem is the only thing I've seen so far that can scale arbitrarily. For example, if you invoke a sub agent every time you read a file, you can reduce the impact to the token budget of the caller by orders of magnitude. The callee can return a brief summary or yes/no response to the caller after reading 500kb of source. This applies at each level of recursion and can compound dramatically (exponentially) over just a few nested calls.

anupamchugh 7 hours ago

The "natural overthinking increases incoherence" finding matches my daily experience with Claude.

I maintain ~100 custom skills (specialized prompts). Sometimes Claude reads a skill, understands it, then overthinks itself into "helpful" variations that break the workflow.

Has anyone else found prompt density affects coherence?

BenoitEssiambre 9 hours ago

This matches my intuition. Systematic misalignment seems like it could be prevented by somewhat simple rules like the hippocratic oath or Asimov's Laws of robotics or rather probabilistic bayesian versions of these rules that take into account error bounds and risk.

The probabilistic version of "Do No Harm" is "Do not take excessive risk of harm".

This should work as AIs become smarter because intelligence implies becoming better bayesians which implies being great at calibrating confidence intervals of their interpretations and their reasoning and basically gaining a superhuman ability for evaluating the bounds of ambiguity and risk.

Now this doesn't mean that AIs won't be misaligned, only that it should be possible to align them. Not every AI maker will necessarily bother to align them properly, especially in adversarial, military applications.

leahtheelectron 10 hours ago

It's nice seeing this with Sohl-Dickstein as the last author after reading this blog post from him some time ago: https://sohl-dickstein.github.io/2023/03/09/coherence.html

tbrownaw 10 hours ago

Longer thinking sections have more space for noise to accumulate?

bjt12345 9 hours ago

> This suggests that scaling alone won't eliminate incoherence. As more capable models tackle harder problems, variance-dominated failures persist or worsen.

This is a big deal, but are they only looking at auto-regressive models?

bazzmt 6 hours ago

"model failures become increasingly dominated by incoherence rather than systematic misalignment."

This should not be surprising.

Systematic misalignment, i.e., bias, is still coherent and rational, if it is to be systematic. This would require that AI reason, but AI does not reason (let alone think), it does not do inference.

eande 7 hours ago

The findings are based on older models and assuming recent models behave similarly, what kind of prompt style one should use then to improve the outcome to avoid the increase in variance especially when you ask a model to solve really complex problems?

nayroclade 12 hours ago

The models they tested are already way behind the current state-of-the-art. Would be interesting to see if their results hold up when repeated with the latest frontier models.

cadamsdotcom 10 hours ago

When humans dream, we are disconnected from the world around us. Without the grounding that comes from being connected to our bodies, anything can happen in a dream.

It is no surprise that models need grounding too, lest their outputs be no more useful than dreams.

It’s us engineers who give arms and legs to models, so they can navigate the world and succeed at their tasks.

[-]

sayamqazi 5 hours ago

Also since dreams are built from the combinations of experiences that brain already knows so we cannot die in a dream as our brain does not know how to replicate what it would feel like after being dead. Basically LLMs also cannot produce truly novel ideas.

lewdwig 4 hours ago

I guess it’s reassuring to know Hanlon’s Razor holds for AGI too.

gnarlouse 7 hours ago

I feel vindicated when I say that the superintelligence control problem is a total farce, we won't get to superintelligence, it's tantamount to a religious belief. The real problem is the billionaire control problem. The human-race-on-earth control problem.

[-]

MrOrelliOReilly 7 hours ago

I don’t believe the article makes any claims on the infeasibility of a future ASI. It just explores likely failure modes.

It is fine to be worried about both alignment risks and economic inequality. The world is complex, there are many problems all at once, we don’t have to promote one at the cost of the other.

HNisCIS 7 hours ago

Yeah article aside, looking back on all the AGI stuff from the last year or so really puts our current moment in protective.

This whole paradigm of AI research is cool and all but it's ultimately a simple machine that probabilistically forms text. It's really good at making stuff that sounds smart but like looking at an AI picture, it falls apart the harder you look at it. It's good at producing stuff that looks like code and often kinda works but based on the other comments in this thread I don't think people really grasp how these models work.

hogehoge51 10 hours ago

My ignorant question: They did bias and variance noise, how about quantisation noise? I feel like sometimes agents are "flipfloping" between metastable divergent interpretations of the problem or solution.

root_axis 9 hours ago

This is very interesting research and a great write up.

I just want to nitpick something that really annoys me that has become extremely common: the tendency to take every opportunity to liken all qualities of LLMs to humans. Every quirk, failure, oddity, limitation, or implementation detail is relentlessly anthropomorphized. It's to the point where many enthusiasts have convinced themselves that humans think by predicting the next token.

It feels a bit like a cult.

Personally, I appreciate more sobriety in tech, but I can accept that I'm in the minority in that regard.

IgorPartola 12 hours ago

For some reason the article reads to me like “AI is not evil, it just has accidents when it loses coherence.” Sounds a lot like liability shifting.

[-]

dmix 10 hours ago

They compared it to industrial accidents. I don't think a software company would try to shift liability by comparing themselves to factories explosions and chemical spills.

tsunamifury 12 hours ago

I don’t know why it seems so hard for these guys to understand you scorecard every step for new strategy to Close distance at goal and if you have multiple generated forward options with no good weight you spawn a new agent and multiple paths. Then you score all the terminal branches and prune.

LLMs aren’t constrained to linear logic like your average human.

throwpoaster 12 hours ago

[flagged]

[-]

dang 9 hours ago

Could you please stop posting unsubstantive comments and flamebait? You've unfortunately been doing it repeatedly. It's not what this site is for, and destroys what it is for.

If you wouldn't mind reviewing https://news.ycombinator.com/newsguidelines.html and taking the intended spirit of the site more to heart, we'd be grateful.