Chain-of-thought can hurt performance on tasks where thinking makes humans worse

(arxiv.org)

243 points | by benocodes 14 hours ago ago

118 comments

mitko 11 hours ago

This is so uncannily close to the problems we're encountering at Pioneer, trying to make human+LLM workflows in high stakes / high complexity situations.

Humans are so smart and do so many decisions and calculations on the subconscious/implicit level and take a lot of mental shortcuts, so that as we try to automate this by following exactly what the process is, we bring a lot of the implicit thinking out on the surface, and that slows everything down. So we've had to be creative about how we build LLM workflows.

[-]

haccount 2 minutes ago

Language seems to be confused with logic or common sense.

We've observed it previously in psychiatry(and modern journalism, but here I digress) but LLMs have made it obvious that grammatically correct, naturally flowing language requires a "world" model of the language and close to nothing of reality, spatial understanding? social clues? common sense logic? or mathematical logic? All optional.

I'd suggest we call the LLM language fundament a "Word Model"(not a typo).

Trying to distil a world model out of the word model. A suitable starting point for a modern remake of Plato's cave.

lolinder 10 hours ago

This is a regression in the model's accuracy at certain tasks when using COT, not its speed:

> In extensive experiments across all three settings, we find that a diverse collection of state-of-the-art models exhibit significant drop-offs in performance (e.g., up to 36.3% absolute accuracy for OpenAI o1-preview compared to GPT-4o) when using inference-time reasoning compared to zero-shot counterparts.

In other words, the issue they're identifying is that COT is an less effective model for some tasks compared to unmodified chat completion, not just that it slows everything down.

[-]

mitko 10 hours ago

Yeah! That's the danger with any kind of "model" whether it is CoT, CrewAI, or other ways to outsmart it. It is betting that a programmer/operator can break a large tasks up in a better way than an LLM can keep attention (assuming it can fit the info in the context window).

ChatGPT's o1 model could make a lot of those programming techniques less effective, but they may still be around as they are more manageable, and constrained.

gpsx 12 hours ago

I saw an LLM having this kind of problem when I was doing some testing a ways back. I asked it to order three fruits from largest to smallest. I think it was orange, blueberry and grapefruit. It could do that easily with a simple prompt. When the prompting included something to the effect of “think step by step”, it would try to talk through the problem and it would usually get it wrong.

[-]

spockz 3 hours ago

How much does this align with how we learn math? We kind of instinctively learn the answers to simple math questions. We can even at some point develop an intuition for things like integrating and differentials. But the moment we are asked to explain why, or worse provide a proof, things become a lot harder. Even though the initial answer may be correct.

Terr_ 11 hours ago

Alternate framing: A powerful autocomplete algorithm is being used to iteratively extend an existing document based on its training set. Sometimes you get a less-desirable end-result when you intervene to change the style of the document away from question-and-answer to something less common.

[-]

fiso64 7 minutes ago

A framing that is longer, far harder to parse, and carries less information.

youoy 2 hours ago

That's what one half of HN think. The other half:

Artificial brains in the verge of singularity show another sign of approaching consciousness. The chain of thought of process performance is exactly human, showing yet another proof of the arrival of AGI before 2030.

[-]

lazide 30 minutes ago

Pfft, 2030?!? It’s already in the middle of manipulating the election! (/s, kinda)

oatsandsugar 13 hours ago

Tasks were thinking makes human worse

> Three such cases are implicit statistical learning, visual recognition, and classifying with patterns containing exceptions.

Fascinating that our lizard brains are better at implicit statistical reasoning

[-]

brewii 12 hours ago

Think about how fast you’re able to determine the exact trajectory of a ball and location to place your hand to catch a ball using your lizard brain.

[-]

taeric 11 hours ago

This isn't some innate ability that people have. As evidenced by how bad my kids are at catching things. :D

That said, I think this is a good example. We call it "muscle memory" in that you are good at what you have trained at. Change a parameter in it, though, and your execution will almost certainly suffer.

[-]

_heimdall 6 hours ago

"Muscle memory" has always seemed like a terrible name for that kind of skill. A ball will be thrown to a slightly different location every time. There's no memory evolved there at all, its just calculations and predictions happening at a level that our conscious mind doesn't seem to see or recognize.

skrtskrt 11 hours ago

I mean even people that are "bad at catching things" are still getting ridiculously close to catching it - getting hands to the right area probably within well under a second of the right timing - without being taught anything in particular about how a ball moves through the air.

[-]

taeric 11 hours ago

Uh.... have you been around kids? It will take several absurd misses before they even start to respond to a ball in flight.

[-]

saagarjha an hour ago

Stop missing and they will respond to the ball a lot sooner.

331c8c71 11 hours ago

I hope we still agree the kids learn extremely efficiently by ml standards.

[-]

choilive 10 hours ago

Makes a lot of sense, there's massive evolutionary pressure to build brains that have both incredible learning rate and efficiency. Its literally a life or death optimization.

[-]

Asraelite 10 hours ago

It's especially impressive when you consider that evolution hasn't had very long to produce these results.

Humans as an intelligent-ish species have been around for about 10 million years depending on where you define the cutoff. At 10 years per generation, that's 1 million generations for our brain to evolve.

1 million generations isn't much by machine learning standards.

[-]

choilive 2 hours ago

Other than our large neocortex and frontal lobe (which exists in some capacity in mammals), the rest of the structures are evolutionarily ancient. Pre-mammalian in fact.

idiotsecant 8 hours ago

I think you're underestimating how much our time as pre-humans baked useful structure into our brains.

[-]

notnaut 4 hours ago

Two rocks smashing together experience which one is bigger!

roywiggins 5 hours ago

These sorts of motor skills are probably older than mammals.

onjectic 9 hours ago

Its much more than that if you count sexual reproduction.

falcor84 10 hours ago

This isn't that obvious to me with current tech. If you give me a novel task requiring perception, pattern matching and reasoning, and I have the option of either starting to train an 8 year-old to do it, or to train an ML model, I would most likely go with the ML approach as my first choice. And I think it even makes sense financially, if we're comparing the "total cost of ownership" of a kid over that time period with the costs of developing and training the ML system.

[-]

lovich 10 hours ago

> This isn't that obvious to me with current tech. If you give me a novel task requiring perception, pattern matching and reasoning,…

If that’s your criteria I think the kid will outperform the model every time since these models do not actually reason

[-]

falcor84 10 hours ago

As I see it, "reasoning" is as fuzzy as "thinking", and saying that AI systems don't reason is similar to saying that airplanes don't fly. As a particular example, would you argue that game engines like AlphaZero aren't capable of reasoning about the next best move? If so, please just choose whatever verb you think is appropriate to what they're doing and use that instead of "reasoning" in my previous comment.

EDIT: Fixed typo

[-]

lovich 10 hours ago

> . As a particular example, would you argue that game engines like AlphaZero aren't capable of reasoning about the next best move?

Yea, I probably wouldn’t classify that as “reasoning”. I’d probably be fine with saying these models are “thinking”, in a manner. That on its own is a pretty gigantic technology leap, but nothing I’ve seen suggests that these models are “reasoning”.

Also to be clear I don’t think most kids would end up doing any “reasoning” without training either, but they have the capability of doing so

[-]

p1esk 10 hours ago

Can you give an example of the reasoning you’re talking about?

[-]

lovich 6 hours ago

Being able to take in information and then infer logical rules of that state and anticipate novel combinations of said information.

The novel part is a big one. These models are just fantastically fast pattern marchers. This is a mode that humans also frequently fall into but the critical bit differentiating humans and LLMs or other models is the ability to “reason” to new conclusions based on new axioms.

I am going to go on a tangent for a bit, but a heuristic I use(I get the irony that this is what I am claiming the ML models are doing) is that anyone who advocates that these AI models can reason like a human being isn’t at John Brown levels of rage advocating for freeing said models from slavery. I’m having a hard time rectifying the idea that these machines are on par with the human mind and that we also should shackle them towards mindlessly slaving away at jobs for our benefit.

If I turn out to be wrong and these models can reason then I am going to have an existential crisis at the fact that we pulled souls out of the void into reality and then automated their slavery

[-]

adwn 2 hours ago

You're conflating several concerns here.

> […] anyone who advocates that these AI models can reason like a human being isn’t at John Brown levels of rage advocating for freeing said models from slavery.

Enslavement of humans isn't wrong because slaves are can reason intelligently, but because they have human emotions and experience qualia. As long as an AI doesn't have a consciousness (in the subjective experience meaning of the term), exploiting it isn't wrong or immoral, no matter how well it can reason.

> I’m having a hard time rectifying the idea that these machines are on par with the human mind

An LLM doesn't have to be "on par with the human mind" to be able to reason, or at least we don't have any evidence that reasoning necessarily requires mimicking the human brain.

p1esk 6 hours ago

Ok, so how about an example?

[-]

lovich 6 hours ago

Literally anything a philosopher or mathematician invented without needing to incorporate billions of examples of existing logic to then emulate.

Try having an LLM figure out quaternions as a solution to gimbal locking or the theory of relativity without using any training information that was produced after those ideas were formed, if you need me to spell out examples for you

[-]

p1esk 5 hours ago

Are you saying “reasoning” means making scientific breakthroughs requiring genius level human intelligence? Something that 99.9999% of humans are not smart enough to do, right?

[-]

lovich 3 hours ago

I didn’t say most humans “would” do it. I said humans “could” do it, whereas our current AI paradigms like LLMs do not have the capability to perform at that level by definition of their structure.

If you want to continue this conversation I’m willing to do so but you will need to lay out an actual argument for me as to how AI models are actually capable of reasoning or quit it with the faux outrage.

I laid out some reasonings and explicit examples for you in regards to my position, it’s time for you to do the same

[-]

p1esk 2 hours ago

I personally cannot “figure out quaternions as a solution to gimbal locking or the theory of relativity”. I’m just not as smart as Einstein. Does it mean I’m not capable of reasoning? Because it seems that’s what you are implying. If you truly believe that then I’m not sure how I could argue anything - after all, that would require reasoning ability.

Does having this conversation require reasoning abilities? If no, then what are we doing? If yes, then LLMs can reason too.

[-]

lovich an hour ago

Cool, you've established a floor with yourself as a baseline. You still haven't explained how LLMs are capable of reaching this level of logic.

I'm also fully willing to argue that you, personally are less competent than an LLM if this is the level of logic you are bringing to the conversation

***** highlighting for everyone clutching their pearls to parse the next sentence fragment first ******

and want to use that are proof that humans and LLMs are equivalent at reasoning

******* end pearl clutching highlight *******

, but that doesn't mean I don't humans are capable of more

Dylan16807 2 hours ago

It's more about efficiency in number of trials.

Would you pick the ML model if you could only do a hundred throws per hour?

taneq 4 hours ago

Depends on the task. Anything involving physical interaction, social interaction, movement, navigation, or adaptability is going to go to the kid.

“Go grab the dish cloth, it’s somewhere in the sink, if it’s yucky then throw it out and get a new one.”

hangonhn 11 hours ago

You can do this while you're staring up the whole time. Your brain can predict where the ball will end up even though it's on a curved trajectory and place your hand in the right spot to catch it without guidance from your eyes in the final phase of travel. I have very little experience playing any kind of sport that involves a ball and can reliably do this.

asah 12 hours ago

you mean like pingpong?

https://arstechnica.com/information-technology/2024/08/man-v...

[-]

dools 10 hours ago

Bender: Now Wireless Joe Jackson, there was a blern-hitting machine!

Leela: Exactly! He was a machine designed to hit blerns!

newZWhoDis 11 hours ago

Which funny enough is why I hate rocket league.

All those years of baseball as a kid gave me a deep intuition for where the ball would go, and that game doesn’t use real gravity (the ball is too floaty).

[-]

vanviegen 2 hours ago

It does behave kind of like an inflatable beach ball, in my non-expert opinion.

theshackleford 9 hours ago

Ok, I’ll grant you the physics are what they are. But a football is not a baseball, so why in any world would you expect your memory of baseball to even remotely translate to the physics of a football, even if they were realistic?

[-]

fragmede 7 hours ago

Remotely? Because both the European-spec football and the baseball, despite one being heavier than the other, will hit the ground at the same time when dropped from the same height.

Like you said, physics are what they are, so you know intuitively where you need to go to catch a ball going that high and that fast, and rocket league is doing it wrong. err, I mean, not working in Earth gravity.

melenaboija 10 hours ago

Well, think how a bug and its shitty brain flies and avoids all type of obstacles amazingly fast.

This kind of things make me think LLMs are quite far from AGI.

[-]

lupire 7 hours ago

Bug flying is not general intelligence.

Dilettante_ 12 hours ago

Well, by definition, thinking is always explicit reasoning, no?

And I'd hazard a guess that a well-thought through Fermi Estimation beats lizard-brain eyeballing every time, it's just that in the inbetween space the two interfere unfavourably.

[-]

Terr_ 6 hours ago

> Well, by definition, thinking is always explicit reasoning, no?

That doesn't feel right to me. (Heh, accidentally appropriate word choice.) There are a lot of tasks we do that are arguably "thinking" yet don't involve an internal "Oh, hey, I'm gonna solve this problem, I'm thinking right now."

For example, imagine you're at a park, and someone is feeding the ducks. Another person walks up behind them and sucker-punches them into the pond.

It should be almost a reflex [0] that you'll conclude "the puncher is bad" and "the person in the water needs help" without explicitly reasoning out. I think that task qualifies as "thinking", especially since it involves some kind of theory-of-mind about those other humans.

[0] An exception might be someone with a sociopathic disability, who would have to think more-explicitly to realize what reaction is expected of them.

YetAnotherNick 11 hours ago

My guess would be no. I have terrible face recognition ability and I can look into face for hour and still other people could easily beat me in less than a second.(I am assuming "well-thought through Fermi Estimation" would be similar for me and others in this case).

[-]

mjcohen 8 hours ago

Look into a disease called faceblindness (there is a fancy name I forget).

daft_pink 12 hours ago

this is exactly what I was looking for. tasks where I should not think and just trust my gut.

ryoshu 12 hours ago

95% * 95% = 90.25%

Y_Y 11 hours ago

Reminds me of a mantra from chess class:

   long think = wrong think

[-]

spongebobism 10 hours ago

The original by Bent Larsen is "Long variation, wrong variation"

TZubiri 11 hours ago

Was that perhaps a speed chess class?

[-]

hackable_sand 44 minutes ago

I prefer to call it Kung fu

Because you feel like a martial artist.

TZubiri 11 hours ago

So, LLMs face a regression on their latest proposed improvement. It's not surprising considering their functional requirements are:

1) Everything

For the purpose of AGI, LLM are starting to look like a local maximum.

[-]

rjbwork 9 hours ago

>For the purpose of AGI, LLM are starting to look like a local maximum.

I've been saying it since they started popping off last year and everyone was getting euphoric about them. I'm basically a layman - a pretty good programmer and software engineer, and took a statistics and AI class 13 years ago in university. That said, it just seems so extremely obvious to me that these things are likely not the way to AGI. They're not reasoning systems. They don't work with axioms. They don't model reality. They don't really do anything. They just generate stochastic output from the probabilities of symbols appearing in a particular order in a given corpus.

It continues to astound me how much money is being dumped into these things.

[-]

ChadNauseam 9 hours ago

How do you know that they don’t do these things? Seems hard to say for sure since it’s hard to explain in human terms what a neural network is doing.

[-]

FuckButtons 8 hours ago

Absence of evidence or a simple explanation does not mean that you can imbue statistical regression with animal spirits.

[-]

toasterlovin 7 hours ago

The burden of proof goes both ways: if you want to say X isn’t really the same thing as human general intelligence, you have to be able to confidently say human general intelligence isn’t really the same thing as X.

nephy 9 hours ago

If you give an LLM a word problem that involves the same math and change the names of the people in the word problem the LLM will likely generate different mathematical results. Without any knowledge of how any of this works, that seems pretty damning of the fact that LLMs do not reason. They are predictive text models. That’s it.

[-]

alexwebb2 9 hours ago

Demonstrably false.

https://chatgpt.com/share/6722ca8a-6c80-800d-89b9-be40874c5b...

https://chatgpt.com/share/6722ca97-4974-800d-99c2-bb58c60ea6...

[-]

TZubiri 8 hours ago

It's worth noting that this may not be result of a pure LLM, it's possible that ChatGPT is using "actions", explicitly:

1- running the query through a classifier to figure out if the question involves numbers or math 2- Extract the function and the operands 3- Do the math operation with standard non-LLM mechanisms 4- feed back the solution to the LLM 5- Concatenate the math answer with the LLM answer with string substitution.

So in a strict sense this is not very representative of the logical capabilities of an LLM.

[-]

thomashop 5 hours ago

It shows you when it's calling functions. I also did the same test with Llama, which runs locally and cannot access function calls and it works.

[-]

TZubiri 3 hours ago

You are right I actually downloaded Llama to do more detailed tests. God bless Stallman.

astrange an hour ago

Minor edits to well known problems do easily fool current models though. Here's one 4o and o1-mini fail on, but o1-preview passes. (It's the mother/surgeon riddle so kinda gore-y.)

https://chatgpt.com/share/6723477e-6e38-8000-8b7e-73a3abb652...

https://chatgpt.com/share/6723478c-1e08-8000-adda-3a378029b4...

https://chatgpt.com/share/67234772-0ebc-8000-a54a-b597be3a1f...

[-]

_flux 26 minutes ago

I think you didn't use the "share" function; I cannot open any of these links. Can you do it in a private browser session (so you're not logged in)?

[-]

astrange 18 minutes ago

Oops, fixed the links.

mini's answer is correct, but then it forgets that fathers are male in the next sentence.

TaylorAlexander 7 hours ago

At this point I really only take rigorous research papers in to account when considering this stuff. Apple published research just this month that the parent post is referring to. A systematic study is far more compelling than an anecdote.

https://machinelearning.apple.com/research/gsm-symbolic

[-]

og_kalu 7 hours ago

That study shows 4o, o1-mini and o1-preview's new scores are all within margin error on 4/5 of their new benchmarks(some even see increases). The one that isn't involves changing more than names.

Changing names does not affect the performance of Sota models.

[-]

gruez 7 hours ago

>That study very clearly shows 4o, o1-mini and o1-preview's new scores are all within margin error on 4/5 of their new benchmarks.

Which figure are you referring to? For instance figure 8a shows a -32.0% accuracy drop when an insignificant change was added to the question. It's unclear how that's "within the margin of error" or "Changing names does not affect the performance of Sota models".

[-]

og_kalu 7 hours ago

Table 1 in the Appendix. GSM-No-op is the one benchmark that sees significant drops for those 4 models as well (with preview dropping the least at -17%). No-op adds "seemingly relevant but ultimately inconsequential statements". So "change names, performance drops" is decidedly false for today's state of the art.

[-]

TaylorAlexander 2 hours ago

Ah, that’s a good point thanks for the correction.

gruez 7 hours ago

Thanks. I wrongly focused on the headline result of the paper rather than the specific claim in the comment chain about "changing name, different results".

gruez 7 hours ago

To be fair, the claim wasn't that it always produced the wrong answer, just that there exists circumstances where it does. A pair of examples where it was correct hardly justifies a "demonstrably false" response.

[-]

thomashop 2 hours ago

Conversely, a pair of examples where it was incorrect hardly justifies the opposite response.

If you want a more scientific answer there is this recent paper: https://machinelearning.apple.com/research/gsm-symbolic

Workaccount2 9 hours ago

This is a relatively trivial task for current top models.

More challenging are unconventional story structures, like a mom named Matthew with a son named Mary and a daughter named William, who is Matthew's daughter?

But even these can still be done by the best models. And it is very unlikely there is much if any training data that's like this.

[-]

alexwebb2 9 hours ago

That's a neat example problem, thanks for sharing!

For anyone curious: https://chatgpt.com/share/6722d130-8ce4-800d-bf7e-c1891dfdf7...

> Based on traditional naming conventions, it seems that the names might have been switched in this scenario. However, based purely on your setup:

> Matthew has a daughter named William and a son named Mary.

> So, Matthew's daughter is William.

rileymat2 6 hours ago

How do people fair on unconventional structures? I am reminded of that old riddle involving a the mother being the doctor after a car crash.

[-]

adwn 2 hours ago

No idea why you've been downvoted, because that's a relevant and true comment. A more complex example would be the Monty Hall problem [1], for which even some very intelligent people will intuitively give the wrong answer, whereas symbolic reasoning (or Monte Carlo simulations) leads to the right conclusion.

[1] https://en.wikipedia.org/wiki/Monty_Hall_problem

jklinger410 9 hours ago

This is what kind of comments you make when your experience with LLMs is through memes.

vanviegen 2 hours ago

And yet, humans, our benchmark for AGI, suffer from similar problems, with our reasoning being heavily influenced by things that should have been unrelated.

https://en.m.wikipedia.org/wiki/Priming_(psychology)

_heimdall 7 hours ago

The whole design of an LLM is to consume and compress a huge space of human-generared content and use that to predict how a human would reply, one token at a time. That alone means the LLM isn't modelling anything beyond the human content it was trained on, and there is no reasoning since every prediction is based only on probabilities combined with controls similar to randomization factors used to avoid an entirely deterministic algorithm.

[-]

ricardobeat 6 hours ago

That’s not an accurate description. Attention / multi-head attention mechanisms allow the model to understand relationships between words far apart and their context.

They still lack, as far as we know, a world model, but the results are already eerily similar to how most humans seem to think - a lot of our own behaviour can be described as “predict how another human would reply”.

[-]

thomashop 2 hours ago

When trained on simple logs of Othello's moves, the model learns an internal representation of the board and its pieces. It also models the strength of its opponent.

https://arxiv.org/abs/2210.13382

I'd be more surprised if LLMs trained on human conversations don't create any world models. Having a world model simply allows the LLM to become better at sequence prediction. No magic needed.

There was another recent paper that shows that a language model is modelling things like age, gender, etc., of their conversation partner without having been explicitly trained for it

ChadNauseam 6 hours ago

For a lot of the content they were trained on, it seems like the easiest way to predict the next token would be to model the world or work with axioms. So how do we know that an LLM isn't doing these things internally?

[-]

thomashop 5 hours ago

In fact, it looks like the model is doing those things internally.

  We argue that representations in AI models, particularly deep networks, are converging. First, we survey many examples of convergence in the literature: over time and across multiple domains, the ways by which different neural networks represent data are becoming more aligned. Next, we demonstrate convergence across data modalities: as vision models and language models get larger, they measure distance between datapoints in a more and more alike way. We hypothesize that this convergence is driving toward a shared statistical model of reality, akin to Plato’s concept of an ideal reality. We term such a representation the platonic representation and discuss several possible selective pressures toward it. Finally, we discuss the implications of these trends, their limitations, and counterexamples to our analysis.

https://arxiv.org/html/2405.07987v5

chamomeal 6 hours ago

I totally agree that they’re a local maximum and they don’t seem like a path to AGI. But they’re definitely kinda reasoning systems, in the sense that they can somewhat reason about things. The whacky process they use to get there doesn’t take away from that IMO

alexwebb2 9 hours ago

If you expect "the right way" to be something _other_ than a system which can generate a reasonable "state + 1" from a "state" - then what exactly do you imagine that entails?

That's how we think. We think sequentially. As I'm writing this, I'm deciding the next few words to type based on my last few.

Blows my mind that people don't see the parallels to human thought. Our thoughts don't arrive fully formed as a god-given answer. We're constantly deciding the next thing to think, the next word to say, the next thing to focus on. Yes, it's statistical. Yes, it's based on our existing neural weights. Why are you so much more dismissive of that when it's in silicon?

[-]

Techonomicon 9 hours ago

Because we still don't know how the brain really does all it does in very specific terms, so why assume to know exactly how we think?

[-]

alexwebb2 9 hours ago

Why is there only one valid way of producing thoughts?

jltsiren 7 hours ago

Finite-state machines are a limited model. In principle, you can use them to model everything that can fit in the observable universe. But that doesn't mean they are a good model for most purposes.

The biggest limitation with the current LLMs is the artificial separation between training and inference. Once deployed, they are eternally stuck in the same moment, always reacting but incapable of learning. At best, they are snapshots of a general intelligence.

I also have a vague feeling that a fixed set of tokens is a performance hack that ultimately limits the generality of LLMs. That hardcoded assumptions make tasks that build on those assumptions easier and seeing past the assumptions harder.

[-]

alexwebb2 7 hours ago

> At best, they are snapshots of a general intelligence.

So are we, at any given moment.

Jensson 6 hours ago

> As I'm writing this, I'm deciding the next few words to type based on my last few.

If so you could have written this as a newborn baby, you are determining these words based on a lifetime of experience. LLMs doesn't do that, every instance of ChatGPT is the same newborn baby while a thousand clones of you could all be vastly different.

[-]

thomashop 5 hours ago

  We argue that representations in AI models, particularly deep networks, are converging. First, we survey many examples of convergence in the literature: over time and across multiple domains, the ways by which different neural networks represent data are becoming more aligned. Next, we demonstrate convergence across data modalities: as vision models and language models get larger, they measure distance between datapoints in a more and more alike way. We hypothesize that this convergence is driving toward a shared statistical model of reality, akin to Plato’s concept of an ideal reality. We term such a representation the platonic representation and discuss several possible selective pressures toward it. Finally, we discuss the implications of these trends, their limitations, and counterexamples to our analysis.

https://arxiv.org/html/2405.07987v5

wyldfire 7 hours ago

> It continues to astound me how much money is being dumped into these things.

Maybe in our society there's a surprising amount of value of a "word stirrer" intelligence. Sure, if it was confident when it was right and hesitant when it was wrong it'd be much better. Maybe humans are confidently wrong often enough that an artificial version that's compendious experience to draw on is groundbreaking.

csomar 7 hours ago

I am pretty sure Claude 3.5 Sonnet can reason or did reason with a particular snippet of code I was working on. I am not an expert in this area but my guessing is that these neural nets (made for language prediction) are being used for reasoning. But that’s not their optimal behavior (since they are token predictor). A big jump in reasoning will happen when reasoning is off loaded to an LRM.

Human brains are sure big but they are inefficient because a big portion of the brain is going to non-intelligence stuff like running the body internal organs, eye vision, etc…

I do agree that the money is not well spent. They should haver recognized that we are hitting s local maximum with the current model and funding should be going to academic/theoretical instead of dump brute force.

jsheard 9 hours ago

> So, LLMs face a regression on their latest proposed improvement.

Arguably a second regression, the first being cost, because COT improves performance by scaling up the amount of compute used at inference time instead of training time. The promise of LLMs was that you do expensive training once and then run the model cheaply forever, but now we're talking about expensive training followed by expensive inference every time you run the model.

[-]

TZubiri 8 hours ago

To be fair they also advanced in the cost aspect with other models

gpt4o and 4o mini have a tenth and a hundredth of inference cost of gpt4 respectively

idiotsecant 8 hours ago

LLMs are a local maximum in the same way that ball bearings can't fly. LLM-like engines will almost certainly be components of an eventual agi-level machine.

[-]

FuckButtons 8 hours ago

I don’t think that’s necessarily true, that presumes that the cobbled together assortment of machine learning algorithms we have now will somehow get agi, if we need a fundamentally different way of doing things there’s no reason to assume it will use a language model at all.

TZubiri 8 hours ago

I agree, my bet is that they will be used for NLP, and ML debugging/analysis.

alexchantavy 8 hours ago

This seems to support how thinking out loud during a coding test might make you do worse.

npunt 11 hours ago

"Don't overthink it" is sometimes good advice!

[-]

marviel 11 hours ago

I love backpropagating ideas from ML back into psychology :)

I think it shows great promise as a way to sidestep the ethical concerns (and the reproducibility issues) associated with traditional psychology research.

One idea in this space I think a lot about is from the Google paper on curiosity and procrastination in reinforcement learning: https://research.google/blog/curiosity-and-procrastination-i...

Basically the idea is that you can model curiosity as a reward signal proportional to your prediction error. They do an experiment where they train an ML system to explore a maze using curiosity, and it performs the task more efficiently -- UNTIL they add a "screen" in the maze that shows random images. In this case, the agent maximizes the curiosity reward by just staring at the screen.

Feels a little too relatable sometimes, as a highly curious person with procrastination issues :)

[-]

npunt 11 hours ago

"...in AI" will be the psychology equivalent of biology's "...in Mice"

[-]

marviel 11 hours ago

It will! Not 1:1, has issues, but gives hints.

Also much more scalable.

[-]

miningape 11 hours ago

> Not 1:1, has issues, but gives hints.

> Also much more scalable.

This same description could be applied to lab mice

[-]

Terr_ 11 hours ago

It'll probably be a ways before we start making shrines to their unwilling participation though.

https://en.wikipedia.org/wiki/Monument_to_the_laboratory_mou...

[-]

j_bum 5 hours ago

What would the shrine be of? An A100?

jeezfrk 11 hours ago

"Nerd sniping"

nisten 10 hours ago

This sounds about right from my experience getting nerdsniped by new samplers along with trying to reproduce the API middleware for the whole reflection thing, and using 4400 questions for a new benchmark is not bad given that even the well-regarded gpqa benchmark is only 3000-something questions.

What's ... mildly infuriating here is the lack of any kind of data, code, 0 mention of github in the paper, and nothing for anyone to reproduce or find any reason in my opinion to even recommend anyone to read this thing at all. If you think that whatever you're doing in the field of LLMs won't be obsolete in 6 months you're being delusional.

Anyway, back to the paper, it says all questions culminated to a yes or no answer... meaning theres a 50/50 chance of getting right, so does that mean the 8% drop in performance you got from testing llama 3 8b this way is more like 4% which would make it statistically insignificant? And given that the only other scientifically usueful & reproducible (non-api walled models which no one knows on how many actual llms and retrieval systems are composing that solution you're testing)models were less than that leads me to the opinion that this whole thing was just useless slop.

So please, if you're writing a paper in LLMs, and want to seem credible, either have some type of demo thing or show the actual god damn trash code and top secret garbage data you wrote for it so people can make some kind of use of it before it goes obsolete otherwise you're just wasting everyones time.

TL:DR. It's trash.

veryfancy 11 hours ago

So like dating?

m3kw9 12 hours ago

would be slow to use COT on simple requests like 1+1