LLMs are parameter-based representations of linguistic representations of the world. Relative to robot predictive control problems, they are low-dimensional and static. They are batch-trained using supervised learning and are not designed to manage real-time shifts in the external world or the reward space. They work because they operate in abstract, rule-governed spaces like language and mathematics. They are ill-suited to predictive control tasks. They are the IBM 360s of AI. Even so, they are astonishing achievements.
LeCun is right to say that continuous self-supervised (hierarchical) learning is the next frontier, and that means we need world models. I'm not sure that JEPA is the right tool to get us past that frontier, but at the moment there are not a lot of alternatives on the table.
In "From Words to Worlds: Spatial Intelligence is AI’s Next Frontier", Li states directly "I’m not a philosopher", then proceeds to make a philosophical argument that elevates visual perception as the basis for the evolution of intelligence.
I'm sure there are other valid reasons, but I think the most obvious one is that LLMs are not improving as fast as the money demands, so we're moving on to the next buzzword.
Danijar Hafner just left DeepMind. He's behind the Dreamer series of models which are IMO the most promising direction for world models anyone has come up with yet. I'm wondering where he's headed. Maybe he could end up at LeCun's startup?
In Dreamer 4 they are able to train an agent to play Minecraft with enough skill to obtain diamonds, without ever playing the game at all. Only by watching humans play. They first build a world model, then train the agent purely in scenarios imagined by the world model, requiring zero extra data or experience. Hopefully it's obvious how generating data from a world model might be useful for training agents in domains where we don't have datasets like the entire internet just sitting around ready-made for us to use.
https://danijar.com/project/dreamer4/
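A minimal sketch of that two-phase recipe, with toy dimensions and random tensors standing in for real logged gameplay (an illustration of the idea, not the actual Dreamer 4 code):

    import torch
    import torch.nn as nn

    obs_dim, act_dim, latent_dim = 16, 4, 32

    encoder = nn.Sequential(nn.Linear(obs_dim, latent_dim), nn.Tanh())
    dynamics = nn.Sequential(nn.Linear(latent_dim + act_dim, latent_dim), nn.Tanh())
    reward_head = nn.Linear(latent_dim, 1)
    policy = nn.Sequential(nn.Linear(latent_dim, act_dim), nn.Tanh())

    # Phase 1: fit the world model on logged experience (random stand-ins here
    # for batches of (obs, action, next_obs, reward) from human play).
    wm_params = (list(encoder.parameters()) + list(dynamics.parameters())
                 + list(reward_head.parameters()))
    wm_opt = torch.optim.Adam(wm_params, lr=1e-3)
    for _ in range(200):
        obs, act = torch.randn(64, obs_dim), torch.randn(64, act_dim)
        next_obs, rew = torch.randn(64, obs_dim), torch.randn(64, 1)
        z, z_next = encoder(obs), encoder(next_obs)
        pred_next = dynamics(torch.cat([z, act], dim=-1))
        loss = ((pred_next - z_next.detach()) ** 2).mean() + ((reward_head(z) - rew) ** 2).mean()
        wm_opt.zero_grad(); loss.backward(); wm_opt.step()

    # Phase 2: train the policy purely on rollouts imagined by the learned
    # world model -- no further environment interaction or data needed.
    pi_opt = torch.optim.Adam(policy.parameters(), lr=1e-3)
    for _ in range(200):
        z = encoder(torch.randn(64, obs_dim)).detach()   # imagined start states
        imagined_return = torch.zeros(())
        for _ in range(10):                              # imagination horizon
            a = policy(z)
            z = dynamics(torch.cat([z, a], dim=-1))      # model steps, not env steps
            imagined_return = imagined_return + reward_head(z).mean()
        pi_opt.zero_grad()
        (-imagined_return).backward()                    # maximize imagined return
        pi_opt.step()

The point is that phase 2 never touches the environment: every transition the policy learns from is generated by the learned dynamics model.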
If I were smarter, I would have predicted that not only would everyone else figure out that world models are a critical step, but that as a direct consequence the term "world model" would lose all meaning. Maybe next time. That said, LeCun's concept in the blog post is the only one worthy of the title.
The naming collision here is unfortunate since the two kinds of models described couldn't be any more different in purpose. Maybe JEPA-type world models should explicitly be called "predictive world models".
Control theory and cog-sci are impaired ideas. There is no mind; cog-sci is a post-hoc retrofit narrated onto brains rather than experience as integrated events. Cog-sci is words sportscasting synthetic categories.
LeCun's model will fail because the idea of a world model is oxymoronic: brains don't need them and the world isn't modeled. All models are wrong; the world is experienced instantaneously in optic flow, which is built atop olfaction.
https://www.eneuro.org/content/7/4/ENEURO.0069-20.2020
Any real AI that veers at control will have to adopt a neurobio path
https://tbrnewsmedia.com/sbus-sima-mofakham-chuck-mikell-des...
That's built, paradoxically, from unpredictability.
https://pubmed.ncbi.nlm.nih.gov/38579270/
Does Stevie Wonder not experience the world since he's blind and anosmic?
The first link is about how philosophy and psychology are used to describe brain-cognitive behavior research, which has limited explanatory capability compared to a hypothetical interpretation using its own vocabulary instead of terms borrowed from other fields.
The second link is about an AI that detects consciousness in coma patients.
The third link is about how coma is associated with a low-complexity and high-predictability passive cortical state. Kickstarting the brain to a high-complexity and low-predictability state of cortical dynamics is a sign of recovery back to consciousness.
How does any of this support what you have said?
> There is no mind
Interesting. What is your response to the cogito?
The canonical 20th-century response to "cogito ergo sum" is Wittgenstein's "private-language argument." https://plato.stanford.edu/entries/private-language/
When you thought to yourself, "I think therefore I am," in what language did you think it? In English? The English language is an artifact of a community of English speakers. You can't have a language with grammatical rules without a community of speakers to make that language.
Almost nobody in the English-speaking community has direct access to the internals of your mind. The community learns things through consensus, e.g. via the scientific method. We know things in English via a community of English-speaking scientists, journalists, historians, etc. etc. Wittgenstein calls these the "structures of life," the ordinary day-to-day work we do to figure out what's true and false, likely and unlikely.
As you're probably aware, the scientific method has long struggled to find a "mind" in the brain doing the thinking; all we can find are just atoms, molecules, neurons, doing things, having behaviors. We can't find "thoughts" in the atoms. As far as our ordinary day-to-day scientific method is concerned, we can't find a "mind."
But "cogito ergo sum" isn't part of the scientific method. We don't believe "cogito ergo sum" because reproducible experiments have shown it to be true. "Cogito ergo sum" proposes a way of knowing disconnected from the messy structures of life we use in English.
So, perhaps you'd say, "oh, good point, I suppose I didn't think 'cogito ergo sum' in English or Latin or whatever, I thought it in a private language known only to me. From this vantage point, I only have direct knowledge of my own existence and my own perceptions in the present moment (since the past is uncertain), but at least I can have 100% certainty of my own existence in that language."
The problem is, you really can't have a private language, not a language with words (terms) and grammatical rules and logical inferences.
Suppose you assigned a term S to a particular sensation you're having right now. What are the rules of S? What is S and what is not S? Are there any rules for how to use S? How would you know? How would you enforce those rules over time? In a private language, there's no difference between using the term S "correctly" or "incorrectly." There are no rules in a private language; there can't be. Even mathematical proofs are impossible when every term in the proof means anything you want.
Descartes didn't originally write "cogito ergo sum" in Latin. He originally published it in French, "je pense, donc je suis." But in Europe, where Descartes was writing, Latin was the universal language, the one known to all sorts of people across the continent. For Descartes, Latin was the language of empire, the language every civilized person knew because their ancestors were forced to learn it at the point of a sword, the language of absolutes.
Wittgenstein has a famous line, "Whereof one cannot speak, thereof one must be silent." So must we be silent about "cogito ergo sum." "cogito ergo sum" isn't valid in Latin; "je pense, donc je suis" isn't valid in French. It could only be valid in an unspeakable private language, a language with no grammatical rules, no logic, where true and false are indistinguishable. "Cogito ergo sum" could only be valid in an unusable language where everything is meaningless.
Thereof, we must remain silent.
All abstractions of reality are bound to fail, but some abstractions are more convincing (or indeed more useful) than others.
> Any real AI that veers at control will have to adopt a neurobio path
Maybe. Or maybe it's a useless distraction. Only time will tell what signals are meaningful.
Neuro is the experience integrating allo/egocentric. We've already crossed that threshold where vision depth meets allocortex behaviors in entertainment. I.e. there's more intelligence in The Shining than in anything in current folk-science AI/cog-sci. It's a resounding flop, and so will be the Gaussian and the psychobabble of LeCun's, as it is a psychological approach.
I think you place more meaning in "intelligence" than I do, and certainly more meaning in current models of the brain. We'll see.
The meaning is not external in the models; that's why they're collectively bunk. The meanings are task variables only; they're wordless and tied only to actions in events. We have a long way to go.
And the pendulum swings back toward representation. It is becoming clear that the LLM approach is not adequate to reach what John McCarthy called human-level intelligence:
> Between us and human-level intelligence lie many problems. They can be summarized as that of succeeding in the "common-sense informatic situation". [1]
And the search continues...
[1] https://www-formal.stanford.edu/jmc/human.pdf
I always felt like one of the reasons LLMs are so good is that they piggyback on the many years that have gone into developing language as an information representation/compression format. I don’t know if there’s anything similar a world model can take advantage of.
That being said there have been models which are pretty effective at other things that don’t use language, so maybe it’s a non issue.
I will gladly take $10B to find out for you.
With all due respect, AI is ultimately a capital game. World models aren’t where real B2B customer revenue comes from—at least compared to today’s LLMs; they’re mainly a better story for raising huge amounts of private capital. Hopefully they figure out how to build the next-gen AI architecture along the way.
By capital game, do you mean money investment game or market ruler's game?
> World models aren’t where real B2B customer revenue comes from
You could say the same thing about AGI. Ultimately capital will realize intelligence is a drawback.
The most useful models are image, video, and audio models. It makes sense that we'd make the video models more 4D aware.
Text really hogged all the attention. Media is where AI is really going to shine.
Some of the most profitable models right now are in music, image, and video generation. A lot of people are having a blast doing things they could legitimately never do before, and real working professionals are able to use the tools to get 1000x more done - perhaps providing a path to independence from bigger studios, and certainly more autonomy for those not born into nepotism.
As long as companies don't over-raise like OpenAI, there should be a smooth gradient from next gen media tools to revolutionary future stuff like immersive VR worlds that you can bend like the Matrix or Holodeck.
And I'll just be exceedingly chuffed if we get open source and highly capable world models from the Chinese that keep us within spitting distance of the unicorns.
>Some of the most profitable models right now are in music, image, and video generation.
I don’t think many of the companies running these make a profit right now
>> The most useful models are image, video, and audio models
This is wrong. The vast majority of revenue is being generated by text models because they are so useful.
That just sounds like text with extra steps.
Fundamentally, what AGI is trying to do is encode the ability to apply logic and reason. Tokens, images, video, and audio are all just information of varying entropy density: the output of that logical reasoning process, or of an emulation of it.
> Fundamentally, what AGI is trying to do is encode the ability to apply logic and reason.
No? The Wason selection task has shown that logic and reason are neither core nor essential to human cognition.
It's really verging on speculation, but see chapter 2 of Jaynes 1976 - in particular the section on spatialization and the features of consciousness.
AI might be the biggest transfer of wealth from the rich to the poor in history. Billions have been poured into closed-source models, which have led directly and indirectly to open-weight models being available to everyone.
At the cost of buying the poor's thoughts (training data)
Pretty similar to social media in a lot of ways. They've strip mined the commons and provided us a corporate controlled walled garden to compensate us for our loss.
Open weight models aren’t worth very much money to most people.
I played with Marble yesterday, Fei-Fei/World Labs' new product.
It is the most impressed I've been with an AI experience since the first time I saw a model one-shot material code.
Sure, it's an early product. The visual output reminds me a lot of early SDXL. But just look at what's happened to video in the last year and image in the last three. The same thing is going to happen here, and fast, and I see the vision for generative worlds for everything from gaming/media to education to RL/simulation.
I wasn't actually able to use it because the servers were overloaded. What exactly impressed you (or, more generally, what does it actually let you do at the moment)?
You give it a text prompt and optional image.
What you get is a 3D room based on the prompt/image. It rewrites your prompt to a specific format. Overall the rooms tend to be detailed and imaginative.
Then you can fly around the room like in Minecraft creative mode. Really looking forward to more editing features/infill to augment this.
Marble looks like HunyuanWorld to me, but this time it's marketed as a first step toward a world model, and it has multimodal capabilities.
Every time I see LeCun talk about world models, I can’t help but think it is also just a tweak on the fundamentals of what is behind current LLM technology. In the end it’s still neural networks. To me, having to “teach” the model how physics works makes me think it can’t be true AGI either.
I don’t know enough about this to be sure, but this feels like a white whale.
Human-level language was a white whale just a few years ago.
A.L.I.C.E. was published in '95.
A trillion dollars are now riding on that white whale. An entire naval fleet is being raised for the purposes of chasing down that whale. LeCun and Fei-Fei merely believe that the whale is in a different ocean.
I think video and agentic and multimodal models have led to this point, but actually making a world model may prove to be long and difficult.
I feel LeCun is correct that LLMs as they stand have limitations that call for an architectural overhaul. LLMs currently have a problem with context rot, and this would hamper an effective world model if the world disintegrates and becomes incoherent and hallucinated over time.
It's doubtful whether investors would be in it for the long haul, which may explain Sam Altman's behavior in seeking government support. The other approaches described in this article may be more investor-friendly, as there is a more immediate return from creating a 3D asset or a virtual simulation.
LeCun's talk at Harvard shows how far behind he is.
How so?
Whether or not this is exactly the same thing, I find this glossary entry from NVIDIA interesting: https://www.nvidia.com/en-us/glossary/world-models/
Because they are smart enough to realize current LLM tech is nearing a dead end and cannot serve as a full AGI, even ignoring context and hallucination issues, without actual knowledge of the real world.
Most world models so far are based on transformers, no?
Earlier: https://news.ycombinator.com/item?id=45914363
The LLM grift is burned up, so this is the next thing. It has just enough new magic tricks to wow the VCs who don't really get what's going on here. I think this comment from the article says it all:
“Taking images and turning them into 3D environments using gaussian splats, depth and inpainting. Cool, but that’s a 3D GS pipeline, not a robot brain.”
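For what it's worth, the "images plus depth into 3D" step that quote refers to is essentially monocular depth estimation followed by a pinhole backprojection; here is a minimal numpy sketch with made-up intrinsics and a random depth map standing in for a real depth model's output (the resulting points would then seed the gaussian splats):

    import numpy as np

    H, W = 480, 640
    fx = fy = 500.0            # assumed focal length in pixels
    cx, cy = W / 2.0, H / 2.0  # assumed principal point

    # Placeholder: a real pipeline would predict this with a depth model.
    depth = np.random.uniform(1.0, 5.0, size=(H, W))

    # Backproject every pixel (u, v, depth) into camera-space 3D coordinates.
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    points = np.stack([x, y, depth], axis=-1).reshape(-1, 3)  # (H*W, 3) point cloud
    print(points.shape)

Inpainting then fills the regions the original image never saw, which is why it's fair to call it a 3D reconstruction pipeline rather than a robot brain.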
One problem with VR and VFX is how expensive it is, in terms of man-hours, to create immersive worlds. This significantly reduces the cost, has applications in all sorts of ways, and could realistically improve the availability of content in VR and reduce movie production costs. And those are just the obvious applications (ignoring that these world models can be used to train AI itself).
It’s for the VCs who missed out early. Now’s their chance!