Just had a fantastic experience applying agentic coding to CAD. I needed to add some threads to a few blanks in a 3D print. I used computational geometry to give the agent a way to "feel" around the model: I had it convolve a sphere with the radius of the connector across the entire model. It was able to use this technique to find the precise positions of the existing ports and then add threads to them. It took a few tries to get right, but if I'd had the technique in mind beforehand it would have been very quick. The lesson for me is that the models need a way to feel. In the end, the implementation of the 3D model had to be written in code, where it's auditable. Perhaps if the agent were able to see images directly and perfectly, I never would have made this discovery.
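A minimal sketch of the sphere-probe idea, assuming a trimesh-style signed-distance query; the file name, connector radius, grid resolution, and clustering step are illustrative stand-ins rather than the exact code used:

```python
import numpy as np
import trimesh  # assumed available; any mesh library with a signed-distance query would do

CONNECTOR_RADIUS = 3.0            # hypothetical, in model units
mesh = trimesh.load("blank.stl")  # hypothetical part with ports to locate

# Sample a regular grid of probe centers over the part's bounding box.
lo, hi = mesh.bounds
axes = [np.linspace(lo[i], hi[i], 40) for i in range(3)]
probes = np.stack(np.meshgrid(*axes, indexing="ij"), axis=-1).reshape(-1, 3)

# trimesh convention: signed distance is positive inside the solid, negative outside.
sdf = trimesh.proximity.signed_distance(mesh, probes)
clearance = -sdf  # distance from each empty-space probe point to the nearest surface

# The probe sphere "feels" a port where it sits in empty space with the wall roughly
# one connector radius away, i.e. near the axis of a bore of about that diameter.
hits = probes[(sdf < 0) & (np.abs(clearance - CONNECTOR_RADIUS) < 0.3)]

# Naive clustering: collapse nearby hits into one candidate center per port.
# (Real use would also filter out hits hovering near the outer skin of the part.)
ports = []
for p in hits:
    if all(np.linalg.norm(p - q) > 2 * CONNECTOR_RADIUS for q in ports):
        ports.append(p)
print(np.array(ports))
```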
From reading that, I'm not quite sure if they have anything figured out.
I actually agree, but her notes are mostly fluff with no real info in there, and I do wonder if they have anything figured out besides "collect spatial data" like ImageNet.
There are actually a lot of people trying to figure out spatial intelligence, but those groups are usually in neuroscience or computational neuroscience.
Here is a summary paper I wrote discussing how the entorhinal cortex, grid cells, and coordinate transformation may be the key: https://arxiv.org/abs/2210.12068 All animals are able to transform coordinates in real time to navigate their world and humans have the most coordinate representations of any known living animal. I believe human level intelligence is knowing when and how to transform these coordinate systems to extract useful information.
I wrote this before the huge LLM explosion and I still personally believe it is the path forward.
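As a purely illustrative example of the kind of real-time coordinate transformation being described, here is a minimal egocentric-to-allocentric conversion; the scenario and numbers are made up, and this is not code from the linked paper:

```python
import numpy as np

def ego_to_world(obs_ego, position, heading_rad):
    """Rotate an egocentric (forward, left) offset by the heading, then translate to world coordinates."""
    c, s = np.cos(heading_rad), np.sin(heading_rad)
    rotation = np.array([[c, -s],
                         [s,  c]])
    return position + rotation @ obs_ego

landmark_ego = np.array([2.0, 1.0])   # "2 m ahead of me, 1 m to my left"
my_position = np.array([10.0, 3.0])   # where I am on the world map
my_heading = np.deg2rad(90)           # facing "north" (+y)

print(ego_to_world(landmark_ego, my_position, my_heading))  # -> [9. 5.]
```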
> Here is a summary paper I wrote discussing how the entorhinal cortex, grid cells, and coordinate transformation may be the key: https://arxiv.org/abs/2210.12068 All animals are able to transform coordinates in real time to navigate their world and humans have the most coordinate representations of any known living animal. I believe human level intelligence is knowing when and how to transform these coordinate systems to extract useful information.
Yes, you and the Mosers who won the Nobel Prize all believe that grid cells are the key to how animals understand their position in the world.
https://www.nobelprize.org/prizes/medicine/2014/press-releas...
It's not enough by a long shot. Placement isn't directly related to vicarious trial and error, path integration, or sequence generation.
There's a whole giant gap between grid cells and intelligence.
This is super cool and I want to read up more on this, as I think you are right insofar as it is the basis for reasoning. However, it does seem more complex than just that. So how do we go from coordinate system transformations to abstract reasoning with symbolic representations?
There is research showing that the grid cells also represent abstract reasoning: https://pmc.ncbi.nlm.nih.gov/articles/PMC5248972/
Deep Mind also did a paper with grid cells a while ago: https://deepmind.google/blog/navigating-with-grid-like-repre...
This is essentially a simulation system for operating on narrowly constrained virtual worlds. It is pretty well-understood that these don't translate to learning non-trivial dynamics in the physical world, which is where most of the interesting applications are.
While virtual world systems and physical world systems look similar based on description, a bit like chemistry and chemical engineering, they are largely unrelated problems with limited theory overlap. A virtual world model is essentially a special trivial case that becomes tractable because it defines away most of the hard computer science problems in physical world models.
A good argument could be made that spatial intelligence is a critical frontier for AI; many open problems are reducible to it. I don't see any evidence that this company is positioned to make material progress on it.
Genie 3 (at a prototype level) achieves the goal she describes: a controllable world model with consistency and realistic physics. Its sibling Veo 3 even demonstrates some [spatial problem-solving ability](https://video-zero-shot.github.io/). Genie and Veo are definitely closer to her vision than anything World Labs has released publicly.
However, she does not mention Google's models at all. This omission makes the blog feel very much like an ad for her company rather than a good-faith guide for the field.
I think I perceive a massive bottleneck. Today's incarnation of AI learns from the web, not from the interaction with the humans it talks to. And for sure there is a lot of value there; it is just pointless to see that interaction lost a few hundred or a few thousand words of context later. For humans, their 'context' is their life and total memory capacity; that's why we learn from interacting with other, more experienced humans. It is always a two-way street. But with AI as it is, it is a one-way street, which means that your interaction and your endless corrections when it gets stuff wrong (again) are lost. Allowing for a massive personalized context would go a long way towards improving the value here; at least then you would - hopefully - only have to make the same correction once.
Holy marketing
I do wonder if this will meaningfully move the needle on agent assistants (coding, marketing, schedule my vacation, etc...) considering how much more compute (I would imagine) is needed for video / immersive environments during training and inference
I suspect the calculus is more favorable for robotics
Her company, World Labs, is at the forefront of building spatial intelligence models.
So she says
Also good context here is Friston’s Free Energy Principle: A unified theory suggesting that all living systems, from simple organisms to the brain, must minimize "surprise" to maintain their form and survive. To do this, systems act to minimize a mathematical quantity called variational free energy, which is an upper bound on surprise. This involves constantly making predictions about the world, updating internal models based on sensory data, and taking actions that reduce the difference between predictions and reality, effectively minimizing prediction errors.
Key distinction: Constant and continuous updating. I.e. feedback loops with observation, prediction, action (agency), and once more, observation.
It should have survival and preservation as a fundamental architectural feature.
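A toy sketch of that observe/predict/act loop, under heavy simplifying assumptions (a single scalar hidden state and an identity generative model); this illustrates the feedback-loop idea rather than Friston's actual variational formulation:

```python
import numpy as np

rng = np.random.default_rng(0)
true_state = 5.0      # hidden cause out in the "world" (hypothetical scalar)
belief = 0.0          # the agent's internal estimate of that cause
lr_perception = 0.2   # how quickly beliefs are revised from prediction errors
lr_action = 0.1       # how strongly the agent acts back on the world

for step in range(50):
    observation = true_state + rng.normal(scale=0.1)  # noisy sensory sample
    prediction = belief                                # generative model: identity mapping
    error = observation - prediction                   # prediction error, the "surprise" proxy

    belief += lr_perception * error   # perception: change the model to fit the world
    true_state -= lr_action * error   # action: change the world to fit the model

print(belief, true_state)  # the two converge and prediction error shrinks
```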
Personally, I think the direction AI will go is towards an AI brain with something like an LLM at its core, augmented with various abilities like spatial intelligence, rather than models designed with spatial reasoning at their core. Human language and reasoning seems flexible enough to form some kind of spatial understanding, but I'm not so sure about the converse of having spatial intelligence derive human reasoning. Similar to how image generation models have struggled with generating the right number of fingers on hands, I would expect a world model designed to model physical space not to generalize to an understanding of simple human ideas.
> Human language and reasoning seems flexible enough to form some kind of spatial understanding, but I'm not so sure about the converse of having spatial intelligence derive human reasoning
I believe the null hypothesis would be that a model natively understanding both would work best/come closest to human intelligence (and possibly other modalities are also needed).
Also, as a complete layman: our language having so many interconnections with spatial concepts would also point towards a multi-modal intelligence (topic: place; subject: lying under or near; respect/prospect: looking back/ahead; etc.). In my understanding, these connections only make their way into LLMs' representations secondarily.
There's a difference between what a model is trained on and the inductive biases it uses to generalize. It isn't as simple as training natively on everything. All existing models have certain things they generalize well and certain things they don't, due to their architecture, and the architectures of the world models I've seen don't seem as capable of generalizing universally as LLMs.
Sutton: Reinforcement Learning
LeCun: Energy Based Self-Supervised Learning
Chollet: Program Synthesis
Fei-Fei: ???
Are there any others with hot takes on the future architectures and techniques needed for A-not-quite-G-I?
> Fei-Fei: ???
Underrated and unsung. Fei-Fei Li first launched ImageNet way back in 2007, a hugely influential move that sparked much of the computer vision deep learning that has followed since. I remember jph00 saying in a lecture about 7 years ago that "text is just waiting for its ImageNet moment" -> then came the GPT explosion. Fei-Fei was massively instrumental in where we are today.
Curating a dataset is vastly different from introducing a new architectural approach. ImageNet is a database. It's not like inventing the convolutions for CNNs, or the LSTM, or the Transformer.
CNNs and Transformers are both really simple and intuitive so I don't think there is any stroke of genius in how they were devised.
Their success is due to datasets and the tooling that allowed models to be trained on large amounts of data, sufficiently fast using GPU clusters.
Exactly right. Neatly said by the author in the linked article.
> I spent years building ImageNet, the first large-scale visual learning and benchmarking dataset and one of three key elements enabling the birth of modern AI, along with neural network algorithms and modern compute like graphics processing units (GPUs).
Datasets + NNs + GPUs. Three "vastly different" advances that came together. ImageNet was THE dataset.
I'd imagine Tesla's and Waymo's AI are at the forefront of spatial cognition... this is what has made me hesitant to dismiss the AI hype as a bubble. Once spatial cognition is solved to the extent that language has been solved, a range of applications currently unavailable will drive a tidal wave of compute demand. Beyond self-driving, think fully autonomous drone swarms... Militaries around the world certainly are thinking about it, and they're salivating.
The automotive AIs are narrow pseudo-spatial models that are good at extracting spatial features from the environment to feed fairly simple non-spatial models. They don't really reason spatially in the same sense that an animal does. A tremendous amount of human cognitive effort goes into updating the maps that these systems rely on.
Help me understand - my mental model of how automotive AIs work is that they use neural nets to process visual information and output a decision on where to move in relation to the objects in the world around them. Yes, they are moving in a constrained 2D space, but is that not fundamentally what animals do?
What you're describing is what's known as an "end to end" model that takes in image pixels and outputs steering and throttle commands. What actually happens in an AV is that a bunch of ML models produce input for software written by human engineers, so the output doesn't come from an entirely ML system; it's a mix of engineered and trained components for various identifiable tasks (perception, planning, prediction, controls).
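A schematic sketch of that modular split, with made-up components and thresholds; it is not any vendor's actual stack, just an illustration of learned perception/prediction feeding hand-engineered planning and control instead of one pixels-to-steering network:

```python
from dataclasses import dataclass

@dataclass
class Detection:
    kind: str        # "car", "pedestrian", ...
    position: tuple  # (x, y) in the ego frame, meters
    velocity: tuple  # (vx, vy), meters/second

def perceive(camera_frame) -> list[Detection]:
    # Learned component: in a real stack this would be a neural detector.
    return [Detection("car", (20.0, 0.0), (-2.0, 0.0))]

def predict(detections: list[Detection], horizon_s: float = 2.0) -> list[Detection]:
    # Often also learned: roll detected objects forward in time.
    return [Detection(d.kind,
                      (d.position[0] + d.velocity[0] * horizon_s,
                       d.position[1] + d.velocity[1] * horizon_s),
                      d.velocity) for d in detections]

def plan(predictions: list[Detection]) -> float:
    # Hand-engineered, rule-based planning: slow down if anything ends up close ahead.
    closest = min((p.position[0] for p in predictions), default=float("inf"))
    return 5.0 if closest < 10.0 else 15.0   # target speed, m/s

def control(target_speed: float, current_speed: float) -> float:
    # Hand-engineered controller: simple proportional throttle/brake command.
    return 0.1 * (target_speed - current_speed)

detections = perceive(camera_frame=None)  # placeholder input
print(control(plan(predict(detections)), current_speed=12.0))
```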
Tesla's problems with their multi-camera, non-lidar system are precisely because they don't have any spatial cognition.
100% agree, but not just military. Self-driving vehicles will become the norm, plus robots to mow the lawn and clean the house, and eventually humanoids that can interact like LLMs and help out around the house.
Spatial cognition really means "autonomous robot," and nobody thinks Tesla or Waymo have the most advanced robots.
"Invest in my startup"
Before the music stops
Far too much marketing speak, far too little math or theory, and it completely misses the mark on the 'next frontier'. Maybe four years ago spatial reasoning was the problem to solve, but by 2022 it was solved. All that remained was scaling up. The actual three next problems to solve (in order of when they will be solved) are:
- Reinforcement Learning (2026)
- General Intelligence (2027)
- Continual Learning (2028)
EDIT: lol, funny how the idiots downvote
Combinatorial search is also a solved problem. We just need a couple of Universes to scale it up.
If there isn't a path humans know how to take with their current technology, it isn't a solved problem. That's much different from people training an image model for research purposes and knowing that $100M in compute is probably enough for a basic video model.
Haven't RLHF, and RL with LLM feedback, been around for years now?
Large latent flow models are unbiased. On the other hand, if you purely use policy optimization, RLHF will be biased towards short horizons. If you add in a value network, the value has some bias (e.g. MSE loss on the value --> Gaussian bias). Also, most RL has some adversarial loss (how do you train your preference network?), which makes the loss landscape fractal which SGD smooths incorrectly. So, basically, there are a lot of biases that show up in RL training, which can make it both hard to train and, even if successful, not necessarily optimizing what you want.
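On the "MSE loss on the value --> Gaussian bias" point: minimizing squared error is equivalent, up to constants, to maximum likelihood under a fixed-variance Gaussian noise model, which is presumably the bias being referred to. A tiny numerical check with made-up data:

```python
import numpy as np

rng = np.random.default_rng(0)
targets = rng.normal(size=100)   # hypothetical value targets
preds = rng.normal(size=100)     # hypothetical value-network outputs
sigma = 1.0                      # the implicit, fixed noise scale

mse = np.mean((preds - targets) ** 2)
gauss_nll = np.mean(0.5 * ((preds - targets) / sigma) ** 2
                    + 0.5 * np.log(2 * np.pi * sigma ** 2))

# gauss_nll == 0.5 * mse + constant, so both objectives share the same minimizer.
print(mse, gauss_nll, gauss_nll - 0.5 * mse)  # the last term is constant in preds
```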
We might not even need RL, as DPO has shown.
> if you purely use policy optimization, RLHF will be biased towards short horizons
> most RL has some adversarial loss (how do you train your preference network?), which makes the loss landscape fractal which SGD smooths incorrectly
What do you consider "General Intelligence" to be?
A good start would be:
1. Robust to adversarial attacks (e.g. in classification models or LLM steering).
2. Solving ARC-AGI.
Current models are optimized to solve the problem they're currently presented with, not really to find the most general problem-solving techniques.
I like to think I'm generally intelligent, but I am not robust to adversarial attacks.
Edit: I'm trying arc-agi tests now and it's looking bad for me: https://arcprize.org/play?task=e3721c99
In my thinking, what AI lacks is a memory system.
That has been solved with RAG, OCR-ish image encoding (DeepSeek, recently), and just long context windows in general.
Not really. For example, we still can't get coding agents to work reliably, and I think it's a memory problem, not a capabilities problem.
I enjoy Fei-Fei Li's communication style. It's straight and to the point in a way that I find very easy to parse. She's one of my primary idols in the AI space these days.