Spatial intelligence is AI’s next frontier

(drfeifei.substack.com)

83 points | by mkirchner 2 hours ago

45 comments

  • inciampati 2 minutes ago

    Just had a fantastic experience applying agentic coding to CAD. I needed to add some threads to a few blanks in a 3D print. I used computational geometry to give the agent a way to "feel" around the model: I had it convolve a sphere of the connector's radius across the entire model. It was able to use this technique to find the precise positions of the existing ports and then add threads to them. It took a few tries to get right, but had I known the technique beforehand it would have been very quick. The lesson for me is that the models need a way to feel. In the end, the implementation of the 3D model had to be written in code, where it's auditable. Perhaps if the agent were able to see images directly and perfectly, I never would have made this discovery.
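
    For the curious, the probing trick boiled down to something like the sketch below. This is a minimal reconstruction assuming a trimesh-style workflow; the file name, connector radius, grid resolution, and tolerance are all placeholders, not my actual session:

      import numpy as np
      import trimesh  # pip install trimesh

      mesh = trimesh.load("model.stl")   # placeholder file name
      r = 2.5                            # connector radius, in model units

      # Coarse grid of candidate sphere centers over the bounding box.
      lo, hi = mesh.bounds
      axes = [np.linspace(lo[i], hi[i], 40) for i in range(3)]
      grid = np.stack(np.meshgrid(*axes, indexing="ij"), axis=-1).reshape(-1, 3)

      # Signed distance to the surface: positive inside the solid, negative outside.
      sd = trimesh.proximity.signed_distance(mesh, grid)

      # The sphere "feels" a port where it sits in empty space (sd < 0) and the
      # nearest wall is almost exactly r away, i.e. it touches the cavity sides.
      # Points hovering r above flat faces also pass and need filtering out --
      # part of why it took a few tries.
      fits = (sd < 0) & (np.abs(-sd - r) < 0.1)
      print(grid[fits])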

  • toisanji an hour ago

    From reading that, I'm not quite sure they have anything figured out. I actually agree, but her notes are mostly fluff with no real info in there, and I do wonder if they have anything figured out besides "collect spatial data" like ImageNet did.

    There are actually a lot of people trying to figure out spatial intelligence, but those groups are usually in neuroscience or computational neuroscience. Here is a summary paper I wrote discussing how the entorhinal cortex, grid cells, and coordinate transformation may be the key: https://arxiv.org/abs/2210.12068

    All animals are able to transform coordinates in real time to navigate their world, and humans have the most coordinate representations of any known living animal. I believe human-level intelligence is knowing when and how to transform these coordinate systems to extract useful information. I wrote this before the huge LLM explosion, and I still personally believe it is the path forward.
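
    As a toy illustration of the kind of transformation I mean (my example here, not code from the paper): converting an egocentric observation into allocentric world coordinates, given the agent's pose.

      import numpy as np

      def ego_to_allo(agent_pos, agent_heading, ego_point):
          # Rotate an egocentric point (x forward, y left) by the agent's
          # heading, then translate by its position in the world frame.
          c, s = np.cos(agent_heading), np.sin(agent_heading)
          R = np.array([[c, -s], [s, c]])
          return agent_pos + R @ ego_point

      # A landmark seen 3 m ahead and 1 m to the left of an agent standing
      # at (5, 2) and facing "north" (heading pi/2):
      print(ego_to_allo(np.array([5.0, 2.0]), np.pi / 2, np.array([3.0, 1.0])))
      # -> [4. 5.], the same landmark in allocentric world coordinates.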

    • bonsai_spool an hour ago

      > Here is a summary paper I wrote discussing how the entorhinal cortex, grid cells, and coordinate transformation may be the key: https://arxiv.org/abs/2210.12068 All animals are able to transform coordinates in real time to navigate their world and humans have the most coordinate representations of any known living animal. I believe human level intelligence is knowing when and how to transform these coordinate systems to extract useful information.

      Yes, you and the Mosers, who won the Nobel Prize, all believe that grid cells are the key to animals' understanding of their position in the world.

      https://www.nobelprize.org/prizes/medicine/2014/press-releas...

      • Marshferm 31 minutes ago

        It's not enough by a long shot. Placement isn't directly related to vicarious trial and error, path integration, or sequence generation.

        There's a whole giant gap between grid cells and intelligence.

    • byearthithatius an hour ago

      This is super cool and I want to read up more on it, as I think you are right insofar as it is the basis for reasoning. However, it does seem more complex than just that: how do we go from coordinate-system transformations to abstract reasoning with symbolic representations?

  • jandrewrogers an hour ago

    This is essentially a simulation system for operating on narrowly constrained virtual worlds. It is pretty well understood that these don't translate to learning non-trivial dynamics in the physical world, which is where most of the interesting applications are.

    While virtual world systems and physical world systems look similar based on description, a bit like chemistry and chemical engineering, they are largely unrelated problems with limited theory overlap. A virtual world model is essentially a special trivial case that becomes tractable because it defines away most of the hard computer science problems in physical world models.

    A good argument could be made that spatial intelligence is a critical frontier for AI; many open problems are reducible to it. But I don't see any evidence that this company is positioned to make material progress on it.

  • in-silico 7 minutes ago

    Genie 3 (at a prototype level) achieves the goal she describes: a controllable world model with consistency and realistic physics. Its sibling Veo 3 even demonstrates some [spatial problem-solving ability](https://video-zero-shot.github.io/). Genie and Veo are definitely closer to her vision than anything World Labs has released publicly.

    However, she does not mention Google's models at all. This omission makes the blog feel very much like an ad for her company rather than a good-faith guide for the field.

  • jacquesm 19 minutes ago

    I think I perceive a massive bottleneck. Today's incarnation of AI learns from the web, not from its interactions with the humans it talks to. And for sure there is a lot of value there; it is just pointless for that interaction to be lost a few hundred or a few thousand words of context later. For humans, their 'context' is their life and total memory capacity; that's why we learn from interaction with other, more experienced humans. It is always a two-way street. But with AI as it is, it is a one-way street, one where your interaction and your endless corrections when it gets stuff wrong (again) are lost. Allowing for a personalized massive context would go a long way towards improving the value here; at least that way you would, hopefully, only have to make the same correction once.

  • ajb117 an hour ago

    Holy marketing

  • verdverm 2 hours ago

    I do wonder if this will meaningfully move the needle on agent assistants (coding, marketing, schedule-my-vacation, etc.), considering how much more compute I would imagine is needed for video / immersive environments during training and inference.

    I suspect the calculus is more favorable for robotics

  • htrp an hour ago

    Her company, World Labs, is at the forefront of building spatial intelligence models.

  • inshard 34 minutes ago

    Also good context here is Friston’s Free Energy Principle: A unified theory suggesting that all living systems, from simple organisms to the brain, must minimize "surprise" to maintain their form and survive. To do this, systems act to minimize a mathematical quantity called variational free energy, which is an upper bound on surprise. This involves constantly making predictions about the world, updating internal models based on sensory data, and taking actions that reduce the difference between predictions and reality, effectively minimizing prediction errors.

    Key distinction: constant and continuous updating, i.e. feedback loops of observation, prediction, action (agency), and then observation again.

    Such a system should have survival and self-preservation as a fundamental architectural feature.
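
    A toy caricature of that loop, not Friston's variational formalism (all numbers here are made up for illustration): keep a belief about a hidden state, predict observations from it, and update the belief to shrink precision-weighted prediction error.

      import numpy as np

      rng = np.random.default_rng(0)
      true_state = 3.0   # hidden cause in the world
      mu = 0.0           # the agent's internal belief
      lr = 0.1           # update rate
      precision = 1.0    # inverse variance of sensory noise

      for _ in range(50):
          obs = true_state + rng.normal(0.0, 1.0)  # noisy sensory sample
          err = obs - mu                           # prediction error ("surprise" proxy)
          mu += lr * precision * err               # update belief to reduce future error

      print(f"belief after 50 observations: {mu:.2f} (true state {true_state})")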

  • alyxya an hour ago

    Personally, I think the direction AI will go is towards an AI brain with something like an LLM at its core, augmented with various abilities like spatial intelligence, rather than models designed with spatial reasoning at their core. Human language and reasoning seem flexible enough to form some kind of spatial understanding, but I'm not so sure about the converse: spatial intelligence deriving human reasoning. Similar to how image generation models have struggled with generating the right number of fingers on hands, I would expect a world model designed to model physical space not to generalize to an understanding of simple human ideas.

    • gf000 an hour ago

      > Human language and reasoning seems flexible enough to form some kind of spatial understanding, but I'm not so sure about the converse of having spatial intelligence derive human reasoning

      I believe the null hypothesis would be that a model natively understanding both would work best / come closest to human intelligence (and possibly other modalities are needed as well).

      Also, as a complete layman: our language having several interconnections with spatial concepts would also point towards a multi-modal intelligence ('topic' from place, 'subject' from lying under or near, 'respect'/'prospect' from looking back/ahead, etc.). In my understanding, these connections only secondarily make their way into LLMs' representations.

      • alyxya 7 minutes ago

        There's a difference between what a model is trained on and the inductive biases it uses to generalize. It isn't as simple as training natively on everything. All existing models generalize well on certain things and poorly on others due to their architecture, and the world-model architectures I've seen don't seem as capable of generalizing universally as LLMs do.

  • dauertewigkeit an hour ago

    Sutton: Reinforcement Learning

    LeCun: Energy Based Self-Supervised Learning

    Chollet: Program Synthesis

    Fei-Fei: ???

    Are there any others with hot takes on the future architectures and techniques needed for A-not-quite-G-I?

    • yzydserd an hour ago

      > Fei-Fei: ???

      Underrated and unsung. Fei-Fei Li first launched ImageNet way back in 2007, a hugely influential move that sparked much of the computer-vision deep learning that followed. I remember jph00 saying in a lecture about 7 years ago that "text is just waiting for its ImageNet moment" -> then came the GPT explosion. Fei-Fei was massively instrumental in where we are today.

      • byearthithatius an hour ago

        Curating a dataset is vastly different from introducing a new architectural approach. ImageNet is a database. It's not like inventing convolutions for CNNs, or the LSTM, or the Transformer.

        • dauertewigkeit 43 minutes ago

          CNNs and Transformers are both really simple and intuitive, so I don't think there was any stroke of genius in how they were devised.

          Their success is due to datasets and the tooling that allowed models to be trained on large amounts of data, sufficiently fast using GPU clusters.

          • yzydserd 40 minutes ago

            Exactly right. Neatly said by the author in the linked article.

            > I spent years building ImageNet, the first large-scale visual learning and benchmarking dataset and one of three key elements enabling the birth of modern AI, along with neural network algorithms and modern compute like graphics processing units (GPUs).

            Datasets + NNs + GPUs. Three "vastly different" advances that came together. ImageNet was THE dataset.

  • gradus_ad an hour ago

    I'd imagine Tesla's and Waymo's AIs are at the forefront of spatial cognition... this is what has made me hesitant to dismiss the AI hype as a bubble. Once spatial cognition is solved to the extent that language has been, a range of currently unavailable applications will drive a tidal wave of compute demand. Beyond self-driving, think fully autonomous drone swarms... Militaries around the world certainly are, and they're salivating.

    • jandrewrogers an hour ago

      The automotive AIs are narrow pseudo-spatial models that are good at extracting spatial features from the environment to feed fairly simple non-spatial models. They don't really reason spatially in the same sense that an animal does. A tremendous amount of human cognitive effort goes into updating the maps that these systems rely on.

      • gradus_ad 43 minutes ago

        Help me understand. My mental model of how automotive AIs work is that they use neural nets to process visual information and output a decision about where to move in relation to objects in the world around them. Yes, they are moving in a constrained 2D space, but is that not fundamentally what animals do?

        • abstractanimal 9 minutes ago

          What you're describing is what's known as an "end-to-end" model, which takes in image pixels and outputs steering and throttle commands. What actually happens in an AV is that a bunch of ML models produce input for software written by human engineers, so the output doesn't come from an entirely ML system; it's a mix of engineered and trained components for various identifiable tasks (perception, planning, prediction, controls).
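
          Schematically, something like this (structure only; every name is an illustrative placeholder, not any vendor's actual code):

            def perceive(frames):            # trained: objects, lanes, free space
                return {"objects": [], "lanes": []}

            def predict(scene):              # trained: other agents' future paths
                return []

            def plan(scene, predictions):    # largely engineered: pick a trajectory
                return [(0.0, 0.0)]

            def control(trajectory):         # engineered: steering/throttle commands
                return {"steer": 0.0, "throttle": 0.0}

            def drive_step(frames):
                scene = perceive(frames)
                return control(plan(scene, predict(scene)))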

    • dauertewigkeit an hour ago

      Tesla's problems with its multi-camera, non-lidar system exist precisely because it doesn't have any spatial cognition.

    • byearthithatius an hour ago

      100% agree, but it's not just military. Self-driving vehicles will become the norm, along with robots to mow the lawn and clean the house, and eventually humanoids that can interact like LLMs and function as helpers around the house.

    • pharrington an hour ago

      Spatial cognition really means "autonomous robot," and nobody thinks Tesla or Waymo have the most advanced robots.

  • nothrowaways 2 hours ago

    "Invest in my startup"

    • ares623 2 hours ago

      Before the music stops

  • programjames an hour ago

    Far too much marketing speak, far too little math or theory, and it completely misses the mark on the 'next frontier'. Maybe four years ago spatial reasoning was the problem to solve, but by 2022 it was solved; all that remained was scaling up. The actual next three problems to solve (in order of when they will be solved) are:

    - Reinforcement Learning (2026)

    - General Intelligence (2027)

    - Continual Learning (2028)

    EDIT: lol, funny how the idiots downvote

    • whatever1 an hour ago

      Combinatorial search is also a solved problem. We just need a couple of Universes to scale it up.

      • programjames an hour ago

        If there isn't a path humans know how to take with their current technology, it isn't a solved problem. That's very different from training an image model for research purposes and knowing that $100M in compute is probably enough for a basic video model.

    • 7moritz7 an hour ago

      Haven't RLHF, and RL with LLM feedback, been around for years now?

      • programjames an hour ago

        Large latent flow models are unbiased. On the other hand, if you purely use policy optimization, RLHF will be biased towards short horizons. If you add in a value network, the value has some bias (e.g. MSE loss on the value -> Gaussian bias). Also, most RL has some adversarial loss (how do you train your preference network?), which makes the loss landscape fractal, and SGD smooths it incorrectly. So, basically, a lot of biases show up in RL training that can make it hard to train, and even when it succeeds, it isn't necessarily optimizing what you want.
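
        The short-horizon bias in particular is easy to see with toy numbers (my illustration, not tied to any specific RLHF setup):

          # With a discount factor, a policy-gradient objective weights a reward
          # at step t by gamma**t, so long-horizon rewards barely move the gradient.
          gamma = 0.99
          for t in (1, 10, 100, 1000):
              print(t, gamma ** t)
          # 1 0.99, 10 ~0.90, 100 ~0.37, 1000 ~4e-5: rewards a thousand steps
          # out are effectively invisible to the optimizer.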

        • storus an hour ago

          We might not even need RL, as DPO has shown.

          • programjames 4 minutes ago

            > if you purely use policy optimization, RLHF will be biased towards short horizons

            > most RL has some adversarial loss (how do you train your preference network?), which makes the loss landscape fractal which SGD smooths incorrectly

    • l9o an hour ago

      What do you consider "General Intelligence" to be?

      • programjames an hour ago

        A good start would be:

        1. Robustness to adversarial attacks (e.g. in classification models or LLM steering).

        2. Solving ARC-AGI.

        Current models are optimized to solve the problem they're presented with, not to find the most general problem-solving techniques.

    • koakuma-chan an hour ago

      In my thinking, what AI lacks is a memory system.

      • 7moritz7 an hour ago

        That has been solved with RAG, OCR-ish image encoding (DeepSeek, recently), and just long context windows in general.
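
        For concreteness, the retrieval step of RAG boils down to something like this sketch, with toy vectors standing in for a real embedding model (memories, vectors, and the query are all made up for illustration):

          import numpy as np

          # Toy memory store: text -> embedding.
          memory = {
              "user prefers tabs over spaces": np.array([0.9, 0.1, 0.0]),
              "project targets Python 3.12":   np.array([0.1, 0.9, 0.1]),
              "deploys run on AWS Lambda":     np.array([0.0, 0.2, 0.9]),
          }

          def retrieve(query_vec, k=1):
              # Rank memories by cosine similarity and return the top-k texts.
              def cos(v):
                  return np.dot(v, query_vec) / (np.linalg.norm(v) * np.linalg.norm(query_vec))
              return [t for t, v in sorted(memory.items(), key=lambda kv: -cos(kv[1]))][:k]

          # A query vector "near" the formatting memory pulls that memory back
          # into the model's context window:
          print(retrieve(np.array([0.8, 0.2, 0.1])))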

        • koakuma-chan an hour ago

          Not really. For example we still can’t get coding agents to work reliably, and I think it’s a memory problem, not a capabilities problem.

  • frenchie4111 25 minutes ago

    I enjoy Fei-Fei Li's communication style. It's straight and to the point in a way that I find very easy to parse. She's one of my primary idols in the AI space these days.