Marble: A Multimodal World Model

(worldlabs.ai)

85 points | by meetpateltech 3 hours ago

12 comments

  • msteffen 26 minutes ago

    I understand that DeepMind is working on this too: https://deepmind.google/blog/genie-3-a-new-frontier-for-worl...

    I wonder how their approaches and results compare?

  • abixb 19 minutes ago

    As someone with a bare-bones understanding of "world models," how does this differ from sophisticated game engines that generate three-dimensional worlds? Is it simply the adaptation of the transformer architecture to generating the 3-D world vs. using a static/predictable script as in game engines (learned dynamics vs. deterministic simulation mimicking 'generation')? Would love an explanation from SMEs.

    • mountainriver 16 minutes ago

      The model is predicting what the state of the world would look like after a given action.

      Along with entertainment, they can be used for simulation training for robots, and they allow for imagining potential trajectories.

      • ghayes 10 minutes ago

        Whenever I see these and play with models like this (and the demos on this page), the movement in the world always feels like a dolly zoom [0]. Things in the distance tend to stay in the distance, even as the camera moves in that direction, and only the local area changes features.

        [0] https://en.wikipedia.org/wiki/Dolly_zoom

      • abixb 12 minutes ago

        Interesting. Given how LLMs can also "compute" the state after an action they might perform (once you feed them all the variables), are world models just a more visually heavy iteration of an LLM? Can Tesla's FSD (which uses a transformer architecture) be considered a world model, given that it literally uses the IRL world (roads) as its input?

      • echelon 9 minutes ago

        Marble is not that type of world model. It generates static Gaussian Splat assets that you can render using 3D libraries.

    • echelon 10 minutes ago

      This "world model" is Image to Gaussian Splat. The output is a static asset that a web-based Gaussian Splat viewer then renders.

      Other "world models" are Image + (keyboard input) to Video or Streaming Images, which effectively function like a game engine / video hybrid.
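
      A toy sketch of the two interfaces contrasted above, assuming nothing about how Marble or any other product is actually implemented. Every name here (`image_to_scene`, `render`, `InteractiveModel`) is hypothetical, and the "dynamics" are placeholders chosen only to make the difference in call patterns visible.

```python
from dataclasses import dataclass
from typing import List, Tuple

# All names are hypothetical illustrations, not any real product's API.


@dataclass(frozen=True)
class Splat:
    position: Tuple[float, float, float]
    color: Tuple[float, float, float]


@dataclass(frozen=True)
class StaticScene:
    splats: Tuple[Splat, ...]


def image_to_scene(image: List[List[float]]) -> StaticScene:
    """Stand-in for an image-to-splat generator: runs once, returns a fixed asset."""
    splats = tuple(
        Splat(position=(float(x), float(y), 0.0), color=(v, v, v))
        for y, row in enumerate(image)
        for x, v in enumerate(row)
    )
    return StaticScene(splats)


def render(scene: StaticScene, camera_x: float) -> List[float]:
    """Stand-in for a splat viewer: moving the camera re-projects the same
    asset; no generative model is called and the scene is never modified."""
    return [s.position[0] - camera_x for s in scene.splats]


@dataclass
class InteractiveModel:
    """Stand-in for an action-conditioned model: every step is a fresh
    prediction of the next state from the current state and an action."""
    state: List[float]

    def step(self, action: float) -> List[float]:
        # Toy placeholder for learned dynamics.
        self.state = [v + action for v in self.state]
        return self.state
```

      In the first pattern the generator is invoked once and all later interaction is ordinary rendering of a frozen asset; in the second, the model sits inside the interaction loop and is queried on every input.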

  • girfan 34 minutes ago

    This seems very interesting. Timely, given that Yann LeCun's vision also seems to align with world models being the next frontier: https://news.ycombinator.com/item?id=45897271

    • lofties 31 minutes ago

      An established founder makes claims X is the new frontier. X receives hundreds of millions in funding. Other less established founders claim they are working on X too. VCs suffering from terminal FOMO pump billions more into X. X becomes the next frontier. The previous frontiers are promptly forgotten about.

  • keyle an hour ago

    I'm floored. Incredible work.

    Also check out their interactive examples on the webapp. They're a bit rougher around the edges but show real user input/output. Arguably such examples could be pushed further toward better-quality output.

    e.g. https://marble.worldlabs.ai/world/b75af78a-b040-4415-9f42-6d...

    e.g. https://marble.worldlabs.ai/world/cbd8d6fb-4511-4d2c-a941-f4...

  • hobofan 3 hours ago
  • cubefox 2 hours ago

    Impressive!