This is the general idea of how I understood it to work.
- It's based on WAN, a video model; it generates one frame at a time in latent space, which we later decode into RGB.
- We keep each frame's latent and its decoded RGB in a database. Along with the RGB we compute depth, so we get a point cloud (RGBD). This is used for persistence.
- For each new frame we check which past frames have their point cloud contained in the current camera frustum. We take the top 3 frames with the most overlap and retrieve their latents.
- We feed these 3 latents to WAN through its cross-attention layers as conditioning, and that is how consistency is achieved.
- The RGBD data can also be used to generate a Gaussian splatting of the scene.
https://www.youtube.com/watch?v=eCw33snvoNI
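The frustum-overlap retrieval step above can be sketched roughly like this. This is a minimal illustration under assumed conventions (pinhole intrinsics, world-to-camera matrices); the names `MemoryFrame`, `frustum_overlap`, and `top_k_frames` are my own, not from the paper or WAN.

```python
# Hypothetical sketch: score each stored frame by the fraction of its point
# cloud that projects inside the new camera's frustum, then take the top k.
import numpy as np
from dataclasses import dataclass

@dataclass
class MemoryFrame:
    latent: np.ndarray   # the frame's latent (placeholder shape)
    points: np.ndarray   # (N, 3) point cloud from RGB + depth, world coords

def frustum_overlap(points, world_to_cam, fx, fy, cx, cy, w, h,
                    z_near=0.1, z_far=100.0):
    """Fraction of a frame's points that land inside the camera frustum."""
    # Transform world points into camera coordinates (homogeneous).
    pts_h = np.concatenate([points, np.ones((len(points), 1))], axis=1)
    cam = (world_to_cam @ pts_h.T).T[:, :3]
    z = cam[:, 2]
    in_depth = (z > z_near) & (z < z_far)
    # Pinhole projection; keep points that fall inside the image bounds.
    u = fx * cam[:, 0] / np.maximum(z, 1e-6) + cx
    v = fy * cam[:, 1] / np.maximum(z, 1e-6) + cy
    inside = in_depth & (u >= 0) & (u < w) & (v >= 0) & (v < h)
    return inside.mean()

def top_k_frames(memory, world_to_cam, intrinsics, k=3):
    """Return the latents of the k past frames with the most frustum overlap."""
    scores = [frustum_overlap(f.points, world_to_cam, *intrinsics)
              for f in memory]
    order = np.argsort(scores)[::-1][:k]
    return [memory[i].latent for i in order]
```

A real system would likely score overlap more carefully (occlusion, point density), but the top-k selection over projected point clouds is the core idea described above.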
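The conditioning step can be illustrated with a bare-bones cross-attention pass: the current frame's latent tokens act as queries, and tokens from the retrieved memory latents act as keys/values. This is a generic numpy sketch of cross-attention, not WAN's actual layer; all shapes and names are illustrative.

```python
# Hypothetical sketch: current-frame tokens attend over memory-frame tokens,
# so retrieved past frames influence the new frame's generation.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(frame_tokens, memory_tokens, Wq, Wk, Wv):
    """frame_tokens: (T, d) queries; memory_tokens: (M, d) keys/values."""
    Q = frame_tokens @ Wq
    K = memory_tokens @ Wk
    V = memory_tokens @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])   # (T, M) attention logits
    return softmax(scores, axis=-1) @ V       # (T, d) conditioned tokens

rng = np.random.default_rng(0)
d = 8
frame = rng.normal(size=(4, d))        # current frame's latent tokens
memory = rng.normal(size=(3 * 4, d))   # tokens from the top-3 retrieved latents
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out = cross_attention(frame, memory, Wq, Wk, Wv)
```

The output replaces (or is mixed into) the frame tokens inside the diffusion backbone, which is how past frames steer the new frame toward consistency.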