DreamScene4D: Dynamic Multi-Object Scene Generation from Monocular Videos

Existing vision-language models (VLMs) can track objects in in-the-wild 2D videos, and current generative models provide powerful visual priors for synthesizing novel views, which helps with the highly under-constrained problem of lifting 2D objects to 3D. Building on this progress, we present DreamScene4D, the first approach that can generate three-dimensional dynamic scenes of multiple objects from monocular in-the-wild videos with large object motion across occlusions and novel viewpoints. Our key insight is a "decompose-then-recompose" scheme that factorizes both the whole video scene and each object's 3D motion. We first decompose the video scene using open-vocabulary mask trackers and an adapted image diffusion model to segment, track, and amodally complete the objects and background in the video. Each object track is mapped to a set of 3D Gaussians that deform and move in space and time. To handle fast motion, we also factorize the observed motion into multiple components. Camera motion is inferred by re-rendering the background to match the video frames. For object motion, we first model each object's deformation in an object-centric frame using rendering losses and multi-view generative priors, then optimize the object-centric-to-world-frame transformations by comparing the rendered outputs against the observed pixels and optical flow. Finally, we recompose the background and objects and optimize the relative object scales under guidance from monocular depth prediction. We show extensive results on the challenging DAVIS and Kubric datasets and on self-captured videos, detail some limitations, and provide future directions. Besides 4D scene generation, our results show that DreamScene4D enables accurate 2D point motion tracking by projecting the inferred 3D trajectories to 2D, despite never being explicitly trained to do so.
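
The motion factorization and the 2D point tracking described in the abstract can be illustrated with a small sketch. This is not the authors' implementation. It assumes a single tracked point instead of a set of 3D Gaussians, a per-frame additive offset standing in for the object-centric deformation, rigid object and camera poses restricted to a rotation about the vertical axis plus a translation, and a pinhole camera looking along +z. All function names and the toy values are hypothetical.

```python
import math


def rigid_transform(point, yaw, translation):
    """Rotate `point` about the vertical (y) axis by `yaw` radians, then translate."""
    x, y, z = point
    tx, ty, tz = translation
    c, s = math.cos(yaw), math.sin(yaw)
    return (c * x + s * z + tx, y + ty, -s * x + c * z + tz)


def world_trajectory(canonical_point, deformation_offsets, object_poses):
    """Compose the two object-motion factors for every frame.

    Factor 1: an object-centric deformation (here a simple additive offset).
    Factor 2: an object-centric-to-world rigid transform (yaw, translation).
    """
    track = []
    for offset, (yaw, translation) in zip(deformation_offsets, object_poses):
        deformed = tuple(p + d for p, d in zip(canonical_point, offset))
        track.append(rigid_transform(deformed, yaw, translation))
    return track


def project_to_image(point_world, camera_pose, focal, principal_point):
    """Pinhole projection of a world point into a camera given as (yaw, position)."""
    cam_yaw, cam_position = camera_pose
    # World-to-camera is the inverse of the camera pose: subtract, then un-rotate.
    relative = tuple(p - c for p, c in zip(point_world, cam_position))
    x, y, z = rigid_transform(relative, -cam_yaw, (0.0, 0.0, 0.0))
    if z <= 0:
        raise ValueError("point is behind the camera")
    cx, cy = principal_point
    return (focal * x / z + cx, focal * y / z + cy)


def track_2d(canonical_point, deformation_offsets, object_poses,
             camera_poses, focal, principal_point):
    """Project the composed 3D trajectory into each frame to get a 2D track."""
    points_world = world_trajectory(canonical_point, deformation_offsets, object_poses)
    return [
        project_to_image(p, cam, focal, principal_point)
        for p, cam in zip(points_world, camera_poses)
    ]


# Toy two-frame example (assumed values): the object slides right while the
# tracked point deforms upward; the camera stays at the origin.
offsets = [(0.0, 0.0, 0.0), (0.0, 0.5, 0.0)]
poses = [(0.0, (0.0, 0.0, 5.0)), (0.0, (1.0, 0.0, 5.0))]
cameras = [(0.0, (0.0, 0.0, 0.0)), (0.0, (0.0, 0.0, 0.0))]
example_track = track_2d((0.0, 0.0, 0.0), offsets, poses, cameras,
                         focal=100.0, principal_point=(50.0, 50.0))
print(example_track)  # [(50.0, 50.0), (70.0, 60.0)]
```

Keeping the deformation, the object pose, and the camera pose as separate inputs mirrors the factorization in the abstract: each factor can be estimated on its own, and a 2D track falls out of projecting the composed 3D trajectory.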
