Understanding dynamic scenes from casual videos is critical for scalable robot learning, yet four-dimensional (4D) reconstruction under strictly monocular settings remains highly ill-posed. To address this challenge, our key insight is that real-world dynamics exhibit multi-scale regularity, from the object level down to the particle level. Accordingly, we design a multi-scale dynamics mechanism that factorizes complex motion fields. Within this formulation, we propose Gaussian sequences with multi-scale dynamics, a novel representation for dynamic 3D Gaussians derived through the composition of multi-level motion. This layered structure substantially alleviates reconstruction ambiguity and promotes physically plausible dynamics. We further incorporate multi-modal priors from vision foundation models to establish complementary supervision, constraining the solution space and improving reconstruction fidelity. Our approach enables accurate and globally consistent 4D reconstruction from monocular casual videos. Experiments on dynamic novel-view synthesis (NVS) with benchmark and real-world manipulation datasets demonstrate considerable improvements over existing methods.
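The layered motion composition described above can be illustrated with a minimal sketch: a Gaussian's center at time t is obtained by first applying a coarse object-level rigid transform and then adding a fine particle-level residual. All function and variable names here are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def compose_motion(mu0, R_obj, t_obj, delta_particle):
    """Hypothetical two-level motion composition for one timestep.

    mu0:            (N, 3) canonical Gaussian centers
    R_obj, t_obj:   object-level rigid rotation (3, 3) and translation (3,)
    delta_particle: (N, 3) per-Gaussian fine-scale residual displacements
    """
    # Coarse level: move all Gaussians rigidly with the object.
    coarse = mu0 @ R_obj.T + t_obj
    # Fine level: add per-particle corrections on top of the rigid motion.
    return coarse + delta_particle

# Toy example: identity rotation, unit translation in x, small residuals.
mu0 = np.zeros((4, 3))
R = np.eye(3)
t = np.array([1.0, 0.0, 0.0])
delta = np.full((4, 3), 0.01)
mu_t = compose_motion(mu0, R, t, delta)
# Each center moves rigidly by t, then shifts by its residual.
```

Factorizing motion this way is what constrains the ill-posed monocular problem: the coarse transform explains most of the displacement with few parameters, leaving only small residuals for the fine level to fit.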