In this paper, we aim to model 3D scene dynamics from multi-view videos. Unlike most existing works, which focus on novel view synthesis within the training time period, we propose to simultaneously learn the geometry, appearance, and physical velocity of 3D scenes from video frames alone, thereby supporting multiple applications, including future frame extrapolation, unsupervised 3D semantic scene decomposition, and dynamic motion transfer. Our method consists of three major components: 1) the keyframe dynamic radiance field, 2) the interframe velocity field, and 3) a joint keyframe and interframe optimization module, which forms the core of our framework and is used to train both networks effectively. To validate our method, we further introduce two dynamic 3D datasets: 1) the Dynamic Object dataset and 2) the Dynamic Indoor Scene dataset. We conduct extensive experiments on multiple datasets, demonstrating the superior performance of our method over all baselines, particularly in the critical tasks of future frame extrapolation and unsupervised 3D semantic scene decomposition.