Monocular dynamic reconstruction is a challenging and long-standing vision problem due to the highly ill-posed nature of the task. Existing approaches depend on templates, are effective only in quasi-static scenes, or fail to model 3D motion explicitly. We introduce a method for reconstructing generic dynamic scenes, featuring explicit, persistent 3D motion trajectories in the world coordinate frame, from casually captured monocular videos. We tackle the problem with two key insights: First, we exploit the low-dimensional structure of 3D motion by representing scene motion with a compact set of SE(3) motion bases. Each point's motion is expressed as a linear combination of these bases, facilitating soft decomposition of the scene into multiple rigidly moving groups. Second, we take advantage of off-the-shelf data-driven priors such as monocular depth maps and long-range 2D tracks, and devise a method to effectively consolidate these noisy supervisory signals, resulting in a globally consistent representation of the dynamic scene. Experiments show that our method achieves state-of-the-art performance for both long-range 3D/2D motion estimation and novel view synthesis on dynamic scenes. Project Page: https://shape-of-motion.github.io/
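The core representation described above, expressing each point's motion as a linear combination of a small set of SE(3) motion bases, can be illustrated with a minimal NumPy sketch. This is not the paper's implementation; the function name `blend_points` and the convex (rows-sum-to-one) weighting are illustrative assumptions, in the style of linear blend skinning over rigid transforms:

```python
import numpy as np

def blend_points(points, bases, weights):
    """Move 3D points by a weighted combination of SE(3) motion bases.

    points:  (N, 3) point positions in the world frame.
    bases:   (B, 4, 4) homogeneous SE(3) transforms, one per motion basis.
    weights: (N, B) per-point basis weights; rows assumed to sum to 1
             (e.g. softmax outputs), giving a soft rigid-group assignment.
    Returns: (N, 3) blended point positions.
    """
    # Lift points to homogeneous coordinates: (N, 4).
    homo = np.concatenate([points, np.ones((len(points), 1))], axis=1)
    # Apply every basis transform to every point: (N, B, 4).
    per_basis = np.einsum("bij,nj->nbi", bases, homo)
    # Weighted sum over bases: (N, 4), then drop the homogeneous coordinate.
    blended = np.einsum("nb,nbi->ni", weights, per_basis)
    return blended[:, :3]
```

A point whose weight vector is one-hot follows a single basis exactly (pure rigid motion), while intermediate weights softly interpolate between rigid groups, which is the "soft decomposition" the abstract refers to.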