The challenge of dynamic view synthesis from monocular videos, i.e., synthesizing novel views at free viewpoints given a monocular video of a dynamic scene captured by a moving camera, lies mainly in accurately modeling the dynamic objects of the scene from a limited set of 2D frames, each with a different timestamp and viewpoint. Existing methods usually rely on 2D optical flow and depth maps pre-computed by additional methods to supervise the network, so they suffer from the inaccuracy of this pre-processed supervision and from the ambiguity of lifting 2D information to 3D. In this paper, we tackle this challenge in an unsupervised fashion. Specifically, we decouple the motion of the dynamic objects into object motion and camera motion, regularized respectively by the proposed unsupervised surface consistency constraint and patch-based multi-view constraint. The former enforces the 3D geometric surfaces of moving objects to be consistent over time, while the latter regularizes their appearance to be consistent across different viewpoints. This fine-grained motion formulation eases the learning difficulty for the network, enabling it to produce not only higher-quality novel views but also more accurate scene flow and depth than existing methods that require extra supervision. We will make the code publicly available.