The introduction of neural radiance fields has greatly improved view synthesis for monocular videos. However, existing algorithms struggle with uncontrolled or lengthy scenarios and require extensive training time for each new scenario. To address these limitations, we propose DynPoint, an algorithm designed to facilitate rapid novel view synthesis for unconstrained monocular videos. Rather than encoding all of the scenario information into a latent representation, DynPoint predicts the explicit 3D correspondence between neighboring frames and uses it to aggregate information. Specifically, the correspondence is predicted by estimating consistent depth and scene flow across frames. The resulting correspondence is then used to aggregate information from multiple reference frames into a target frame by constructing hierarchical neural point clouds. The resulting framework enables fast and accurate synthesis of desired views of target frames. Experiments show that our method accelerates training considerably, typically by an order of magnitude, while yielding results comparable to prior approaches. Furthermore, our method is robust on long-duration videos without learning a canonical representation of the video content.
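The correspondence step described above can be illustrated with a minimal NumPy sketch. It is not DynPoint's implementation: it assumes a pinhole camera with intrinsics `K`, scene flow expressed as a per-pixel 3D offset in the reference camera's coordinates, and a known rigid transform `T_ref_to_tgt` between cameras. The function names and the dictionary keys (`depth`, `flow`, `T`, `feat`) are hypothetical, and the hierarchical neural point cloud itself is not modeled; the sketch only shows how depth and scene flow yield points from several reference frames expressed in one target frame.

```python
import numpy as np

def backproject(depth, K):
    """Lift every pixel of an (H, W) depth map to camera-space 3D points (H*W, 3)."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # u: column index, v: row index
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(np.float64)
    rays = pix @ np.linalg.inv(K).T                 # K^{-1} [u, v, 1]^T for each pixel
    return rays * depth.reshape(-1, 1)

def warp_reference_to_target(depth_ref, flow_ref_to_tgt, K, T_ref_to_tgt):
    """Express reference-frame points at the target time, in the target camera.

    flow_ref_to_tgt: (H, W, 3) scene flow in reference camera coordinates (assumption).
    T_ref_to_tgt:    (4, 4) rigid transform from reference to target camera (assumption).
    """
    pts = backproject(depth_ref, K) + flow_ref_to_tgt.reshape(-1, 3)
    pts_h = np.concatenate([pts, np.ones((pts.shape[0], 1))], axis=1)
    return (pts_h @ T_ref_to_tgt.T)[:, :3]

def aggregate(ref_frames, K):
    """Concatenate warped points and per-pixel features from several reference frames."""
    pts = [warp_reference_to_target(f["depth"], f["flow"], K, f["T"]) for f in ref_frames]
    feats = [f["feat"].reshape(-1, f["feat"].shape[-1]) for f in ref_frames]
    return np.concatenate(pts), np.concatenate(feats)
```

In this sketch, the aggregated points and features would be what a renderer consumes for the target view; how DynPoint organizes them hierarchically and renders from them is specified in the paper, not here.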