Neural radiance fields have greatly improved view synthesis for monocular videos. However, existing algorithms struggle with uncontrolled or lengthy scenarios and require extensive training time for each new scenario. To address these limitations, we propose DynPoint, an algorithm designed to facilitate rapid synthesis of novel views for unconstrained monocular videos. Rather than encoding the entire scenario into a latent representation, DynPoint concentrates on predicting the explicit 3D correspondence between neighboring frames to realize information aggregation. Specifically, this correspondence is predicted by estimating consistent depth and scene flow across frames. The acquired correspondence is then used to aggregate information from multiple reference frames to a target frame by constructing hierarchical neural point clouds. The resulting framework enables swift and accurate view synthesis for desired views of target frames. Experimental results demonstrate that our method considerably accelerates training, typically by an order of magnitude, while yielding results comparable to prior approaches. Furthermore, our method is strongly robust on long-duration videos without learning a canonical representation of the video content.
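The geometric core of the correspondence step can be illustrated with a minimal sketch. It assumes a pinhole camera with known intrinsics `K`, a per-pixel depth map, a per-pixel 3D scene flow expressed in the reference camera frame, and a known relative pose `(R, t)` from the reference camera to the target camera. All function and variable names here are hypothetical and chosen for illustration; this is not the authors' implementation, and it omits the learned depth and flow estimators as well as the hierarchical neural point clouds.

```python
import numpy as np

def backproject(depth, K):
    """Lift every pixel of a depth map to a 3D point in camera coordinates."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel grids, each (h, w)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).astype(np.float64)
    rays = pix @ np.linalg.inv(K).T                 # K^{-1} [u, v, 1]^T per pixel
    return rays * depth[..., None]                  # scale each ray by its depth

def warp_to_target(points_ref, scene_flow, R, t):
    """Apply scene flow (object motion), then the rigid camera transform."""
    moved = points_ref + scene_flow                 # assumed: flow in reference frame
    return moved @ R.T + t                          # reference camera -> target camera

def project(points, K):
    """Project 3D camera-frame points to pixel coordinates."""
    proj = points @ K.T
    return proj[..., :2] / proj[..., 2:3]           # perspective divide

# Toy example: 3x4 image, constant depth 2, static scene.
K = np.array([[100.0, 0.0, 2.0],
              [0.0, 100.0, 1.5],
              [0.0, 0.0, 1.0]])
depth = np.full((3, 4), 2.0)
flow = np.zeros((3, 4, 3))

points = backproject(depth, K)
same_view = project(warp_to_target(points, flow, np.eye(3), np.zeros(3)), K)
shifted = project(warp_to_target(points, flow, np.eye(3), np.array([0.02, 0.0, 0.0])), K)
```

With zero flow and an identity pose, every pixel maps back to itself; with the small sideways translation above, every pixel shifts by one column. In a full pipeline, the resulting pixel correspondences would indicate where features from each reference frame should be gathered for the target frame.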