The goal of our work is to synthesize high-quality novel views from monocular videos of complex dynamic scenes. Prior methods such as DynamicNeRF achieve impressive results by representing the scene as a time-varying dynamic radiance field. However, these methods struggle to accurately model the motion of complex objects, which leads to inaccurate and blurry renderings of fine details. To address this limitation, we build on recent generalizable NeRF methods, which render a novel viewpoint by aggregating features from nearby source views. Such methods, however, are typically effective only for static scenes. To overcome this, we introduce a module that operates in both the time and frequency domains to aggregate object motion features across frames. This allows the model to learn the relationships between frames and generate higher-quality images. Experiments on dynamic scene datasets show that our approach significantly outperforms state-of-the-art methods in both the accuracy and the visual quality of the synthesized views.
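The abstract does not say how the time- and frequency-domain module is built. The following is a purely illustrative sketch of one way features from neighboring frames *could* be aggregated in both domains. It assumes that each of T neighboring frames contributes a D-dimensional feature vector for the same sample point. The time-domain branch uses softmax weights derived from dot-product similarity to the target frame. The frequency-domain branch applies a DFT along the time axis, uses a fixed low-pass filter in place of a learned one, and then applies the inverse DFT. The two results are fused by simple averaging. All function names, the weighting scheme, the filter, and the fusion rule are hypothetical stand-ins, not the authors' module.

```python
import cmath
import math
from typing import List

Vector = List[float]


def time_domain_aggregate(feats: List[Vector], target: int) -> Vector:
    """Hypothetical time-domain branch: softmax-weighted average of per-frame
    features, weighted by dot-product similarity to the target frame."""
    q = feats[target]
    scores = [sum(a * b for a, b in zip(q, f)) for f in feats]
    m = max(scores)  # subtract the max score for numerical stability
    w = [math.exp(s - m) for s in scores]
    z = sum(w)
    dim = len(q)
    return [sum(w[t] * feats[t][d] for t in range(len(feats))) / z for d in range(dim)]


def frequency_domain_aggregate(feats: List[Vector], target: int, keep: int) -> Vector:
    """Hypothetical frequency-domain branch: per channel, take the DFT along
    time, zero every bin except the `keep` lowest frequencies (and their
    conjugate mirrors), invert, and read off the value at the target frame.
    The fixed low-pass mask stands in for a learned spectral filter."""
    T, dim = len(feats), len(feats[0])
    out = []
    for d in range(dim):
        signal = [feats[t][d] for t in range(T)]
        spec = [
            sum(signal[t] * cmath.exp(-2j * math.pi * k * t / T) for t in range(T))
            for k in range(T)
        ]
        # Keep bin k only if min(k, T - k) < keep, so the filtered signal stays real.
        filt = [spec[k] if min(k, T - k) < keep else 0 for k in range(T)]
        value = sum(filt[k] * cmath.exp(2j * math.pi * k * target / T) for k in range(T)) / T
        out.append(value.real)
    return out


def aggregate(feats: List[Vector], target: int, keep: int = 1) -> Vector:
    """Hypothetical fusion: average the time- and frequency-domain outputs."""
    a = time_domain_aggregate(feats, target)
    b = frequency_domain_aggregate(feats, target, keep)
    return [(x + y) / 2 for x, y in zip(a, b)]
```

With `keep=1`, only the zero-frequency bin survives, so the frequency branch returns the temporal mean of each channel. With `keep` greater than T/2, every bin is kept, and the branch reproduces the target frame's feature exactly. A learned module would presumably replace both the fixed mask and the averaging with trainable parameters.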