The goal of our work is to generate high-quality novel views from monocular videos of complex, dynamic scenes. Prior methods such as DynamicNeRF have shown impressive performance by leveraging time-varying dynamic radiance fields. However, these methods struggle to accurately model the motion of complex objects, which leads to inaccurate and blurry renderings of fine details. To address this limitation, we propose a novel approach that builds on a recent generalizable NeRF, which aggregates features from nearby views onto novel viewpoints. Such methods, however, are typically effective only for static scenes. To overcome this challenge, we introduce a module that operates in both the time and frequency domains to aggregate features of object motion, allowing the model to learn relationships between frames and produce higher-quality images. Experiments on dynamic scene datasets demonstrate significant improvements over state-of-the-art methods: our approach outperforms existing methods in both the accuracy and the visual quality of the synthesized views. Our code is available at https://github.com/xingy038/CTNeRF.
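To make the time- and frequency-domain aggregation idea concrete, the sketch below shows one plausible form such a module could take: per-frame ray features from nearby views are mixed in the frequency domain via an FFT along the time axis, then aggregated across frames with temporal attention. This is only an illustrative assumption; the class name `TimeFrequencyAggregator`, the layer choices, and all dimensions are hypothetical and do not describe the actual CTNeRF architecture (see the linked repository for the real implementation).

```python
import torch
import torch.nn as nn


class TimeFrequencyAggregator(nn.Module):
    """Illustrative sketch (not the paper's module): mix per-frame ray
    features in the frequency domain, then aggregate over frames in the
    time domain with attention."""

    def __init__(self, feat_dim: int = 64, num_heads: int = 4):
        super().__init__()
        # Learned mixing applied to the real/imaginary parts of the spectrum.
        self.freq_mix = nn.Linear(2 * feat_dim, 2 * feat_dim)
        # Temporal self-attention over the frame axis.
        self.attn = nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_rays, num_frames, feat_dim) features gathered from nearby views.
        spec = torch.fft.rfft(x, dim=1)                   # FFT along the time axis
        spec = torch.cat([spec.real, spec.imag], dim=-1)  # (rays, freq_bins, 2*feat_dim)
        spec = self.freq_mix(spec)                        # frequency-domain mixing
        real, imag = spec.chunk(2, dim=-1)
        x_freq = torch.fft.irfft(torch.complex(real, imag), n=x.shape[1], dim=1)
        # Time-domain aggregation: attend over frames, then pool to one feature per ray.
        x_time, _ = self.attn(x_freq, x_freq, x_freq)
        return x_time.mean(dim=1)                         # (num_rays, feat_dim)


if __name__ == "__main__":
    # Toy usage: 1024 rays, 8 source frames, 64-dim features per frame.
    feats = torch.randn(1024, 8, 64)
    agg = TimeFrequencyAggregator(feat_dim=64)
    print(agg(feats).shape)  # torch.Size([1024, 64])
```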