We address the problem of synthesizing novel views from a monocular video depicting a complex dynamic scene. State-of-the-art methods based on temporally varying Neural Radiance Fields (aka dynamic NeRFs) have shown impressive results on this task. However, for long videos with complex object motions and uncontrolled camera trajectories, these methods can produce blurry or inaccurate renderings, hampering their use in real-world applications. Instead of encoding the entire dynamic scene within the weights of MLPs, we present a new approach that addresses these limitations by adopting a volumetric image-based rendering framework that synthesizes new viewpoints by aggregating features from nearby views in a scene-motion-aware manner. Our system retains the advantages of prior methods in its ability to model complex scenes and view-dependent effects, but also enables synthesizing photo-realistic novel views from long videos featuring complex scene dynamics with unconstrained camera trajectories. We demonstrate significant improvements over state-of-the-art methods on dynamic scene datasets, and also apply our approach to in-the-wild videos with challenging camera and object motion, where prior methods fail to produce high-quality renderings. Our project webpage is at dynibar.github.io.
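To make "aggregating features from nearby views in a scene-motion-aware manner" concrete, here is a minimal NumPy sketch under stated assumptions. It is not the authors' implementation. The function names (`project`, `aggregate_features`, `composite`), the per-view displacement input, the nearest-neighbour feature lookup, and the plain averaging across views are all hypothetical simplifications chosen for illustration. The idea it shows is that each sample along a target ray is first moved by an estimated scene motion to where it would be at a source frame's time, then projected into that source view to fetch a feature. The aggregated samples are then combined with standard volume-rendering compositing.

```python
import numpy as np

def project(points, K, R, t):
    """Pinhole projection of world points (N, 3) to pixels (N, 2) and depths (N,)."""
    cam = points @ R.T + t                      # world -> camera coordinates
    depth = cam[:, 2]
    safe = np.where(depth > 1e-8, depth, 1.0)   # avoid dividing by zero
    uv = (cam @ K.T)[:, :2] / safe[:, None]
    return uv, depth

def aggregate_features(points, displacements, views):
    """Average source-view features at motion-adjusted sample locations.

    points:        (N, 3) samples along the target ray at the target time
    displacements: (V, N, 3) assumed scene motion from the target time to
                   each source view's time (hypothetical input)
    views:         list of V dicts with keys K, R, t, feat of shape (H, W, C)
    """
    n = points.shape[0]
    c = views[0]["feat"].shape[2]
    total = np.zeros((n, c))
    count = np.zeros((n, 1))
    for disp, view in zip(displacements, views):
        warped = points + disp                  # motion-adjusted 3D position
        uv, depth = project(warped, view["K"], view["R"], view["t"])
        h, w, _ = view["feat"].shape
        col = np.rint(uv[:, 0]).astype(int)     # nearest-neighbour lookup
        row = np.rint(uv[:, 1]).astype(int)
        valid = (depth > 1e-8) & (col >= 0) & (col < w) & (row >= 0) & (row < h)
        total[valid] += view["feat"][row[valid], col[valid]]
        count[valid] += 1.0
    # Samples seen by no source view keep a zero feature.
    return total / np.maximum(count, 1.0)

def composite(colors, sigmas, deltas):
    """Standard volume-rendering compositing of per-sample colors along one ray."""
    alpha = 1.0 - np.exp(-sigmas * deltas)
    trans = np.concatenate([[1.0], np.cumprod(1.0 - alpha)[:-1]])
    weights = trans * alpha
    return (weights[:, None] * colors).sum(axis=0)
```

In a learned system the uniform average would typically be replaced by a network that weights views and predicts color and density from the aggregated features. The sketch only fixes the geometry of the lookup, which is the part the abstract describes.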