We introduce an approach to enhance novel view synthesis from images taken by a freely moving camera. The approach focuses on outdoor scenes, where recovering an accurate geometric scaffold and camera poses is challenging, leading to inferior results with the state-of-the-art stable view synthesis (SVS) method. SVS and related methods fail on outdoor scenes primarily because they (i) over-rely on multiview stereo (MVS) for geometric scaffold recovery and (ii) assume that COLMAP-computed camera poses are the best possible estimates, even though it is well established that MVS 3D reconstruction accuracy is limited by scene disparity and that camera-pose accuracy is sensitive to key-point correspondence selection. This work proposes a principled way to enhance novel view synthesis, drawing inspiration from the basics of multiple view geometry. By leveraging the complementary behavior of MVS and monocular depth, which are better suited to nearby and far points, respectively, we arrive at a better per-view scene depth. Moreover, our approach jointly refines camera poses with image-based rendering via multiple rotation averaging graph optimization. The recovered scene depth and camera poses enable better view-dependent on-surface feature aggregation over the entire scene. Extensive evaluation on popular benchmark datasets such as Tanks and Temples shows substantial improvement in view synthesis results compared to the prior art. For instance, our method shows a 1.5 dB PSNR improvement on Tanks and Temples. Similar improvements are observed on other benchmark datasets such as FVS, Mip-NeRF 360, and DTU.
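The complementary use of MVS and monocular depth can be illustrated with a minimal sketch. This is not the method of the paper, only one plausible way to combine the two sources under stated assumptions: depth maps are flattened lists of floats, the monocular prediction is related to metric depth by an unknown scale and shift, and MVS is trusted only where it is valid and closer than a threshold. The names `fit_scale_shift`, `fuse_depth`, and `far_threshold` are hypothetical.

```python
from statistics import fmean


def fit_scale_shift(mono, mvs, trusted):
    """Least-squares (s, t) such that s * mono + t approximates mvs on trusted pixels."""
    xs = [m for m, ok in zip(mono, trusted) if ok]
    ys = [d for d, ok in zip(mvs, trusted) if ok]
    if len(xs) < 2:
        raise ValueError("need at least two trusted pixels to fit scale and shift")
    mean_x, mean_y = fmean(xs), fmean(ys)
    var_x = sum((x - mean_x) ** 2 for x in xs)
    if var_x == 0:
        raise ValueError("monocular depth is constant on trusted pixels")
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    scale = cov_xy / var_x
    shift = mean_y - scale * mean_x
    return scale, shift


def fuse_depth(mvs, mono, valid, far_threshold):
    """Keep MVS depth for valid nearby pixels; elsewhere use aligned monocular depth.

    Assumption (not from the source): 'nearby' means MVS depth <= far_threshold.
    """
    trusted = [ok and d <= far_threshold for d, ok in zip(mvs, valid)]
    scale, shift = fit_scale_shift(mono, mvs, trusted)
    fused = []
    for d_mvs, d_mono, ok in zip(mvs, mono, trusted):
        fused.append(d_mvs if ok else scale * d_mono + shift)
    return fused


if __name__ == "__main__":
    mvs = [1.0, 2.0, 3.0, 0.0, 50.0]      # 0.0 marks a pixel with no MVS estimate
    mono = [0.5, 1.0, 1.5, 5.0, 10.0]     # relative monocular depth
    valid = [True, True, True, False, True]
    print(fuse_depth(mvs, mono, valid, far_threshold=10.0))
```

In this sketch the monocular depth is first aligned to the MVS scale using only the trusted nearby pixels, so that far or missing pixels inherit a depth consistent with the reconstructed part of the scene. A hard threshold is the simplest choice; a confidence-weighted blend would be a natural alternative.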