A Construct-Optimize Approach to Sparse View Synthesis without Camera Pose

Novel view synthesis from a sparse set of input images is a challenging problem of great practical interest, especially when camera poses are absent or inaccurate. Direct optimization of camera poses and usage of estimated depths in neural radiance field algorithms usually do not produce good results because of the coupling between poses and depths, and inaccuracies in monocular depth estimation. In this paper, we leverage the recent 3D Gaussian splatting method to develop a novel construct-and-optimize method for sparse view synthesis without camera poses. Specifically, we construct a solution progressively by using monocular depth and projecting pixels back into the 3D world. During construction, we optimize the solution by detecting 2D correspondences between training views and the corresponding rendered images. We develop a unified differentiable pipeline for camera registration and adjustment of both camera poses and depths, followed by back-projection. We also introduce a novel notion of an expected surface in Gaussian splatting, which is critical to our optimization. These steps enable a coarse solution, which can then be low-pass filtered and refined using standard optimization methods. We demonstrate results on the Tanks and Temples and Static Hikes datasets with as few as three widely-spaced views, showing significantly better quality than competing methods, including those with approximate camera pose information. Moreover, our results improve with more views and outperform previous InstantNGP and Gaussian Splatting algorithms even when using half the dataset. Project page: https://raymondjiangkw.github.io/cogs.github.io/
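
As a rough illustration of the construction step described above (projecting pixels back into the 3D world using monocular depth), the following is a minimal NumPy sketch of back-projection. It is not the authors' implementation. It assumes a pinhole camera with a 3x3 intrinsics matrix `K`, a depth map that stores z-depth along the optical axis, and a 4x4 camera-to-world matrix. The function name `backproject` and all parameter names are hypothetical.

```python
import numpy as np

def backproject(depth, K, cam_to_world):
    """Lift every pixel of a depth map to a 3D point in world space.

    Assumptions (not taken from the paper): pinhole intrinsics K (3x3),
    depth holds z-depth per pixel (H x W), cam_to_world is a 4x4 pose.
    """
    h, w = depth.shape
    # Pixel centers in homogeneous image coordinates (u, v, 1).
    u, v = np.meshgrid(np.arange(w) + 0.5, np.arange(h) + 0.5)
    pixels = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3)
    # Camera-frame rays with unit z, then scaled by the per-pixel depth.
    rays = pixels @ np.linalg.inv(K).T
    points_cam = rays * depth.reshape(-1, 1)
    # Rotate and translate the points into the world frame.
    R, t = cam_to_world[:3, :3], cam_to_world[:3, 3]
    return points_cam @ R.T + t
```

In a pipeline of the kind the abstract describes, points like these would presumably seed the 3D Gaussians for a newly registered view. Because monocular depth is only approximate, the depth and pose fed into such a function would themselves be adjusted during optimization rather than treated as fixed.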
