In this paper, we propose an algorithm for jointly refining camera poses and scene geometry represented by a decomposed low-rank tensor, using only 2D images as supervision. First, we conduct a pilot study on a 1D signal and relate our findings to 3D scenarios, where naive joint pose optimization on voxel-based NeRFs can easily lead to sub-optimal solutions. Moreover, based on an analysis of the frequency spectrum, we propose applying convolutional Gaussian filters to 2D and 3D radiance fields to obtain a coarse-to-fine training schedule that enables joint camera pose optimization. Leveraging the decomposition property of the decomposed low-rank tensor, our method achieves an effect equivalent to brute-force 3D convolution while incurring only little computational overhead. To further improve the robustness and stability of joint optimization, we also propose three techniques: smoothed 2D supervision, randomly scaled kernel parameters, and an edge-guided loss mask. Extensive quantitative and qualitative evaluations demonstrate that our proposed framework achieves superior performance in novel view synthesis as well as rapid convergence during optimization.
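The claim that filtering the decomposed tensor matches 3D convolution follows from the separability of the Gaussian kernel. The sketch below illustrates that property under simplifying assumptions and is not the paper's implementation. It assumes a CP-style decomposition (a sum of rank-one outer products of 1D factors), which may differ from the decomposition the paper uses. It also assumes zero-padded "same" convolution, and all function names are hypothetical.

```python
import numpy as np


def gaussian_kernel_1d(sigma, radius=None):
    """Normalized 1D Gaussian kernel; the 3-sigma radius is an assumed default."""
    if radius is None:
        radius = int(np.ceil(3 * sigma))
    x = np.arange(-radius, radius + 1)
    k = np.exp(-0.5 * (x / sigma) ** 2)
    return k / k.sum()


def cp_to_dense(fx, fy, fz):
    """Reconstruct a dense (X, Y, Z) grid from CP factors of shape (R, X), (R, Y), (R, Z)."""
    return np.einsum("rx,ry,rz->xyz", fx, fy, fz)


def blur_dense_3d(grid, k):
    """Gaussian-blur the dense grid by filtering along each of the three axes.

    Because the 3D Gaussian is separable, this equals convolving with the
    full 3D kernel, but it still touches every voxel of the dense grid.
    """
    out = grid
    for axis in range(3):
        out = np.apply_along_axis(
            lambda v: np.convolve(v, k, mode="same"), axis, out
        )
    return out


def blur_cp_factors(fx, fy, fz, k):
    """Blur each 1D factor separately; the dense grid is never materialized."""
    def blur(f):
        return np.stack([np.convolve(row, k, mode="same") for row in f])
    return blur(fx), blur(fy), blur(fz)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical rank-4 factors for a 16 x 12 x 10 grid.
    fx = rng.normal(size=(4, 16))
    fy = rng.normal(size=(4, 12))
    fz = rng.normal(size=(4, 10))
    k = gaussian_kernel_1d(sigma=1.0)

    dense_blur = blur_dense_3d(cp_to_dense(fx, fy, fz), k)
    factor_blur = cp_to_dense(*blur_cp_factors(fx, fy, fz, k))
    print("max abs difference:", np.abs(dense_blur - factor_blur).max())
```

In this sketch, filtering the factors costs on the order of R(X + Y + Z)K multiply-adds for rank R and kernel length K, whereas filtering the dense grid costs on the order of XYZ·K per axis. This is one way to read the "little computational overhead" claim.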