Despite advances in deep learning for camera relocalization, obtaining the ground-truth pose labels required for training remains costly. While current weakly supervised methods excel at lightweight label generation, their performance declines notably in sparse-view scenarios. In response to this challenge, we introduce WSCLoc, a system that can be adapted to various deep learning-based relocalization models to enhance their performance under weakly supervised, sparse-view conditions. WSCLoc operates in two stages. In the first stage, it employs a multilayer perceptron-based structure called WFT-NeRF to co-optimize image reconstruction quality and initial pose information. To ensure a stable learning process, we incorporate temporal information as input. Furthermore, instead of optimizing SE(3), we opt for $\mathfrak{sim}(3)$ optimization to explicitly enforce a scale constraint. In the second stage, we co-optimize the pre-trained WFT-NeRF and WFT-Pose. This optimization is enhanced by Time-Encoding based Random View Synthesis and supervised by inter-frame geometric constraints that consider pose, depth, and RGB information. We validate our approach on two publicly available datasets, one outdoor and one indoor. Our experimental results demonstrate that our weakly supervised relocalization solutions achieve high pose estimation accuracy in sparse-view scenarios, comparable to that of state-of-the-art camera relocalization methods. We will make our code publicly available.
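To illustrate what optimizing a similarity transform rather than a rigid one involves, the following is a minimal sketch, not the WSCLoc implementation. It assumes a simplified, decoupled 7-parameter layout (translation, axis-angle rotation, log-scale) instead of the full $\mathfrak{sim}(3)$ exponential map, in which translation is coupled to rotation and scale. The function names `rodrigues` and `sim3_matrix` and the parameter ordering are hypothetical choices made for this example.

```python
import numpy as np


def rodrigues(omega):
    """Axis-angle vector (3,) -> rotation matrix (3, 3) via Rodrigues' formula."""
    theta = np.linalg.norm(omega)
    # Skew-symmetric matrix of the (unnormalized) axis-angle vector.
    K = np.array([[0.0, -omega[2], omega[1]],
                  [omega[2], 0.0, -omega[0]],
                  [-omega[1], omega[0], 0.0]])
    if theta < 1e-8:
        # First-order approximation for very small rotations.
        return np.eye(3) + K
    return (np.eye(3)
            + (np.sin(theta) / theta) * K
            + ((1.0 - np.cos(theta)) / theta ** 2) * (K @ K))


def sim3_matrix(params):
    """Assumed layout: [t (3), omega (3), sigma (1)] -> 4x4 similarity transform.

    The scale is exp(sigma), so it stays positive for any real sigma.
    This is a decoupled parameterization, not the exact sim(3) exponential map.
    """
    t, omega, sigma = params[:3], params[3:6], params[6]
    T = np.eye(4)
    T[:3, :3] = np.exp(sigma) * rodrigues(omega)
    T[:3, 3] = t
    return T
```

Compared with a 6-parameter rigid pose, the extra log-scale entry gives the optimizer an explicit degree of freedom for scene scale. With the log-scale fixed at zero, the transform reduces to an ordinary rigid-body pose.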