Learning-based visual relocalizers exhibit leading pose accuracy, but require hours or days of training. Since training must be repeated for every new scene, long training times make learning-based relocalization impractical for most applications, despite its promise of high accuracy. In this paper we show how such a system can achieve the same accuracy in less than 5 minutes. We start from the obvious: a relocalization network can be split into a scene-agnostic feature backbone and a scene-specific prediction head. Less obvious: using an MLP prediction head allows us to optimize across thousands of viewpoints simultaneously in every training iteration. This leads to stable and extremely fast convergence. Furthermore, we replace effective but slow end-to-end training through a robust pose solver with a curriculum over a reprojection loss. Our approach does not require privileged knowledge, such as depth maps or a 3D model, for speedy training. Overall, our approach is up to 300x faster in mapping than state-of-the-art scene coordinate regression, while keeping accuracy on par.
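The two training ingredients named in the abstract, batches mixed across many viewpoints and a curriculum over a reprojection loss, can be illustrated with a minimal NumPy sketch. Everything here is an assumption made for illustration and is not taken from the paper's code: the function names, the buffer layout (one pose and one intrinsics matrix stored per sample), the tanh-shaped robust loss, the circular threshold schedule, and the default values of `tau_min` and `tau_max`.

```python
import numpy as np


def reprojection_errors(points_world, cam_from_world, intrinsics, pixels):
    """Pixel distance between projected 3D predictions and their source pixels.

    points_world:   (N, 3) predicted scene coordinates, one per sample
    cam_from_world: (N, 3, 4) [R|t] of the view each sample came from
    intrinsics:     (N, 3, 3) camera matrix of that view
    pixels:         (N, 2) pixel position of each sample
    Storing pose and intrinsics per sample lets one batch mix many views.
    """
    ones = np.ones((points_world.shape[0], 1))
    homog = np.concatenate([points_world, ones], axis=1)         # (N, 4)
    points_cam = np.einsum("nij,nj->ni", cam_from_world, homog)  # (N, 3)
    projected = np.einsum("nij,nj->ni", intrinsics, points_cam)  # (N, 3)
    # Simplification: clamp depth to avoid division by zero; a real
    # system would treat points behind the camera separately.
    depth = np.maximum(projected[:, 2:3], 1e-6)
    return np.linalg.norm(projected[:, :2] / depth - pixels, axis=1)


def curriculum_threshold(progress, tau_min=1.0, tau_max=50.0):
    """Hypothetical schedule: threshold shrinks as progress goes 0 -> 1."""
    weight = np.sqrt(1.0 - progress ** 2)
    return weight * tau_max + tau_min


def robust_loss(errors, tau):
    """Soft-clamped mean error; each sample contributes at most tau."""
    return float(np.mean(tau * np.tanh(errors / tau)))


def sample_mixed_batch(rng, buffer_size, batch_size):
    """Draw sample indices uniformly from a buffer spanning all views."""
    return rng.choice(buffer_size, size=batch_size, replace=False)
```

In a training loop under these assumptions, each iteration would call `sample_mixed_batch` on a buffer of precomputed backbone features, run the MLP head on the selected features to obtain `points_world`, and minimize `robust_loss(reprojection_errors(...), curriculum_threshold(t))`, where `t` is the fraction of training completed. Early on the large threshold tolerates coarse predictions; later the small threshold focuses the loss on samples that are already nearly correct.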