Visual (re)localization is critical for many applications in computer vision and robotics. Its goal is to estimate the 6-degree-of-freedom (6-DoF) camera pose of each query image from a set of posed database images. Currently, all leading solutions are structure-based: they either explicitly construct 3D metric maps from the database with structure-from-motion or implicitly encode the 3D information with scene coordinate regression models. In contrast, visual localization without reconstructing the scene in 3D offers clear benefits. It makes deployment more convenient by reducing database pre-processing time, lowering storage requirements, and remaining unaffected by imperfect reconstruction. In this technical report, we demonstrate that high localization accuracy can be achieved without reconstructing the scene from the database. The key is a tailored motion averaging over database-query pairs. Experiments show that our visual localization proposal, LazyLoc, achieves performance comparable to state-of-the-art structure-based methods. Furthermore, we showcase the versatility of LazyLoc, which can be easily extended to handle complex configurations such as multi-query co-localization and camera rigs.
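To make the idea of motion averaging over database-query pairs concrete, the following is a minimal generic sketch and not LazyLoc's tailored formulation, which the abstract does not detail. It assumes each database image has a known world-to-camera rotation `R_i` and camera center `c_i`, and that a two-view solver has already produced a relative rotation `R_iq` and a translation `t_iq` of unknown scale for each pair, under the convention `x_q = R_iq @ x_i + t_iq`. The query rotation is taken as the chordal mean of the per-pair estimates, and the query center as the least-squares intersection of the rays from the database centers. All function names are hypothetical.

```python
import numpy as np


def project_to_so3(M):
    """Return the rotation matrix closest to M in the Frobenius norm."""
    U, _, Vt = np.linalg.svd(M)
    # Flip the last singular direction if needed so that det(R) = +1.
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(U @ Vt))])
    return U @ D @ Vt


def average_rotations(rotations):
    """Chordal L2 mean: project the sum of rotation matrices onto SO(3)."""
    return project_to_so3(np.sum(rotations, axis=0))


def intersect_rays(origins, directions):
    """Least-squares point closest to all rays origin_i + s * direction_i."""
    A = np.zeros((3, 3))
    b = np.zeros(3)
    for o, d in zip(origins, directions):
        d = d / np.linalg.norm(d)
        P = np.eye(3) - np.outer(d, d)  # projector orthogonal to the ray
        A += P
        b += P @ o
    return np.linalg.solve(A, b)


def localize_query(db_rotations, db_centers, rel_rotations, rel_translations):
    """Estimate the query pose (R_q, c_q) from database-query relative poses.

    Assumed convention (not taken from the report): x_q = R_iq @ x_i + t_iq,
    with t_iq known only up to scale.
    """
    # Each pair votes for an absolute query rotation R_q = R_iq @ R_i.
    votes = [R_iq @ R_i for R_i, R_iq in zip(db_rotations, rel_rotations)]
    R_q = average_rotations(votes)

    # Each pair gives the world-frame direction from c_i towards the query.
    directions = [
        R_i.T @ (-R_iq.T @ t_iq)
        for R_i, R_iq, t_iq in zip(db_rotations, rel_rotations, rel_translations)
    ]
    c_q = intersect_rays(db_centers, directions)
    return R_q, c_q
```

At least two database images with non-parallel rays are needed for the center to be well defined; with noisy relative poses, a robust loss or outlier rejection would normally replace the plain least-squares steps used here.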