Visual localization is a key technique for a variety of applications, e.g., autonomous driving, AR/VR, and robotics. For these real-world applications, both efficiency and accuracy are important, especially on edge devices with limited computing resources. However, previous frameworks, e.g., absolute pose regression (APR), scene coordinate regression (SCR), and the hierarchical method (HM), are limited in either accuracy or efficiency in both indoor and outdoor environments. In this paper, we propose the place recognition anywhere model (PRAM), a new framework that performs visual localization efficiently and accurately by recognizing 3D landmarks. Specifically, PRAM first generates landmarks directly in 3D space in a self-supervised manner. Because they do not rely on commonly used classic semantic labels, these 3D landmarks can be defined anywhere in indoor and outdoor scenes and generalize better. By representing the map with 3D landmarks, PRAM discards global descriptors, repetitive local descriptors, and redundant 3D points, which significantly improves memory efficiency. Then, sparse keypoints, rather than dense pixels, are used as the input tokens to a transformer-based recognition module for landmark recognition, which enables PRAM to recognize hundreds of landmarks with high time and memory efficiency. At test time, sparse keypoints and predicted landmark labels are used for outlier removal and landmark-wise 2D-3D matching, as opposed to exhaustive 2D-2D matching, which further improves time efficiency. A comprehensive evaluation of APRs, SCRs, HMs, and PRAM on both indoor and outdoor datasets demonstrates that PRAM outperforms APRs and SCRs in large-scale scenes by a large margin and achieves accuracy competitive with HMs while reducing memory cost by over 90% and running 2.4 times faster, leading to a better balance between efficiency and accuracy.
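The landmark-wise 2D-3D matching step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `landmark_wise_match`, the use of label `0` to mark outlier keypoints, and nearest-neighbor matching on L2-normalized descriptors are all assumptions made for illustration. The sketch only shows the general idea that each query keypoint is compared against the 3D points of its predicted landmark rather than against the whole map.

```python
import numpy as np


def landmark_wise_match(query_desc, query_labels, map_desc, map_labels,
                        outlier_label=0):
    """Match query keypoints to 3D map points within the same landmark.

    query_desc:   (Nq, D) descriptors of sparse query keypoints (assumed L2-normalized)
    query_labels: (Nq,)   predicted landmark label per keypoint
    map_desc:     (Nm, D) descriptors of 3D map points (assumed L2-normalized)
    map_labels:   (Nm,)   landmark label per 3D map point
    outlier_label: hypothetical label value marking keypoints to discard

    Returns a list of (query_index, map_index) pairs.
    """
    query_labels = np.asarray(query_labels)
    map_labels = np.asarray(map_labels)
    matches = []
    for label in np.unique(query_labels):
        if label == outlier_label:
            continue  # outlier removal: drop keypoints with the outlier label
        q_idx = np.flatnonzero(query_labels == label)
        m_idx = np.flatnonzero(map_labels == label)
        if m_idx.size == 0:
            continue  # no 3D points stored for this landmark
        # Similarity is computed only inside one landmark, not over the full map.
        sim = query_desc[q_idx] @ map_desc[m_idx].T
        best = sim.argmax(axis=1)
        matches.extend((int(q), int(m_idx[b])) for q, b in zip(q_idx, best))
    return matches
```

The resulting 2D-3D pairs would typically be passed to a PnP solver with RANSAC to estimate the camera pose; that step is omitted here. Restricting the search to one landmark at a time keeps each similarity matrix small, which is the source of the time savings the abstract attributes to landmark-wise matching.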