Visual localization is a fundamental task for various applications, including autonomous driving and robotics. Prior methods focus on extracting large numbers of locally reliable features, many of them redundant, which limits both efficiency and accuracy, especially in large-scale environments under challenging conditions. Instead, we propose to extract globally reliable features by implicitly embedding high-level semantics into both the detection and description processes. Specifically, our semantic-aware detector detects keypoints from reliable regions (e.g., buildings, traffic lanes) and suppresses unreliable areas (e.g., sky, cars) implicitly, rather than relying on explicit semantic labels. This boosts the accuracy of keypoint matching by reducing the number of features sensitive to appearance changes, and it avoids the need for additional segmentation networks at test time. Moreover, our descriptors are augmented with semantics and have stronger discriminative ability, providing more inliers at test time. In particular, experiments on the long-term, large-scale visual localization datasets Aachen Day-Night and RobotCar-Seasons demonstrate that our model outperforms previous local features and achieves accuracy competitive with advanced matchers, while being about 2 and 3 times faster when using 2k and 4k keypoints, respectively.
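The effect of suppressing unreliable regions during detection can be illustrated with a toy example. The sketch below is not the proposed model, which learns this behavior implicitly inside the network; it assumes a hypothetical per-pixel reliability map is already available and simply weights a detection score map by it before keeping the top-k pixels. The function name `select_keypoints`, the grid values, and the 0-to-1 reliability scale are all illustrative assumptions.

```python
import heapq
from typing import List, Tuple

Grid = List[List[float]]


def select_keypoints(detection: Grid, reliability: Grid, k: int) -> List[Tuple[int, int, float]]:
    """Return the k pixels with the highest reliability-weighted detection score.

    Hypothetical helper for illustration only: each output item is
    (row, col, weighted_score), ordered from highest to lowest score.
    """
    if len(detection) != len(reliability):
        raise ValueError("detection and reliability must have the same height")
    candidates = []
    for r, (det_row, rel_row) in enumerate(zip(detection, reliability)):
        if len(det_row) != len(rel_row):
            raise ValueError("detection and reliability must have the same width")
        for c, (score, weight) in enumerate(zip(det_row, rel_row)):
            # A weight near 0 suppresses the pixel; a weight near 1 keeps it.
            candidates.append((score * weight, r, c))
    top = heapq.nlargest(k, candidates)
    return [(r, c, s) for s, r, c in top]


# Toy 3x3 image: the top row stands in for "sky" (reliability 0.0),
# the remaining rows for "building" (reliability 1.0).
detection = [
    [0.9, 0.8, 0.7],
    [0.2, 0.6, 0.1],
    [0.5, 0.3, 0.4],
]
reliability = [
    [0.0, 0.0, 0.0],
    [1.0, 1.0, 1.0],
    [1.0, 1.0, 1.0],
]

keypoints = select_keypoints(detection, reliability, k=2)
for row, col, score in keypoints:
    print(f"keypoint at ({row}, {col}) with weighted score {score:.2f}")
```

Although the strongest raw detection responses lie in the top row, none of them is selected once the scores are weighted, which mirrors the idea of keeping fewer features that are sensitive to appearance changes.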