Visual localization is a fundamental task for applications such as autonomous driving and robotics. Prior methods focus on extracting large numbers of locally reliable features, many of them redundant, which limits both efficiency and accuracy, especially in large-scale environments under challenging conditions. Instead, we propose to extract globally reliable features by implicitly embedding high-level semantics into both the detection and description processes. Specifically, our semantic-aware detector detects keypoints from reliable regions (e.g., buildings, traffic lanes) and suppresses unreliable areas (e.g., sky, cars) implicitly, rather than relying on explicit semantic labels. This boosts the accuracy of keypoint matching by reducing the number of features sensitive to appearance changes, and it avoids the need for an additional segmentation network at test time. Moreover, our descriptors are augmented with semantics and have stronger discriminative ability, providing more inliers at test time. In particular, experiments on the long-term, large-scale visual localization datasets Aachen Day-Night and RobotCar-Seasons demonstrate that our model outperforms previous local features and achieves accuracy competitive with advanced matchers while being about 2 and 3 times faster when using 2k and 4k keypoints, respectively.
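To make the idea of suppressing keypoints in unreliable regions concrete, the following is a minimal illustrative sketch, not the method described above. The abstract states that the detector learns this behavior implicitly, without explicit labels or a segmentation network at test time; the sketch instead assumes a hypothetical per-pixel `reliability` map (values in [0, 1]) purely to show the effect of reweighting a detection score map before picking the top-k keypoints. The function name `select_keypoints` and both input maps are assumptions introduced for illustration.

```python
import heapq
from typing import List, Tuple


def select_keypoints(
    scores: List[List[float]],
    reliability: List[List[float]],
    k: int,
) -> List[Tuple[int, int]]:
    """Return up to k (row, col) positions with the highest reweighted score.

    `scores` is a hypothetical detection score map and `reliability` is a
    hypothetical per-pixel weight in [0, 1] (0 = unreliable, e.g. sky).
    Positions whose reweighted score is not positive are dropped.
    """
    # Multiply each detection score by its reliability weight.
    weighted = [
        (s * r, (i, j))
        for i, (score_row, rel_row) in enumerate(zip(scores, reliability))
        for j, (s, r) in enumerate(zip(score_row, rel_row))
    ]
    # Keep the k largest reweighted scores, in descending order.
    top = heapq.nlargest(k, weighted, key=lambda item: item[0])
    return [pos for weight, pos in top if weight > 0]


# Toy 2x2 example: the top-left pixel has the highest raw score but lies in
# an unreliable region (weight 0.0), so it is never selected.
scores = [[0.9, 0.2], [0.8, 0.7]]
reliability = [[0.0, 1.0], [1.0, 0.5]]
keypoints = select_keypoints(scores, reliability, k=2)
```

In this toy setting, the pixel with the strongest raw response is discarded because its assumed reliability is zero, which mirrors the stated goal of reducing the number of features sensitive to appearance changes.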