State-of-the-art visual localization methods mostly rely on complex procedures to match local descriptors against 3D point clouds. However, these procedures can incur significant costs in inference, storage, and updates over time. In this study, we propose a direct learning-based approach that uses a simple network, named D2S, to represent local descriptors and their scene coordinates. Our method is characterized by its simplicity and cost-effectiveness: it requires only a single RGB image for localization at test time and only a lightweight model to encode a complex sparse scene. D2S combines a simple loss function with graph attention to selectively focus on robust descriptors while disregarding areas such as clouds, trees, and several dynamic objects. This selective attention enables D2S to effectively perform binary-semantic classification of sparse descriptors. Additionally, we propose a new outdoor dataset to evaluate the capabilities of visual localization methods in terms of scene generalization and self-updating from unlabeled observations. Our approach outperforms state-of-the-art CNN-based methods for scene coordinate regression in indoor and outdoor environments. It demonstrates the ability to generalize beyond the training data, including day-to-night transitions and domain shifts, even in the absence of labeled data sources. The source code, trained models, dataset, and demo videos are available at https://thpjp.github.io/d2s
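To make the idea of regressing scene coordinates from sparse descriptors more concrete, the following is a minimal sketch, not the D2S architecture itself. It assumes a plain per-descriptor MLP (the graph attention described above is omitted), 256-dimensional descriptors, and a fourth output channel used as a reliability score for the binary classification of descriptors. The names `init_mlp`, `forward`, and `loss`, the layer sizes, and the exact loss form are all hypothetical choices made for illustration.

```python
import numpy as np

def init_mlp(rng, dims):
    """Random weights for a small per-descriptor MLP (sizes are assumptions)."""
    return [(rng.normal(0.0, 0.1, (i, o)), np.zeros(o))
            for i, o in zip(dims[:-1], dims[1:])]

def forward(params, desc):
    """Map N descriptors (N, D) to scene coordinates (N, 3) and reliability (N,)."""
    h = desc
    for w, b in params[:-1]:
        h = np.maximum(h @ w + b, 0.0)  # ReLU on hidden layers
    w, b = params[-1]
    out = h @ w + b                      # (N, 4): xyz plus one reliability logit
    coords = out[:, :3]
    reliability = 1.0 / (1.0 + np.exp(-out[:, 3]))  # sigmoid, values in (0, 1)
    return coords, reliability

def loss(coords, reliability, gt_coords, gt_mask):
    """Coordinate error on labeled descriptors plus binary cross-entropy on reliability.

    gt_mask is 1 for descriptors with a known 3D point and 0 otherwise
    (an assumed labeling scheme, e.g. points on sky or moving objects get 0).
    """
    eps = 1e-7
    per_point = np.linalg.norm(coords - gt_coords, axis=1)
    coord_term = (gt_mask * per_point).sum() / max(gt_mask.sum(), 1.0)
    r = np.clip(reliability, eps, 1.0 - eps)
    bce = -(gt_mask * np.log(r) + (1.0 - gt_mask) * np.log(1.0 - r)).mean()
    return coord_term + bce

# Toy data standing in for descriptors extracted from one RGB image.
rng = np.random.default_rng(0)
params = init_mlp(rng, [256, 128, 4])
desc = rng.normal(size=(5, 256))
gt_coords = rng.normal(size=(5, 3))
gt_mask = np.array([1.0, 1.0, 0.0, 1.0, 0.0])

coords, reliability = forward(params, desc)
keep = reliability > 0.5   # descriptors treated as robust at test time
total = loss(coords, reliability, gt_coords, gt_mask)
```

At test time, the 2D keypoints whose descriptors pass the `keep` filter, paired with their predicted 3D coordinates, would typically be handed to a PnP solver with RANSAC to recover the camera pose; that step is left out here to keep the sketch dependency-free.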