State-of-the-art visual localization methods mostly rely on complex procedures to match local descriptors against 3D point clouds. However, these procedures can incur significant costs in inference, storage, and updates over time. In this study, we propose a direct learning-based approach that uses a simple network, named D2S, to represent local descriptors and their scene coordinates. Our method is characterized by its simplicity and cost-effectiveness: it requires only a single RGB image for localization at test time and only a lightweight model to encode a complex sparse scene. D2S combines a simple loss function with graph attention to selectively focus on robust descriptors while disregarding areas such as clouds, trees, and several dynamic objects. This selective attention enables D2S to effectively perform binary-semantic classification of sparse descriptors. Additionally, we propose a new outdoor dataset to evaluate the capabilities of visual localization methods in terms of scene generalization and self-updating from unlabeled observations. Our approach outperforms state-of-the-art CNN-based scene coordinate regression methods in both indoor and outdoor environments. It demonstrates the ability to generalize beyond the training data, including scenarios involving day-to-night transitions and domain shifts, even in the absence of labeled data sources. The source code, trained models, dataset, and demo videos are available at https://thpjp.github.io/d2s