State-of-the-art visual localization methods mostly rely on complex procedures to match local descriptors against 3D point clouds. However, these procedures incur significant costs in inference, storage, and updates over time. In this study, we propose a direct learning-based approach that uses a simple network, named D2S, to represent complex local descriptors and their scene coordinates. Our method is simple and cost-effective: it requires only a single RGB image for localization at test time and only a lightweight model to encode a complex sparse scene. The proposed D2S combines a simple loss function with graph attention to focus selectively on robust descriptors while disregarding unreliable regions such as clouds, trees, and dynamic objects. This selective attention enables D2S to perform an effective binary-semantic classification of sparse descriptors. Additionally, we introduce a simple outdoor dataset for evaluating visual localization methods on scene-specific generalization and self-updating from unlabeled observations. Our approach outperforms state-of-the-art CNN-based scene coordinate regression methods in both indoor and outdoor environments. It demonstrates the ability to generalize beyond the training data, including day-to-night transitions and domain shifts, even in the absence of labeled data sources. The source code, trained models, dataset, and demo videos are available at the following link: https://thpjp.github.io/d2s.
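To make the pipeline concrete, below is a minimal sketch of a D2S-style model: a network that maps a set of sparse local descriptors to 3D scene coordinates plus a robustness logit, trained with a regression term on robust points and a binary classification term. This is an illustration under assumptions, not the authors' exact architecture: it assumes PyTorch, 256-dimensional SuperPoint-like descriptors, plain self-attention as a stand-in for the paper's graph attention module, and hypothetical layer sizes, names (`D2SSketch`, `d2s_loss`), and loss weighting.

```python
# Illustrative sketch only. Assumptions (not from the paper): PyTorch,
# 256-d descriptors, self-attention standing in for graph attention,
# and an ad hoc loss weighting `beta`.
import torch
import torch.nn as nn


class D2SSketch(nn.Module):
    def __init__(self, desc_dim: int = 256, hidden: int = 512, heads: int = 4):
        super().__init__()
        self.embed = nn.Sequential(nn.Linear(desc_dim, hidden), nn.ReLU())
        # Attention over the descriptor set lets each keypoint gather context
        # from the others (a stand-in for the paper's graph attention).
        self.attn = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.coord_head = nn.Linear(hidden, 3)  # predicted 3D scene coordinate
        self.conf_head = nn.Linear(hidden, 1)   # robust-vs-unreliable logit

    def forward(self, desc):                    # desc: (B, N, desc_dim)
        x = self.embed(desc)
        x, _ = self.attn(x, x, x)
        return self.coord_head(x), self.conf_head(x).squeeze(-1)


def d2s_loss(pred_xyz, logit, gt_xyz, robust_mask, beta: float = 1.0):
    """Regress coordinates only for robust descriptors; classify robustness.

    robust_mask: (B, N) bool, True where a descriptor has a valid 3D point.
    """
    coord_err = (pred_xyz - gt_xyz).norm(dim=-1)                   # (B, N)
    reg = (coord_err * robust_mask).sum() / robust_mask.sum().clamp(min=1)
    cls = nn.functional.binary_cross_entropy_with_logits(
        logit, robust_mask.float())
    return reg + beta * cls
```

At test time, the predicted 2D-3D correspondences (keeping only descriptors with high robustness scores) would feed a standard PnP-with-RANSAC solver to recover the camera pose, which is the usual final step in scene coordinate regression pipelines.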