Humans can orient themselves in 3D environments using simple 2D maps. In contrast, algorithms for visual localization mostly rely on complex 3D point clouds that are expensive to build, store, and maintain over time. We bridge this gap by introducing OrienterNet, the first deep neural network that can localize an image with sub-meter accuracy using the same 2D semantic maps that humans use. OrienterNet estimates the location and orientation of a query image by matching a neural Bird's-Eye View with open and globally available maps from OpenStreetMap, enabling anyone to localize anywhere such maps are available. OrienterNet is supervised only by camera poses, yet it learns end-to-end to perform semantic matching with a wide range of map elements. To enable this, we introduce a large crowd-sourced dataset of images captured across 12 cities from the diverse viewpoints of cars, bikes, and pedestrians. OrienterNet generalizes to new datasets and pushes the state of the art in both robotics and AR scenarios. The code and trained model will be released publicly.
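To make the idea of matching a Bird's-Eye View against a 2D map concrete, the following is a toy sketch only, not the OrienterNet method. It assumes a hand-made grid of integer class labels in place of learned features, searches exhaustively over positions and four 90-degree rotations, and scores a pose by counting agreeing cells. The names `TOY_MAP`, `PATCH`, `rotate90`, `match_score`, and `localize` are hypothetical and chosen for illustration.

```python
from typing import List, Tuple

Grid = List[List[int]]

# Hypothetical semantic raster: 0 = free space, 1-3 = arbitrary map classes.
TOY_MAP: Grid = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0],
    [0, 0, 2, 0, 0, 0],
    [0, 0, 0, 0, 3, 0],
    [0, 0, 0, 0, 0, 0],
]

# Hypothetical BEV observation: the 2x2 map patch whose top-left cell is (1, 1).
PATCH: Grid = [
    [1, 1],
    [0, 2],
]


def rotate90(grid: Grid) -> Grid:
    """Return the grid rotated by 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]


def match_score(map_grid: Grid, bev: Grid, top: int, left: int) -> int:
    """Count BEV cells whose label agrees with the map at offset (top, left).

    Negative BEV labels mark unobserved cells and are skipped (an assumption
    of this sketch).
    """
    score = 0
    for r, row in enumerate(bev):
        for c, label in enumerate(row):
            if label >= 0 and map_grid[top + r][left + c] == label:
                score += 1
    return score


def localize(map_grid: Grid, bev: Grid) -> Tuple[int, int, int]:
    """Exhaustively search offsets and quarter turns; return (top, left, turns).

    `turns` is the number of clockwise quarter turns applied to the BEV
    before it best matches the map. Ties keep the first pose found.
    """
    best_score, best_pose = -1, (0, 0, 0)
    template = bev
    for turns in range(4):
        height, width = len(template), len(template[0])
        for top in range(len(map_grid) - height + 1):
            for left in range(len(map_grid[0]) - width + 1):
                score = match_score(map_grid, template, top, left)
                if score > best_score:
                    best_score, best_pose = score, (top, left, turns)
        template = rotate90(template)
    return best_pose


if __name__ == "__main__":
    print(localize(TOY_MAP, PATCH))            # BEV already aligned with the map
    print(localize(TOY_MAP, rotate90(PATCH)))  # BEV observed with a quarter-turn offset
```

The sketch only illustrates the search over position and orientation; a learned system would replace the hard label comparison with scores computed from learned features, and the coarse quarter-turn rotation set is a simplification made here for brevity.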