Determining where on Earth an image was taken is a complex visual task, which makes it particularly relevant for evaluating computer vision algorithms. Yet the absence of standard, large-scale, open-access datasets with reliably localizable images has limited its potential as a benchmark. To address this issue, we introduce OpenStreetView-5M, a large-scale, open-access dataset comprising over 5.1 million geo-referenced street view images, covering 225 countries and territories. In contrast to existing benchmarks, we enforce a strict train/test separation, allowing us to evaluate the relevance of learned geographical features beyond mere memorization. To demonstrate the utility of our dataset, we conduct an extensive benchmark of various state-of-the-art image encoders, spatial representations, and training strategies. All associated code and models can be found at https://github.com/gastruc/osv5m.