Planet-scale image geolocalization remains a challenging problem because images can originate from anywhere in the world and are correspondingly diverse. Although approaches based on vision transformers have made significant progress in geolocalization accuracy, success in prior literature has been confined to narrow distributions of landmark images, and performance has not generalized to unseen places. We present a new geolocalization system that combines semantic geocell creation, multi-task contrastive pretraining, and a novel loss function. Our work is also the first to perform retrieval over location clusters for guess refinement. We train two models, one evaluated on street-level data and one on general-purpose image geolocalization. The first model, PIGEON, is trained on data from the game of Geoguessr and places over 40% of its guesses within 25 kilometers of the target location globally. We also develop a bot and deploy PIGEON in a blind experiment against humans, where it ranks in the top 0.01% of players. We further challenge one of the world's foremost professional Geoguessr players to a series of six matches with millions of viewers, winning all six games. Our second model, PIGEOTTO, is instead trained on a dataset of images from Flickr and Wikipedia and achieves state-of-the-art results on a wide range of image geolocalization benchmarks, outperforming the previous SOTA by up to 7.7 percentage points at the city accuracy level and up to 38.8 percentage points at the country level. Our findings suggest that PIGEOTTO is the first image geolocalization model that effectively generalizes to unseen places and that our approach can pave the way for highly accurate, planet-scale image geolocalization systems. Our code is available on GitHub.
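The "guesses within 25 kilometers" figure quoted in the abstract is a distance-threshold accuracy: the share of predictions whose distance to the true location falls under a given radius. The following is a minimal sketch of how such a metric could be computed, assuming great-circle (haversine) distance on a spherical Earth. The function names, the Earth-radius constant, and the sample coordinates are illustrative assumptions and are not taken from the authors' code.

```python
import math

# Mean Earth radius in kilometers (assumption: spherical Earth model).
EARTH_RADIUS_KM = 6371.0088


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot near antipodes.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def fraction_within(guesses, targets, radius_km):
    """Share of guesses that land within radius_km of their paired target.

    guesses and targets are equal-length sequences of (lat, lon) tuples.
    """
    if len(guesses) != len(targets):
        raise ValueError("guesses and targets must have the same length")
    if not guesses:
        raise ValueError("at least one guess is required")
    hits = sum(
        1
        for (g_lat, g_lon), (t_lat, t_lon) in zip(guesses, targets)
        if haversine_km(g_lat, g_lon, t_lat, t_lon) <= radius_km
    )
    return hits / len(guesses)


if __name__ == "__main__":
    # Hypothetical guesses and targets, for illustration only.
    guesses = [(48.86, 2.35), (40.71, -74.01), (35.68, 139.69)]
    targets = [(48.85, 2.29), (34.05, -118.24), (35.66, 139.70)]
    print(f"within 25 km: {fraction_within(guesses, targets, 25.0):.2%}")
```

Calling `fraction_within` with several radii on the same predictions gives one accuracy figure per radius, which is how a model can be compared at finer and coarser geographic scales.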