Image geolocalization is the challenging task of predicting the geographic coordinates of origin for a given photo. It is an unsolved problem relying on the ability to combine visual clues with general knowledge about the world to make accurate predictions across geographies. We present $\href{https://huggingface.co/geolocal/StreetCLIP}{\text{StreetCLIP}}$, a robust, publicly available foundation model not only achieving state-of-the-art performance on multiple open-domain image geolocalization benchmarks but also doing so in a zero-shot setting, outperforming supervised models trained on more than 4 million images. Our method introduces a meta-learning approach for generalized zero-shot learning by pretraining CLIP from synthetic captions, grounding CLIP in a domain of choice. We show that our method effectively transfers CLIP's generalized zero-shot capabilities to the domain of image geolocalization, improving in-domain generalized zero-shot performance without finetuning StreetCLIP on a fixed set of classes.
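To make the zero-shot setting concrete, the following is a minimal sketch of how a CLIP-style model can pick a location label by comparing one image embedding against the embeddings of text captions, with no classifier trained on a fixed set of classes. It is an illustration under stated assumptions, not the authors' implementation: the caption template, the toy three-dimensional embeddings, and the `scale` value are all hypothetical stand-ins for what a real image and text encoder would produce.

```python
import math


def make_caption(country: str) -> str:
    # Hypothetical template; the abstract does not state the caption wording.
    return f"A street-level photo taken in {country}."


def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


def zero_shot_predict(image_emb, caption_embs, scale=100.0):
    # Score every caption against the image, then apply a softmax.
    # `scale` plays the role of a CLIP-style logit temperature (assumed value).
    sims = {cap: cosine(image_emb, emb) for cap, emb in caption_embs.items()}
    top = max(sims.values())  # subtract the max for numerical stability
    exps = {cap: math.exp(scale * (s - top)) for cap, s in sims.items()}
    total = sum(exps.values())
    return {cap: e / total for cap, e in exps.items()}


# Toy embeddings standing in for text-encoder and image-encoder outputs.
caption_embs = {
    make_caption("France"): [0.9, 0.1, 0.0],
    make_caption("Japan"): [0.1, 0.9, 0.1],
    make_caption("Brazil"): [0.0, 0.2, 0.9],
}
image_emb = [0.2, 0.8, 0.1]

probs = zero_shot_predict(image_emb, caption_embs)
best = max(probs, key=probs.get)
print(best)
```

Because the candidate labels enter only as text, the set of countries can be changed at inference time without retraining, which is the property the abstract refers to as generalized zero-shot learning.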