Geo-tagged images are publicly available in large quantities, whereas labels such as object classes are scarce and expensive to collect. Meanwhile, contrastive learning has achieved tremendous success in various natural image and language tasks with limited labeled data. However, existing methods fail to fully leverage geospatial information, which can be paramount to distinguishing objects that are visually similar. To directly leverage the abundant geospatial information associated with images in the pre-training, fine-tuning, and inference stages, we present Contrastive Spatial Pre-Training (CSP), a self-supervised learning framework for geo-tagged images. We use a dual-encoder to separately encode the images and their corresponding geo-locations, and use contrastive objectives to learn effective location representations from images, which can be transferred to downstream supervised tasks such as image classification. Experiments show that CSP improves model performance on both the iNat2018 and fMoW datasets. In particular, on iNat2018, CSP yields a 10-34% relative improvement across various labeled training data sampling ratios.
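The dual-encoder contrastive setup described above can be illustrated with a minimal sketch. The abstract does not specify the exact objectives or encoder architectures, so everything here is an assumption for illustration: an InfoNCE-style loss with in-batch negatives stands in for the contrastive objective, `location_features` is a hypothetical toy location encoder, and the temperature value is arbitrary.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)


def location_features(lonlat_deg):
    """Hypothetical toy location encoder (not the paper's).

    Maps (longitude, latitude) in degrees to sin/cos features,
    turning an array of shape (B, 2) into one of shape (B, 4).
    """
    rad = np.deg2rad(np.asarray(lonlat_deg, dtype=float))
    return np.concatenate([np.sin(rad), np.cos(rad)], axis=1)


def info_nce(image_emb, loc_emb, temperature=0.1):
    """InfoNCE-style loss with in-batch negatives (an assumed objective).

    Row i of image_emb and row i of loc_emb form a positive pair;
    every other location in the batch serves as a negative for image i.
    """
    img = l2_normalize(np.asarray(image_emb, dtype=float))
    loc = l2_normalize(np.asarray(loc_emb, dtype=float))
    logits = img @ loc.T / temperature  # (B, B) similarity matrix
    # Numerically stable row-wise log-softmax.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Positive pairs lie on the diagonal.
    return float(-np.mean(np.diag(log_prob)))
```

In this sketch, the loss is small when each image embedding is most similar to its own location embedding and large when the pairing is shuffled. In a real pipeline both encoders would be trainable networks projecting into a shared embedding space, and this loss would be minimized over unlabeled geo-tagged images before fine-tuning on labeled data.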