Geo-tagged images are publicly available in large quantities, whereas labels such as object classes are scarce and expensive to collect. Meanwhile, contrastive learning has achieved tremendous success in various natural image and language tasks with limited labeled data. However, existing methods fail to fully leverage geospatial information, which can be crucial for distinguishing objects that are visually similar. To directly leverage the abundant geospatial information associated with images in the pre-training, fine-tuning, and inference stages, we present Contrastive Spatial Pre-Training (CSP), a self-supervised learning framework for geo-tagged images. We use a dual encoder to encode images and their corresponding geo-locations separately, and use contrastive objectives to learn effective location representations from images; these representations can be transferred to downstream supervised tasks such as image classification. Experiments show that CSP improves model performance on both the iNat2018 and fMoW datasets. In particular, on iNat2018, CSP yields a 10-34% relative improvement across various sampling ratios of the labeled training data.
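To make the dual-encoder idea concrete, the following is a minimal sketch of one possible contrastive objective between image and location embeddings. It is not the authors' implementation: the symmetric in-batch InfoNCE loss, the sinusoidal `encode_location` function, the `temperature` value, and the random projection `W` are all assumptions chosen for illustration, and the image embeddings are stand-ins for the output of an image encoder.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)

def encode_location(lonlat_deg, freqs):
    """Hypothetical location encoder: multi-scale sin/cos features of (lon, lat).

    lonlat_deg: (B, 2) array of longitude/latitude in degrees.
    freqs:      (F,) array of frequencies (an assumed hyperparameter).
    Returns a (B, 4F) feature array.
    """
    rad = np.deg2rad(lonlat_deg)                      # (B, 2)
    scaled = rad[:, :, None] * freqs[None, None, :]   # (B, 2, F)
    feats = np.concatenate([np.sin(scaled), np.cos(scaled)], axis=-1)  # (B, 2, 2F)
    return feats.reshape(rad.shape[0], -1)            # (B, 4F)

def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def in_batch_contrastive_loss(img_emb, loc_emb, temperature=0.1):
    """Symmetric InfoNCE loss (an assumed choice of contrastive objective).

    Row i of img_emb and row i of loc_emb form a positive pair; every other
    row in the batch serves as a negative.
    """
    img = l2_normalize(img_emb)
    loc = l2_normalize(loc_emb)
    logits = img @ loc.T / temperature                # (B, B) similarity matrix
    diag = np.arange(logits.shape[0])
    loss_img_to_loc = -log_softmax(logits, axis=1)[diag, diag].mean()
    loss_loc_to_img = -log_softmax(logits, axis=0)[diag, diag].mean()
    return 0.5 * (loss_img_to_loc + loss_loc_to_img)

# Usage with stand-in data: random "image embeddings" and made-up coordinates.
rng = np.random.default_rng(0)
img_emb = rng.normal(size=(4, 16))                    # placeholder for image encoder output
lonlat = np.array([[-122.4, 37.8], [2.35, 48.9], [139.7, 35.7], [151.2, -33.9]])
loc_feats = encode_location(lonlat, freqs=np.array([1.0, 2.0, 4.0]))  # (4, 12)
W = rng.normal(size=(loc_feats.shape[1], 16))         # hypothetical projection to the shared space
loss = in_batch_contrastive_loss(img_emb, loc_feats @ W)
```

Minimizing such a loss pulls each image embedding toward the embedding of its own location and pushes it away from the other locations in the batch, which is the general mechanism by which a location encoder can be trained without class labels.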