General-purpose foundation models have led to recent breakthroughs in artificial intelligence. In remote sensing, self-supervised learning (SSL) and Masked Image Modeling (MIM) have been adopted to build foundation models. However, these models primarily learn low-level features and require annotated data for fine-tuning. Moreover, they cannot be applied to retrieval and zero-shot tasks because they lack language understanding. To address these limitations, we propose RemoteCLIP, the first vision-language foundation model for remote sensing, which aims to learn robust visual features with rich semantics, together with aligned text embeddings, for seamless downstream application. To address the scarcity of pre-training data, we apply data scaling, converting heterogeneous annotations into a unified image-caption format via Box-to-Caption (B2C) and Mask-to-Box (M2B) conversion. By further incorporating UAV imagery, we produce a pre-training dataset 12$\times$ larger than the combination of all available datasets. RemoteCLIP can be applied to a variety of downstream tasks, including zero-shot image classification, linear probing, $\textit{k}$-NN classification, few-shot classification, image-text retrieval, and object counting in remote sensing images. Evaluation on 16 datasets, including a newly introduced RemoteCount benchmark that tests object counting ability, shows that RemoteCLIP consistently outperforms baseline foundation models across different model scales. In image-text retrieval, RemoteCLIP outperforms the state-of-the-art method by 9.14% mean recall on the RSITMD dataset and by 8.92% on the RSICD dataset. In zero-shot classification, RemoteCLIP outperforms the CLIP baseline by up to 6.39% average accuracy on 12 downstream datasets. Project website: https://github.com/ChenDelong1999/RemoteCLIP
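To make the annotation-conversion idea concrete, the following is a minimal sketch of what Mask-to-Box and Box-to-Caption steps could look like. It is not the RemoteCLIP implementation: the function names `mask_to_box` and `boxes_to_caption`, the nested-list mask format, the caption template, and the naive pluralization are all assumptions made for illustration.

```python
from collections import Counter


def mask_to_box(mask):
    """Hypothetical M2B step: tight box (x_min, y_min, x_max, y_max)
    around the nonzero pixels of a binary mask, or None if the mask is empty.
    The mask is assumed to be a list of rows of 0/1 values."""
    rows = [y for y, row in enumerate(mask) if any(row)]
    if not rows:
        return None
    cols = [x for row in mask for x, value in enumerate(row) if value]
    return (min(cols), rows[0], max(cols), rows[-1])


def boxes_to_caption(labels):
    """Hypothetical B2C step: turn the class labels of the boxes in one
    image into a count-based caption (template is an assumption)."""
    counts = Counter(labels)
    if not counts:
        return "An aerial image."
    parts = []
    for name, n in sorted(counts.items()):
        # Naive pluralization: append "s" when there is more than one object.
        parts.append(f"{n} {name}" + ("s" if n > 1 else ""))
    return "An aerial image containing " + ", ".join(parts) + "."


if __name__ == "__main__":
    mask = [
        [0, 0, 0, 0],
        [0, 1, 1, 0],
        [0, 0, 1, 0],
    ]
    print(mask_to_box(mask))
    print(boxes_to_caption(["ship", "airplane", "ship"]))
```

In this sketch, segmentation annotations would first be reduced to boxes, and the box labels for each image would then be summarized as a caption, so that detection and segmentation datasets end up in the same image-caption format.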