RemoteCLIP: A Vision Language Foundation Model for Remote Sensing

General-purpose foundation models have led to recent breakthroughs in artificial intelligence. In remote sensing, self-supervised learning (SSL) and Masked Image Modeling (MIM) have been adopted to build foundation models. However, these models primarily learn low-level features and require annotated data for fine-tuning. Moreover, they are inapplicable to retrieval and zero-shot applications because they lack language understanding. To address these limitations, we propose RemoteCLIP, the first vision-language foundation model for remote sensing, which aims to learn robust visual features with rich semantics, together with aligned text embeddings, for seamless downstream application. To address the scarcity of pre-training data, we apply data scaling: heterogeneous annotations are converted into a unified image-caption format through Box-to-Caption (B2C) and Mask-to-Box (M2B) conversion. By further incorporating UAV imagery, we produce a pre-training dataset $12\times$ larger than the combination of all available datasets. RemoteCLIP can be applied to a variety of downstream tasks, including zero-shot image classification, linear probing, $\textit{k}$-NN classification, few-shot classification, image-text retrieval, and object counting in remote sensing images. Evaluation on 16 datasets, including a newly introduced RemoteCount benchmark that tests object counting ability, shows that RemoteCLIP consistently outperforms baseline foundation models across different model scales. Notably, RemoteCLIP beats the state-of-the-art method by 9.14% mean recall on the RSITMD dataset and by 8.92% on the RSICD dataset. For zero-shot classification, RemoteCLIP outperforms the CLIP baseline by up to 6.39% average accuracy on 12 downstream datasets. Project website: https://github.com/ChenDelong1999/RemoteCLIP
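The abstract names the Mask-to-Box (M2B) and Box-to-Caption (B2C) conversions without giving their rules, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes a binary mask stored as a list of rows and a plain list of class labels for the boxes in one image. The function names `mask_to_box` and `boxes_to_caption`, the caption template, and the naive pluralization are all hypothetical.

```python
from collections import Counter


def mask_to_box(mask):
    """Hypothetical M2B step: tight box (x_min, y_min, x_max, y_max)
    around the nonzero pixels of a binary mask, or None if the mask is empty."""
    rows = [y for y, row in enumerate(mask) if any(row)]
    if not rows:
        return None
    cols = [x for row in mask for x, value in enumerate(row) if value]
    return (min(cols), rows[0], max(cols), rows[-1])


def boxes_to_caption(labels):
    """Hypothetical B2C step: turn the class labels of the boxes in one image
    into a single count-based caption. The template is an assumption."""
    counts = Counter(labels)
    if not counts:
        return "An aerial image with no annotated objects."
    # Naive pluralization (append "s"); real class names may need more care.
    parts = [
        f"{n} {name}" + ("s" if n > 1 else "")
        for name, n in sorted(counts.items())
    ]
    if len(parts) == 1:
        joined = parts[0]
    else:
        joined = ", ".join(parts[:-1]) + " and " + parts[-1]
    return f"An aerial image containing {joined}."


if __name__ == "__main__":
    mask = [
        [0, 0, 0, 0],
        [0, 1, 1, 0],
        [0, 0, 1, 0],
    ]
    print(mask_to_box(mask))
    print(boxes_to_caption(["airplane", "airplane", "car"]))
```

Chaining the two steps (segmentation mask to box, then boxes to caption) is how segmentation and detection annotations could both end up in the same image-caption format described above.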

