Pre-trained vision-language foundation models (VLMs) that leverage extensive image-text paired data have demonstrated unprecedented image-text association capabilities, achieving remarkable results across various downstream tasks. A critical challenge is how to use existing large-scale pre-trained VLMs, which are trained on common objects, to perform domain-specific transfer for domain-related downstream tasks. In this paper, we propose a new framework that includes the Domain Foundation Model (DFM), bridging the gap between the General Foundation Model (GFM) and domain-specific downstream tasks. Moreover, we present RS5M, an image-text paired dataset in the field of remote sensing (RS) that contains 5 million RS images with English descriptions. The dataset is obtained by filtering publicly available image-text paired datasets and by captioning label-only RS datasets with a pre-trained VLM. Together, these constitute the first large-scale RS image-text paired dataset. Additionally, we evaluate several Parameter-Efficient Fine-Tuning methods on RS5M to implement the DFM. Experimental results show that our proposed dataset is highly effective for various tasks, improving upon the baseline by $8\% \sim 16\%$ in zero-shot classification tasks and obtaining good results in both Vision-Language Retrieval and Semantic Localization tasks. \url{https://github.com/om-ai-lab/RS5M}