Pre-trained Vision-Language Foundation Models (VLMs) trained on extensive image-text paired data have demonstrated unprecedented image-text association capabilities, achieving remarkable results across various downstream tasks. A critical challenge is how to use existing large-scale pre-trained VLMs, which are trained on common objects, to perform domain-specific transfer for domain-related downstream tasks. In this paper, we propose a new framework that includes the Domain Foundation Model (DFM), bridging the gap between the General Foundation Model (GFM) and domain-specific downstream tasks. Moreover, we present RS5M, an image-text paired dataset in the field of remote sensing (RS), which has 5 million RS images with English descriptions. The dataset is obtained by filtering publicly available image-text paired datasets and by captioning label-only RS datasets with a pre-trained VLM. Together, these constitute the first large-scale RS image-text paired dataset. Additionally, we explore several Parameter-Efficient Fine-Tuning methods on RS5M to implement the DFM. Experimental results show that our proposed dataset is highly effective for various tasks, improving upon the baseline by $8\% \sim 16\%$ in zero-shot classification tasks and obtaining good results in both Vision-Language Retrieval and Semantic Localization tasks. Finally, we show successful results of training an RS Stable Diffusion model using RS5M, uncovering more use cases of the dataset.