Existing Vision-Language Models (VLMs) are predominantly trained on web-scraped, noisy image-text data, giving them limited exposure to the specialized domain of Remote Sensing (RS). This deficiency results in poor performance on RS-specific tasks, as commonly used datasets often lack detailed, scientifically accurate textual descriptions and instead emphasize only attributes such as date and location. To bridge this critical gap, we introduce GAIA, a novel dataset designed for multi-scale, multi-sensor, and multi-modal RS image analysis. GAIA comprises 201,005 meticulously curated RS image-text pairs, representing a diverse range of RS modalities associated with different spatial resolutions. Unlike existing vision-language datasets in RS, GAIA specifically focuses on capturing a diverse range of RS applications, providing unique information about environmental changes, natural disasters, and various other dynamic phenomena. The dataset is spatially and temporally balanced, spanning the globe and covering the last 25 years of observations. GAIA's construction involved a two-stage process: (1) targeted web-scraping of images and accompanying text from reputable RS-related sources, and (2) generation of five high-quality, scientifically grounded synthetic captions per image using carefully crafted prompts that leverage the advanced vision-language capabilities of GPT-4o. Our extensive experiments, including fine-tuning of CLIP and BLIP2 models, demonstrate that GAIA significantly improves performance on RS image classification, cross-modal retrieval, and image captioning tasks. We make our dataset, automated processing framework, and fine-tuned model weights publicly available on our project's GitHub repository: https://github.com/Orion-AI-Lab/GAIA.
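The CLIP fine-tuning mentioned above relies on CLIP's standard symmetric contrastive (InfoNCE) objective over image-text pairs. As a minimal illustration (not the authors' training code), the sketch below computes that loss with NumPy on toy embeddings; the function name, embedding dimensions, and temperature value are illustrative assumptions.

```python
import numpy as np

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric CLIP-style contrastive loss (illustrative sketch).

    Row i of image_emb is assumed to pair with row i of text_emb.
    """
    # L2-normalize embeddings so the dot product is cosine similarity
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = image_emb @ text_emb.T / temperature  # (N, N) similarity matrix
    n = logits.shape[0]

    def cross_entropy(l):
        # Softmax cross-entropy with the diagonal as the correct pairing
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(n), np.arange(n)].mean()

    # Average the image-to-text and text-to-image directions
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))

rng = np.random.default_rng(0)
img = rng.normal(size=(4, 8))
txt = img + 0.01 * rng.normal(size=(4, 8))  # near-matching toy pairs
loss = clip_contrastive_loss(img, txt)
```

With well-matched pairs the diagonal dominates the similarity matrix, so the loss is near zero; mismatched pairings drive it up, which is the signal that pulls image and caption embeddings together during fine-tuning.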