Existing Vision-Language Models (VLMs) are predominantly trained on web-scraped, noisy image-text data, giving them limited exposure to the specialized domain of Remote Sensing (RS). This deficiency results in poor performance on RS-specific tasks, as commonly used datasets often lack detailed, scientifically accurate textual descriptions and instead emphasize only attributes such as date and location. To bridge this critical gap, we introduce GAIA, a novel dataset designed for multi-scale, multi-sensor, and multi-modal RS image analysis. GAIA comprises 201,005 meticulously curated RS image-text pairs, representing a diverse range of RS modalities associated with different spatial resolutions. Unlike existing vision-language datasets in RS, GAIA specifically focuses on capturing a diverse range of RS applications, providing unique information about environmental changes, natural disasters, and various other dynamic phenomena. The dataset is spatially and temporally balanced, spanning the globe and covering the last 25 years of observations. GAIA's construction involved a two-stage process: (1) targeted web-scraping of images and accompanying text from reputable RS-related sources, and (2) generation of five high-quality, scientifically grounded synthetic captions per image using carefully crafted prompts that leverage the advanced vision-language capabilities of GPT-4o. Our extensive experiments, including fine-tuning of CLIP and BLIP2 models, demonstrate that GAIA significantly improves performance on RS image classification, cross-modal retrieval, and image captioning tasks. We make our dataset, automated processing framework, and fine-tuned model weights publicly available on our project's GitHub repository: https://github.com/Orion-AI-Lab/GAIA.
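The CLIP fine-tuning mentioned above relies on CLIP's standard symmetric contrastive (InfoNCE) objective over image-text pairs. As a minimal illustration (not the authors' training code), the sketch below computes that loss with NumPy on toy embeddings; the function name, embedding dimensions, and temperature value are illustrative assumptions.

```python
import numpy as np

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric CLIP-style contrastive loss (illustrative sketch).

    Row i of image_emb is assumed to pair with row i of text_emb.
    """
    # L2-normalize embeddings so the dot product is cosine similarity
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = image_emb @ text_emb.T / temperature  # (N, N) similarity matrix
    n = logits.shape[0]

    def cross_entropy(l):
        # Softmax cross-entropy with the diagonal as the correct pairing
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(n), np.arange(n)].mean()

    # Average the image-to-text and text-to-image directions
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))

rng = np.random.default_rng(0)
img = rng.normal(size=(4, 8))
txt = img + 0.01 * rng.normal(size=(4, 8))  # near-matching toy pairs
loss = clip_contrastive_loss(img, txt)
```

With well-matched pairs the diagonal dominates the similarity matrix, so the loss is near zero; mismatched pairings drive it up, which is the signal that pulls image and caption embeddings together during fine-tuning.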