FusionRS: A Large-Scale RGB-Infrared Remote Sensing Dataset for Dual-Modal Vision-Language Foundation Models

Remote sensing vision-language models have advanced Earth observation understanding, but most existing work remains centered on RGB imagery, leaving the complementary information in infrared data underexplored. Infrared images provide distinctive cues, including thermal intensity structures, object boundaries, and illumination-invariant scene features, which can enrich vision-language learning beyond conventional RGB observations. However, a large-scale RGB-infrared-text dataset for remote sensing vision-language modeling is still lacking. To address this gap, we introduce FusionRS, the first large-scale RGB-infrared-text dataset designed for dual-modal vision-language learning in remote sensing. FusionRS is constructed by translating diverse public RGB remote sensing images into infrared-style counterparts, forming aligned RGB-IR image pairs. Each pair is associated with two kinds of text: conventional scene captions, and IR-aware captions that explicitly describe infrared-specific visual properties while preserving semantic content. Based on FusionRS, we train dual-modal vision-language foundation models for joint RGB-IR understanding. We first train CLIP-style models for RGB-IR-text alignment, and then fine-tune generative VLMs for dual-modal RGB-IR captioning. Experiments show that training on FusionRS improves RGB-IR alignment, infrared-to-text retrieval, and dual-modal captioning relative to RGB-only and non-IR-aware training settings. Ablation studies further verify that IR-aware captions are crucial for strengthening infrared-language alignment, highlighting the importance of modality-specific textual supervision for scalable RGB-infrared vision-language representation learning in remote sensing.

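The abstract does not state the exact objective used for CLIP-style RGB-IR-text alignment. As a minimal sketch, assuming a symmetric InfoNCE loss of the kind used by CLIP, averaged over the three modality pairs (RGB-text, IR-text, RGB-IR), the alignment step could look like the following. The function names, the equal weighting of the three terms, and the temperature value are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def symmetric_info_nce(a, b, temperature=0.07):
    """CLIP-style contrastive loss between two batches of embeddings.

    Row i of `a` and row i of `b` are treated as the positive pair;
    all other rows in the batch act as negatives.
    """
    a, b = l2_normalize(a), l2_normalize(b)
    logits = a @ b.T / temperature
    idx = np.arange(len(a))
    loss_a_to_b = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_b_to_a = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_a_to_b + loss_b_to_a)

def tri_modal_loss(rgb, ir, text, temperature=0.07):
    """Hypothetical RGB-IR-text objective: equal-weight mean of three pairwise losses."""
    return (
        symmetric_info_nce(rgb, text, temperature)
        + symmetric_info_nce(ir, text, temperature)
        + symmetric_info_nce(rgb, ir, temperature)
    ) / 3.0
```

In this sketch, `rgb`, `ir`, and `text` are arrays of shape `(batch, dim)` produced by whatever encoders are in use; with IR-aware captions, the `text` embeddings paired with `ir` would come from the infrared-specific descriptions rather than the conventional scene captions.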