Multimodal Interpretation of Remote Sensing Images: Dynamic Resolution Input Strategy and Multi-scale Vision-Language Alignment Mechanism

Multimodal fusion of remote sensing images is a core technology for overcoming the limitations of single-source data and improving the accuracy of surface information extraction, with significant application value in fields such as environmental monitoring and urban planning. To address the deficiencies of existing methods, including fixed input resolutions that fail to balance efficiency and detail and single-scale alignment that lacks semantic hierarchy, this study proposes a Vision-Language Model (VLM) framework built on two key innovations: the Dynamic Resolution Input Strategy (DRIS) and the Multi-scale Vision-Language Alignment Mechanism (MS-VLAM). Specifically, DRIS adopts a coarse-to-fine approach that adaptively allocates computational resources according to the complexity of the image content, preserving key fine-grained features while reducing redundant computational overhead. MS-VLAM constructs a three-tier alignment mechanism covering the object, local-region and global levels, which systematically captures cross-modal semantic consistency and alleviates semantic misalignment and granularity imbalance. Experimental results on the RS-GPT4V dataset demonstrate that the proposed framework significantly improves both the accuracy of semantic understanding and computational efficiency in tasks including image captioning and cross-modal retrieval. Compared with conventional methods, it achieves superior performance on evaluation metrics such as BLEU-4 and CIDEr for image captioning, as well as R@10 for cross-modal retrieval. The framework provides a novel approach for constructing efficient and robust multimodal remote sensing systems, laying a theoretical foundation and offering technical guidance for the engineering application of intelligent remote sensing interpretation.
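The abstract does not specify how DRIS scores content complexity or how MS-VLAM combines its three alignment levels, so the following is only a minimal sketch under stated assumptions. Complexity is approximated here by the mean absolute difference between neighbouring pixels of a coarse tile, and the three-level alignment is approximated by a weighted sum of cosine similarities. All function names, the threshold, the resolutions (112 and 448) and the equal weights are hypothetical and are not taken from the paper.

```python
from math import sqrt


def tile_complexity(tile):
    """Assumed complexity proxy: mean absolute difference between
    horizontally and vertically adjacent pixels of a 2D grayscale tile."""
    h, w = len(tile), len(tile[0])
    total, count = 0.0, 0
    for y in range(h):
        for x in range(w):
            if x + 1 < w:
                total += abs(tile[y][x] - tile[y][x + 1])
                count += 1
            if y + 1 < h:
                total += abs(tile[y][x] - tile[y + 1][x])
                count += 1
    return total / count if count else 0.0


def assign_resolutions(tiles, low_res=112, high_res=448, threshold=0.1):
    """Coarse-to-fine allocation: score every tile on the coarse view,
    then request the high resolution only for tiles above the threshold."""
    return [high_res if tile_complexity(t) > threshold else low_res
            for t in tiles]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def multiscale_alignment(visual, text, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Assumed three-level alignment score: weighted sum of the cosine
    similarities at the object, local-region and global levels."""
    levels = ("object", "region", "global")
    return sum(w * cosine(visual[lvl], text[lvl])
               for w, lvl in zip(weights, levels))


if __name__ == "__main__":
    flat = [[0.5] * 4 for _ in range(4)]                      # homogeneous tile
    busy = [[(x + y) % 2 for x in range(4)] for y in range(4)]  # checkerboard
    print(assign_resolutions([flat, busy]))

    emb = {"object": [1.0, 0.0], "region": [0.0, 1.0], "global": [1.0, 1.0]}
    print(multiscale_alignment(emb, emb))
```

In this sketch a homogeneous tile (for example open water) stays at the low resolution while a highly textured tile (for example a dense urban block) is promoted to the high resolution, which is the efficiency-versus-detail trade-off the abstract describes. A trained system would presumably learn both the complexity estimate and the level weights rather than fix them by hand.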