The Remote Sensing Copy-Move Question Answering (RSCMQA) task focuses on interpreting complex tampering scenarios and inferring the relationships between objects. Publicly available datasets currently rely largely on randomly generated tampered images, which lack spatial logic and do not meet the practical needs of defense security and land resource monitoring. To address this, we propose Real-RSCM, a high-quality, manually annotated RSCMQA dataset that provides a more realistic basis for evaluating the identification and understanding of remote sensing image tampering. The tampered images in Real-RSCM are subtle, authentic, and challenging, and they place heavy demands on a model's ability to discriminate tampered from untampered content. To meet these challenges, we introduce CM-MMoE, a multimodal gated mixture-of-experts model that guides multiple experts to discern tampered information in images through joint modeling of multi-level visual semantics and text. Extensive experiments demonstrate that CM-MMoE provides a stronger benchmark for the RSCMQA task than general VQA and CMQA models. Our dataset and code are available at https://github.com/shenyedepisa/CM-MMoE.