Existing methods for Remote Sensing Image Change Captioning (RSICC) perform well in simple scenes but degrade in complex ones. This limitation is primarily attributed to the model's limited visual ability to distinguish and locate changes. Given the inherent correlation between change detection (CD) and RSICC, we argue that pixel-level CD is important for describing the differences between images in language. However, the current RSICC dataset lacks readily available pixel-level CD labels. To address this gap, we use a model trained on existing CD datasets to derive CD pseudo-labels. We propose a network with an auxiliary CD branch supervised by these pseudo-labels. We further propose a semantic fusion augment (SFA) module that fuses the feature information extracted by the CD branch, facilitating fine-grained description of changes. Experiments demonstrate that our method achieves state-of-the-art performance and validate that learning from pixel-level CD pseudo-labels contributes significantly to change captioning. Our code will be available at: https://github.com/Chen-Yang-Liu/Pix4Cap
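The idea of supervising an auxiliary CD branch with pseudo-labels alongside the captioning objective can be illustrated with a small joint-loss sketch. The abstract does not state the loss functions or how the two terms are combined, so everything here is an assumption made for illustration: cross-entropy for both tasks, a weighted sum, and the hypothetical names `joint_loss` and `cd_weight`.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target):
    """Negative log-likelihood of the target class under softmax(logits)."""
    return -math.log(softmax(logits)[target])


def mean_cross_entropy(batch_logits, targets):
    """Average cross-entropy over a batch of (logits, target) pairs."""
    losses = [cross_entropy(l, t) for l, t in zip(batch_logits, targets)]
    return sum(losses) / len(losses)


def joint_loss(caption_logits, caption_tokens,
               cd_logits, cd_pseudo_labels, cd_weight=1.0):
    """Hypothetical multi-task objective (assumed form, not the paper's).

    caption_logits / caption_tokens: per-word scores and ground-truth word ids.
    cd_logits / cd_pseudo_labels: per-pixel class scores and pseudo-labels
    produced by a separately trained CD model.
    cd_weight: assumed scalar balancing the auxiliary CD term.
    """
    caption_loss = mean_cross_entropy(caption_logits, caption_tokens)
    cd_loss = mean_cross_entropy(cd_logits, cd_pseudo_labels)
    return caption_loss + cd_weight * cd_loss, caption_loss, cd_loss


# Toy inputs: one caption token over a 3-word vocabulary,
# two pixels with binary change / no-change scores.
total, cap, cd = joint_loss(
    caption_logits=[[0.0, 0.0, 0.0]],
    caption_tokens=[1],
    cd_logits=[[0.0, 0.0], [0.0, 0.0]],
    cd_pseudo_labels=[0, 1],
    cd_weight=0.5,
)
```

In this sketch the pseudo-labels play exactly the role of ground-truth masks in the CD term, so any noise in them propagates into the gradient; a weight such as `cd_weight` is one simple way to limit that influence.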