Remote sensing image change captioning (RSICC) aims to generate human-like language that describes the semantic changes between bi-temporal remote sensing image pairs, providing valuable insights into environmental dynamics and land management. Unlike the conventional change captioning task, RSICC requires not only retrieving relevant information across modalities and generating fluent captions, but also mitigating the impact of pixel-level differences on terrain change localization: pixel-level discrepancies caused by the long time span between acquisitions degrade the accuracy of the generated captions. Inspired by the remarkable generative power of diffusion models, we propose a probabilistic diffusion model for RSICC, Diffusion-RSCC, to address these problems. During training, we construct a noise predictor conditioned on cross-modal features to learn the mapping from the real caption distribution to the standard Gaussian distribution under a Markov chain. In addition, a cross-modal fusion module and a stacking self-attention module are designed for the noise predictor in the reverse process. During testing, the well-trained noise predictor estimates the mean of the distribution at each step and generates change captions step by step. Extensive experiments on the LEVIR-CC dataset demonstrate the effectiveness of Diffusion-RSCC and its individual components. Quantitative results show superior performance over existing methods on both traditional and newly augmented metrics. The code and materials will be available online at https://github.com/Fay-Y/Diffusion-RSCC.
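The forward noising and step-by-step reverse denoising described above follow the standard DDPM formulation. The sketch below illustrates that data flow on a toy caption embedding; it is a minimal illustration, not the paper's implementation, and the `noise_predictor` stub stands in for the actual cross-modal-conditioned network (which in the paper uses the cross-modal fusion and stacking self-attention modules).

```python
import numpy as np

# Minimal DDPM-style sketch (assumption: linear beta schedule, toy dimensions).
T = 50
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def q_sample(x0, t, eps):
    """Forward process: noise a clean caption embedding x0 to step t."""
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps

def noise_predictor(x_t, t, cond):
    """Stand-in for the cross-modal-conditioned network (hypothetical stub)."""
    return np.zeros_like(x_t)

def p_sample(x_t, t, cond, rng):
    """One reverse step: estimate the posterior mean from the predicted noise."""
    eps_hat = noise_predictor(x_t, t, cond)
    mean = (x_t - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
    if t == 0:
        return mean
    return mean + np.sqrt(betas[t]) * rng.standard_normal(x_t.shape)

rng = np.random.default_rng(0)
x0 = rng.standard_normal(16)        # toy caption embedding
eps = rng.standard_normal(16)
x_noisy = q_sample(x0, T - 1, eps)  # training pair: predictor learns eps from x_noisy

# Sampling: start from Gaussian noise and denoise step by step.
x = rng.standard_normal(16)
cond = None                         # placeholder for fused bi-temporal image features
for t in reversed(range(T)):
    x = p_sample(x, t, cond, rng)
print(x.shape)  # (16,)
```

In the actual model, the denoised vector would be decoded back into caption tokens; here only the shape of the sampled embedding is shown.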