Remote sensing image change captioning (RSICC) aims to automatically generate sentences that describe content differences in bitemporal remote sensing images. Recently, attention-based transformers have become a prevalent approach for capturing global change features. However, existing transformer-based RSICC methods face challenges such as large parameter counts and high computational complexity, caused by the self-attention operation in the transformer encoder. To alleviate these issues, this paper proposes a Sparse Focus Transformer (SFT) for the RSICC task. Specifically, the SFT network consists of three main components: (1) a high-level feature extractor based on a convolutional neural network (CNN); (2) a transformer encoder with a sparse focus attention mechanism, designed to locate and capture changed regions in the bitemporal images; and (3) a description decoder that embeds images and words to generate sentences describing the differences. By incorporating the sparse attention mechanism within the transformer encoder, the proposed SFT network reduces both the parameter count and the computational complexity. Experimental results on several datasets demonstrate that, even with a reduction of over 90\% in the parameters and computational complexity of the transformer encoder, our network still achieves performance competitive with state-of-the-art RSICC methods. The code is available at \href{https://github.com/sundongwei/SFT_chag2cap}{Lite\_Chag2cap}.
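To illustrate the complexity argument, the sketch below contrasts standard dense self-attention, whose score matrix costs O(n^2) in the sequence length n, with a simple local-window sparse attention that restricts each query to a fixed neighborhood of keys, costing O(n·w). This is a minimal, hypothetical sketch in NumPy; the actual sparse focus attention in SFT is defined in the paper and the linked repository, and the windowed scheme here is only one common sparsity pattern used for illustration.

```python
import numpy as np

def dense_attention(Q, K, V):
    # Standard scaled dot-product attention: builds the full n x n score matrix.
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def sparse_window_attention(Q, K, V, window=4):
    # Each query attends only to keys within +/- `window` positions,
    # reducing score computation from O(n^2) to O(n * window).
    n, d = Q.shape
    out = np.empty_like(V)
    for i in range(n):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        s = Q[i] @ K[lo:hi].T / np.sqrt(d)
        w = np.exp(s - s.max())
        w /= w.sum()
        out[i] = w @ V[lo:hi]
    return out
```

When the window covers the whole sequence, the sparse variant reduces to dense attention; shrinking the window trades global context for the parameter/compute savings the abstract refers to.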