Remote sensing image change captioning (RSICC) aims to automatically generate sentences that describe the content differences between bitemporal remote sensing images. Attention-based transformers have recently become a prevalent choice for capturing global change features. However, existing transformer-based RSICC methods suffer from large parameter counts and high computational complexity, both caused by the self-attention operation in the transformer encoder. To alleviate these issues, this paper proposes a Sparse Focus Transformer (SFT) for the RSICC task. The SFT network consists of three main components: a high-level feature extractor based on a convolutional neural network (CNN); a transformer encoder network, built on a sparse focus attention mechanism, that locates and captures changed regions in the bitemporal images; and a description decoder that embeds images and words to generate sentences describing the differences. By incorporating a sparse attention mechanism into the transformer encoder network, the proposed SFT network reduces both the number of parameters and the computational complexity. Experimental results on various datasets demonstrate that, even with a reduction of over 90% in the parameters and computational complexity of the transformer encoder, the proposed network still achieves competitive performance compared with other state-of-the-art RSICC methods. The code is available at https://github.com/love-ray/SFT.
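
To make the complexity argument concrete, the sketch below compares dense self-attention with a generic local-window sparse pattern and counts how many query-key scores each one computes. It is an illustration only: the abstract does not specify how the sparse focus attention selects keys, so the windowing rule, the function name `windowed_attention`, the `radius` parameter, and the toy token values are all assumptions, not the SFT implementation. It also illustrates only the computational saving, not the parameter reduction.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def windowed_attention(queries, keys, values, radius=None):
    """Scaled dot-product attention with an optional local window.

    Hypothetical sparsity rule (an assumption, not the SFT rule): query i
    attends only to keys whose index lies within `radius` of i.
    radius=None gives ordinary dense attention.
    Returns (outputs, number_of_scores_computed).
    """
    n = len(queries)
    scale = math.sqrt(len(queries[0]))
    dim = len(values[0])
    outputs, n_scores = [], 0
    for i, q in enumerate(queries):
        if radius is None:
            idx = range(n)
        else:
            idx = range(max(0, i - radius), min(n, i + radius + 1))
        # One dot product per (query, visible key) pair.
        scores = [sum(a * b for a, b in zip(q, keys[j])) / scale for j in idx]
        n_scores += len(scores)
        weights = softmax(scores)
        outputs.append(
            [sum(w * values[j][d] for w, j in zip(weights, idx)) for d in range(dim)]
        )
    return outputs, n_scores


# Toy input: 8 tokens of dimension 2 (illustrative values only).
tokens = [[float(i), 1.0] for i in range(8)]
dense_out, dense_scores = windowed_attention(tokens, tokens, tokens)
sparse_out, sparse_scores = windowed_attention(tokens, tokens, tokens, radius=1)
print(dense_scores, sparse_scores)
```

For N tokens, dense attention computes N^2 scores, while this window computes at most (2 * radius + 1) * N, so the saving grows with the sequence length. The actual savings reported for SFT depend on its own sparsity design, which this sketch does not reproduce.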