Remote sensing image change captioning (RSICC) aims to describe, in natural language, the changes between bi-temporal remote sensing images. Because the change captioning (CC) task requires both spatial and temporal features, previous works follow an encoder-fusion-decoder architecture: an image encoder extracts spatial features, and a fusion module integrates those spatial features and extracts temporal features. This has led to increasingly complex, manually designed fusion modules. In this paper, we introduce a novel video-model-based paradigm that requires no fusion module design and propose a Mask-enhanced Video model for Change Caption (MV-CC). Specifically, we use an off-the-shelf video encoder to extract the temporal and spatial features of the bi-temporal images simultaneously. Furthermore, because the types of change of interest in CC are set by specific task requirements, we employ masks obtained from a Change Detection (CD) method to explicitly guide the CC model so that it focuses better on the regions of interest. Experimental results demonstrate that the proposed method outperforms other state-of-the-art RSICC methods. The code is available at https://github.com/liuruixun/MV-CC.
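As a rough illustration of the video-based paradigm, the sketch below stacks a bi-temporal image pair into a two-frame clip, which is the input shape a generic video encoder would expect, and then re-weights the clip with a change mask. The abstract does not state how the CD masks are injected into the model, so the function names (`make_clip`, `apply_change_mask`), the pixel-level weighting scheme, and the `floor` value are all assumptions made for illustration and are not taken from the MV-CC code.

```python
import numpy as np


def make_clip(img_before, img_after):
    """Stack two co-registered (H, W, C) images into a (T=2, H, W, C) clip."""
    if img_before.shape != img_after.shape:
        raise ValueError("bi-temporal images must share one shape")
    return np.stack([img_before, img_after], axis=0)


def apply_change_mask(clip, mask, floor=0.5):
    """Down-weight pixels outside the change mask.

    Hypothetical guidance scheme, not the paper's method.
    mask:  (H, W) array with values in {0, 1} from some change detection method.
    floor: weight kept for unchanged pixels (illustrative value).
    """
    weights = floor + (1.0 - floor) * mask.astype(np.float32)  # shape (H, W)
    # Broadcast the weights over the time and channel axes.
    return clip.astype(np.float32) * weights[None, :, :, None]


# Toy data standing in for a bi-temporal pair and a CD mask.
rng = np.random.default_rng(0)
before = rng.random((4, 4, 3), dtype=np.float32)
after = rng.random((4, 4, 3), dtype=np.float32)
mask = np.zeros((4, 4), dtype=np.uint8)
mask[1:3, 1:3] = 1  # pretend the centre 2x2 region changed

clip = make_clip(before, after)         # two-frame "video"
guided = apply_change_mask(clip, mask)  # same shape, masked emphasis
```

In this sketch the temporal axis has length two, so a video encoder can attend across both dates directly instead of relying on a separate fusion module; where and how the mask enters the real model may differ.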