VcT: Visual change Transformer for Remote Sensing Image Change Detection

Existing visual change detectors usually adopt CNNs or Transformers for feature representation learning and focus on learning effective representations of the changed regions between images. Although enhancing the features of the changed regions yields good performance, these methods remain limited, mainly because they neglect the unchanged background context. A main challenge in change detection is obtaining consistent representations for two images that differ in factors such as spatial variation and sunlight intensity. In this work, we demonstrate that carefully mining the common background information provides an important cue for learning consistent representations of the two images, which in turn facilitates visual change detection. Based on this observation, we propose a novel Visual change Transformer (VcT) model for visual change detection. Specifically, a shared backbone network first extracts feature maps for the given image pair. Each pixel of the feature map is then regarded as a graph node, and a graph neural network models the structured information to predict a coarse change map. Top-K reliable tokens are mined from this map and refined with a clustering algorithm. These reliable tokens are then enhanced, first through self- and cross-attention schemes and then by interacting with the original features via an anchor-primary attention learning module. Finally, a prediction head produces a more accurate change map. Extensive experiments on multiple benchmark datasets validate the effectiveness of the proposed VcT model.
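
The token-mining step described above (Top-K selection from the coarse change map, followed by clustering-based refinement) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not state how reliability is scored or which clustering algorithm is used, so the sketch assumes reliability is the distance of the coarse change probability from 0.5 and uses plain k-means for refinement. The function names `select_reliable_tokens` and `refine_by_kmeans` are hypothetical.

```python
import numpy as np


def select_reliable_tokens(features, change_prob, k):
    """Pick the k feature vectors whose coarse prediction is most confident.

    features:    (H, W, C) feature map from the shared backbone
    change_prob: (H, W) coarse change probabilities in [0, 1]

    Assumption: reliability is |p - 0.5|, i.e. how far the coarse
    prediction is from an undecided value. The paper may use another score.
    """
    h, w, c = features.shape
    flat_feats = features.reshape(h * w, c)
    reliability = np.abs(change_prob.reshape(h * w) - 0.5)
    # Sort ascending, reverse, and keep the k most reliable positions.
    top_idx = np.argsort(reliability, kind="stable")[::-1][:k]
    return flat_feats[top_idx], top_idx


def refine_by_kmeans(tokens, num_clusters, num_iters=10, seed=0):
    """Refine tokens into cluster centers with plain k-means.

    Assumption: k-means stands in for the unspecified clustering algorithm.
    tokens: (N, C) float array with N >= num_clusters.
    """
    rng = np.random.default_rng(seed)
    init = rng.choice(len(tokens), size=num_clusters, replace=False)
    centers = tokens[init].copy()
    for _ in range(num_iters):
        # Squared Euclidean distance from every token to every center.
        diffs = tokens[:, None, :] - centers[None, :, :]
        dists = (diffs ** 2).sum(axis=-1)
        assign = dists.argmin(axis=1)
        for j in range(num_clusters):
            members = tokens[assign == j]
            if len(members) > 0:  # keep the old center if a cluster is empty
                centers[j] = members.mean(axis=0)
    return centers
```

Under these assumptions, `select_reliable_tokens(feats, prob, k)` returns the selected `(k, C)` tokens together with their flattened pixel indices, and `refine_by_kmeans(tokens, m)` condenses them into `m` representative tokens that later attention stages could consume.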
