Change captioning aims to succinctly describe the semantic change between a pair of similar images while remaining immune to distractors (illumination and viewpoint changes). Under such distractors, unchanged objects often exhibit pseudo changes in location and scale, and certain objects may overlap others, yielding perturbed, discrimination-degraded features between the two images. However, most existing methods directly capture the difference between the two images, which risks obtaining error-prone difference features. In this paper, we propose a distractors-immune representation learning network that correlates the corresponding channels of the two image representations and decorrelates different ones in a self-supervised manner, thus attaining a pair of image representations that are stable under distractors. The model can then better interact the two representations to capture reliable difference features for caption generation. To generate words based on the most relevant difference features, we further design a cross-modal contrastive regularization, which regularizes the cross-modal alignment by maximizing the contrastive alignment between the attended difference features and the generated words. Extensive experiments show that our method outperforms state-of-the-art methods on four public datasets. The code is available at https://github.com/tuyunbin/DIRL.
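The two objectives described above can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the first function assumes a Barlow-Twins-style cross-correlation loss as one plausible reading of "correlate corresponding channels and decorrelate different ones", and the second assumes an InfoNCE-style form for the cross-modal contrastive regularization; all names, shapes, and hyperparameters (`lam`, `tau`) are illustrative.

```python
import numpy as np

def decorrelation_loss(z1, z2, lam=0.005):
    """Correlate corresponding channels of two image representations and
    decorrelate different ones (Barlow-Twins-style sketch; the exact form
    and `lam` are assumptions, not the paper's formulation).

    z1, z2: (N, C) arrays of per-channel features for the image pair.
    """
    # Standardize each channel across the batch so the cross-correlation
    # matrix entries are genuine correlation coefficients.
    z1 = (z1 - z1.mean(0)) / (z1.std(0) + 1e-8)
    z2 = (z2 - z2.mean(0)) / (z2.std(0) + 1e-8)
    c = z1.T @ z2 / z1.shape[0]                          # (C, C) cross-correlation
    on_diag = ((np.diag(c) - 1.0) ** 2).sum()            # pull matching channels to 1
    off_diag = (c ** 2).sum() - (np.diag(c) ** 2).sum()  # push the rest to 0
    return on_diag + lam * off_diag

def contrastive_alignment_loss(diff_feats, word_embs, tau=0.07):
    """InfoNCE-style sketch of the cross-modal contrastive regularization:
    each generated word should align with its own attended difference
    feature and contrast against the features attended for other words.

    diff_feats, word_embs: (T, D) arrays for a caption of T words.
    """
    d = diff_feats / np.linalg.norm(diff_feats, axis=1, keepdims=True)
    w = word_embs / np.linalg.norm(word_embs, axis=1, keepdims=True)
    logits = w @ d.T / tau                               # (T, T) word-feature similarities
    m = logits.max(axis=1, keepdims=True)                # numerically stable log-softmax
    log_probs = logits - (m + np.log(np.exp(logits - m).sum(axis=1, keepdims=True)))
    return -np.mean(np.diag(log_probs))                  # match word t to feature t
```

Under this reading, the decorrelation loss is minimized when the two representations agree channel-by-channel while each channel carries non-redundant information, and the alignment loss is minimized when every word's embedding is closest to the difference feature it attended to.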