Change captioning aims to describe, in natural language, the semantic change between a pair of similar images. It is more challenging than general image captioning because it requires capturing fine-grained change information while remaining robust to irrelevant viewpoint changes, and resolving syntactic ambiguity in change descriptions. In this paper, we propose a neighborhood contrastive transformer that improves the model's ability to perceive various changes under different scenes and to comprehend complex syntactic structure. Concretely, we first design a neighboring feature aggregating module that integrates neighboring context into each feature, which helps quickly locate inconspicuous changes under the guidance of conspicuous referents. Then, we devise a common feature distilling module that compares the two images at the neighborhood level and extracts common properties from each image, so as to learn effective contrastive information between them. Finally, we introduce explicit dependencies between words to calibrate the transformer decoder, which helps it better understand complex syntactic structure during training. Extensive experimental results demonstrate that the proposed method achieves state-of-the-art performance on three public datasets with different change scenarios. The code is available at https://github.com/tuyunbin/NCT.
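To make the first two steps concrete, the following is a minimal sketch of the general idea, not the implementation in the linked repository. It assumes each image is a small H x W grid of feature vectors, uses plain 3x3 mean pooling as a stand-in for neighboring feature aggregation, and uses scaled dot-product cross-attention as a stand-in for common feature distilling. The function names (`aggregate_neighbors`, `common_features`, `change_features`) are hypothetical and chosen only for illustration.

```python
import math


def aggregate_neighbors(grid):
    """Replace each cell by the mean of its 3x3 neighborhood (clipped at borders).

    Assumption: mean pooling stands in for the learned aggregation in the paper.
    """
    h, w, c = len(grid), len(grid[0]), len(grid[0][0])
    out = []
    for i in range(h):
        row = []
        for j in range(w):
            cells = [grid[y][x]
                     for y in range(max(0, i - 1), min(h, i + 2))
                     for x in range(max(0, j - 1), min(w, j + 2))]
            row.append([sum(v[k] for v in cells) / len(cells) for k in range(c)])
        out.append(row)
    return out


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def common_features(query, other):
    """For each query vector, return the attention-weighted sum of `other`.

    Assumption: cross-attention stands in for the common feature distilling step.
    """
    scale = math.sqrt(len(query[0]))
    result = []
    for q in query:
        scores = [sum(a * b for a, b in zip(q, o)) / scale for o in other]
        weights = softmax(scores)
        result.append([sum(wt * o[k] for wt, o in zip(weights, other))
                       for k in range(len(q))])
    return result


def change_features(before, after):
    """Return per-location difference features for both images.

    Each image's features are compared with what it shares with the other
    image; the residual is treated as the contrastive (change) signal.
    """
    fb = [v for row in aggregate_neighbors(before) for v in row]
    fa = [v for row in aggregate_neighbors(after) for v in row]
    common_b = common_features(fb, fa)
    common_a = common_features(fa, fb)
    diff_b = [[x - y for x, y in zip(f, c)] for f, c in zip(fb, common_b)]
    diff_a = [[x - y for x, y in zip(f, c)] for f, c in zip(fa, common_a)]
    return diff_b, diff_a
```

In this sketch, two identical images yield near-zero difference features, since everything in one image is recoverable from the other. A real model would replace the fixed pooling and attention with learned layers and feed the difference features to a caption decoder.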