The recent development of pre-trained vision language foundation models (VLFMs) has led to remarkable performance on many tasks. However, these models tend to have strong single-image understanding capability but lack the ability to understand multiple images. They therefore cannot be applied directly to image change understanding (ICU), which requires models to capture actual changes between multiple images and describe them in language. In this paper, we find that existing VLFMs perform poorly when applied directly to ICU for two reasons: (1) VLFMs generally learn a global representation of a single image, whereas ICU requires capturing nuances between multiple images. (2) The ICU performance of VLFMs is significantly affected by viewpoint variations, because the relationships between objects are altered when the viewpoint changes. To address these problems, we propose a Viewpoint Integration and Registration method. Concretely, we introduce a fused adapter image encoder that fine-tunes pre-trained encoders by inserting designed trainable adapters and fused adapters, so as to effectively capture nuances between image pairs. Additionally, a viewpoint registration flow and a semantic emphasizing module are designed to reduce the performance degradation caused by viewpoint variations in the visual and semantic spaces, respectively. Experimental results on CLEVR-Change and Spot-the-Diff demonstrate that our method achieves state-of-the-art performance on all metrics.