We present VIXEN, a technique that succinctly summarizes in text the visual differences between a pair of images in order to highlight any content manipulation present. Our proposed network linearly maps image features in a pairwise manner, constructing a soft prompt for a pretrained large language model. We address the challenges of the low volume of training data and the lack of manipulation variety in existing image difference captioning (IDC) datasets by training on synthetically manipulated images from the recent InstructPix2Pix dataset, generated via the prompt-to-prompt editing framework. We augment this dataset with change summaries produced by GPT-3. We show that VIXEN produces state-of-the-art, comprehensible difference captions for diverse image contents and edit types, offering a potential mitigation against misinformation disseminated via manipulated image content. Code and data are available at http://github.com/alexblck/vixen
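The core architectural idea, mapping a pair of image features through a linear projection into a soft prompt for a frozen language model, can be sketched as follows. This is an illustrative outline only, not the authors' implementation: the feature dimensions, the number of prompt tokens, and the function names are assumptions made for the example.

```python
import numpy as np

# Illustrative dimensions (assumptions, not taken from the paper):
D_IMG = 512   # per-image feature dimension from a frozen visual encoder
N_TOK = 8     # number of soft-prompt pseudo-tokens
D_LLM = 768   # embedding dimension of the pretrained language model

rng = np.random.default_rng(0)
# A single learned linear map from the concatenated image-feature pair
# to a sequence of continuous prompt embeddings.
W = rng.standard_normal((2 * D_IMG, N_TOK * D_LLM)) * 0.01

def soft_prompt(feat_a: np.ndarray, feat_b: np.ndarray) -> np.ndarray:
    """Concatenate the two image features and project them linearly into
    N_TOK pseudo-token embeddings, which would be prepended to the
    language model's input sequence."""
    pair = np.concatenate([feat_a, feat_b])   # shape (2 * D_IMG,)
    return (pair @ W).reshape(N_TOK, D_LLM)   # shape (N_TOK, D_LLM)

prompt = soft_prompt(rng.standard_normal(D_IMG), rng.standard_normal(D_IMG))
print(prompt.shape)  # (8, 768)
```

In a soft-prompting setup like this, only the projection is trained while the language model stays frozen; the model then decodes the difference caption conditioned on these continuous prompt tokens.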