We present VIXEN, a technique that succinctly summarizes in text the visual differences between a pair of images in order to highlight any content manipulation present. Our proposed network linearly maps image features in a pairwise manner, constructing a soft prompt for a pretrained large language model. We address the scarcity of training data and the limited variety of manipulations in existing image difference captioning (IDC) datasets by training on synthetically manipulated images from the recent InstructPix2Pix dataset, which was generated via the prompt-to-prompt editing framework. We augment this dataset with change summaries produced via GPT-3. We show that VIXEN produces state-of-the-art, comprehensible difference captions for diverse image contents and edit types, offering a potential mitigation against misinformation disseminated via manipulated image content. Code and data are available at http://github.com/alexblck/vixen
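To make the pairwise linear mapping concrete, the following is a minimal sketch of how a soft prompt might be assembled from the features of two images. It is an illustration under stated assumptions, not the released implementation: the function name `build_soft_prompt`, the per-token concatenation of the two feature sets, the single shared projection, and all dimensions are hypothetical choices, and random arrays stand in for the outputs of a real image encoder and for the language model's token embeddings.

```python
import numpy as np


def build_soft_prompt(feat_a, feat_b, weight, bias):
    """Linearly map paired image features into the LLM embedding space.

    feat_a, feat_b: (n_tokens, feat_dim) features of the two images.
    weight:         (2 * feat_dim, llm_dim) projection matrix (assumed shape).
    bias:           (llm_dim,) projection bias.
    Returns an (n_tokens, llm_dim) array of soft-prompt vectors.
    """
    # Assumption: features are paired token by token via concatenation.
    paired = np.concatenate([feat_a, feat_b], axis=-1)
    return paired @ weight + bias


# Hypothetical sizes, chosen only to keep the example small.
n_tokens, feat_dim, llm_dim = 4, 8, 16
rng = np.random.default_rng(0)

# Stand-ins for image-encoder outputs of the original and edited image.
feat_a = rng.standard_normal((n_tokens, feat_dim))
feat_b = rng.standard_normal((n_tokens, feat_dim))

# Stand-ins for the learned projection parameters.
weight = rng.standard_normal((2 * feat_dim, llm_dim)) * 0.02
bias = np.zeros(llm_dim)

soft_prompt = build_soft_prompt(feat_a, feat_b, weight, bias)

# The soft prompt is prepended to the embeddings of any text tokens
# before the sequence is passed to the (frozen) language model.
text_embeds = rng.standard_normal((3, llm_dim))
llm_input = np.concatenate([soft_prompt, text_embeds], axis=0)
```

In this sketch only `weight` and `bias` would be trained, which is one plausible reading of how a linear mapping can adapt a pretrained language model to the limited data available for image difference captioning.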