The rise in quality of generative models over the past years has enabled the generation of edited variations of images at scale. To counter the harmful effects of such technology, the Image Difference Captioning (IDC) task aims to describe the differences between two images. While this task is handled successfully for simple 3D-rendered images, it struggles on real-world images. The reason is twofold: the scarcity of training data, and the difficulty of capturing fine-grained differences between complex images. To address these issues, we propose in this paper a simple yet effective framework to both adapt existing image captioning models to the IDC task and augment IDC datasets. We introduce BLIP2IDC, an adaptation of BLIP2 to the IDC task at low computational cost, and show that it outperforms two-stream approaches by a significant margin on real-world IDC datasets. We also propose to use synthetic augmentation to improve the performance of IDC models in a model-agnostic fashion. We show that our synthetic augmentation strategy provides high-quality data, leading to a challenging new dataset well suited for IDC, named Syned1.