The development of vision-language and generative models has significantly advanced text-guided image editing, which seeks the \textit{preservation} of core elements in the source image while implementing \textit{modifications} based on the target text. However, existing metrics suffer from a \textbf{context-blindness} problem: they indiscriminately apply the same evaluation criteria to completely different pairs of source image and target text, biasing the evaluation towards either modification or preservation. Directional CLIP similarity, the only metric that considers both the source image and the target text, is likewise biased towards modification and attends to image regions irrelevant to the edit. We propose \texttt{AugCLIP}, a \textbf{context-aware} metric that adaptively balances preservation and modification according to the specific context of a given source image and target text. It does so by deriving the CLIP representation of an ideally edited image, one that preserves the source image while making only the modifications necessary to align with the target text. More specifically, \texttt{AugCLIP} uses a multi-modal large language model to augment the textual descriptions of the source and target, then computes a modification vector from a hyperplane that separates source and target attributes in CLIP space. Extensive experiments on five benchmark datasets, covering a diverse range of editing scenarios, show that \texttt{AugCLIP} aligns remarkably well with human evaluation standards, outperforming existing metrics. The code will be open-sourced for community use.
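The core geometric idea can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: real attribute embeddings would come from CLIP's text and image encoders, and the actual hyperplane would be fit with a proper classifier, whereas here a difference-of-class-means direction stands in for the separating hyperplane's normal. All function names (\texttt{modification\_vector}, \texttt{ideal\_edit\_score}) and the scalar step size \texttt{alpha} are illustrative assumptions.

```python
import numpy as np

def modification_vector(src_feats, tgt_feats):
    # src_feats, tgt_feats: (n, d) arrays of embeddings for the augmented
    # source / target attribute descriptions (e.g., from a CLIP text encoder).
    # Simplification: use the difference of class means as the normal of a
    # hyperplane separating source and target attributes in embedding space.
    w = tgt_feats.mean(axis=0) - src_feats.mean(axis=0)
    return w / np.linalg.norm(w)

def ideal_edit_score(edited_img_feat, src_img_feat, mod_vec, alpha=1.0):
    # Hypothetical scoring step: form an "ideal" edited representation by
    # shifting the source image embedding along the modification direction,
    # then measure cosine similarity between it and the edited image.
    ideal = src_img_feat + alpha * mod_vec
    ideal = ideal / np.linalg.norm(ideal)
    edited = edited_img_feat / np.linalg.norm(edited_img_feat)
    return float(edited @ ideal)
```

In this simplified picture, a higher score means the edited image both stays close to the source (preservation) and has moved along the attribute direction implied by the target text (modification), which is the trade-off the metric is designed to balance.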