Natural language instructions are a powerful interface for editing the outputs of text-to-image diffusion models. However, several challenges must be addressed: 1) underspecification (the need to model the implicit meaning of instructions), 2) grounding (the need to localize where the edit has to be performed), and 3) faithfulness (the need to preserve the elements of the image not affected by the edit instruction). Current approaches to image editing with natural language instructions rely on automatically generated paired data, which, as our investigation shows, is noisy and sometimes nonsensical, exacerbating the above issues. Building on recent advances in segmentation, Chain-of-Thought prompting, and visual question answering, we significantly improve the quality of the paired data. In addition, we enhance the supervision signal by highlighting the parts of the image that the instruction requires to be changed. As shown by automatic and human evaluations, the model fine-tuned on the improved data performs fine-grained object-centric edits better than state-of-the-art baselines, mitigating the problems outlined above. Moreover, our model generalizes to domains unseen during training, such as visual metaphors.