InstructEdit: Improving Automatic Masks for Diffusion-based Image Editing With User Instructions

Recent work has explored text-guided image editing with diffusion models, generating edited images from text prompts. However, these models struggle to accurately locate the regions to be edited and to perform precise edits faithfully. In this work, we propose InstructEdit, a framework that performs fine-grained editing based on user instructions. The framework has three components: a language processor, a segmenter, and an image editor. The language processor uses a large language model to parse the user instruction and output prompts for the segmenter and captions for the image editor. We adopt ChatGPT and optionally BLIP2 for this step. The segmenter takes the segmentation prompt provided by the language processor; we employ Grounded Segment Anything, a state-of-the-art segmentation framework, to automatically generate a high-quality mask from that prompt. The image editor uses the captions from the language processor and the masks from the segmenter to compute the edited image; we adopt Stable Diffusion and the mask-guided generation from DiffEdit for this purpose. Experiments show that our method outperforms previous editing methods in fine-grained editing applications where the input image contains a complex object or multiple objects. We improve the mask quality over DiffEdit and thus improve the quality of the edited images. We also show that our framework accepts multiple forms of user instructions as input. The code is available at https://github.com/QianWangX/InstructEdit.
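
The three-component data flow described above can be illustrated with a minimal sketch. This is not the authors' implementation: the names `ParsedInstruction`, `instruct_edit`, `blend_with_mask`, and the three `toy_*` stand-ins are hypothetical, images are plain nested lists, and the real components (ChatGPT, Grounded Segment Anything, Stable Diffusion with DiffEdit-style mask guidance) are replaced by trivial stubs so that the example runs without any model. It only shows which output feeds which component.

```python
from dataclasses import dataclass
from typing import Callable, List

Image = List[List[float]]  # toy grayscale image, values in [0, 1]
Mask = List[List[int]]     # 1 = pixel may be edited, 0 = keep the original


@dataclass
class ParsedInstruction:
    """Hypothetical container for the language processor's output."""
    segmentation_prompt: str  # what the segmenter should locate
    input_caption: str        # describes the image before the edit
    edited_caption: str       # describes the image after the edit


def blend_with_mask(original: Image, candidate: Image, mask: Mask) -> Image:
    """Take candidate pixels where mask == 1 and original pixels elsewhere."""
    return [
        [c if m else o for o, c, m in zip(o_row, c_row, m_row)]
        for o_row, c_row, m_row in zip(original, candidate, mask)
    ]


def instruct_edit(
    image: Image,
    instruction: str,
    language_processor: Callable[[str], ParsedInstruction],
    segmenter: Callable[[Image, str], Mask],
    image_editor: Callable[[Image, Mask, ParsedInstruction], Image],
) -> Image:
    """Chain the three components: parse, segment, then edit under the mask."""
    parsed = language_processor(instruction)
    mask = segmenter(image, parsed.segmentation_prompt)
    return image_editor(image, mask, parsed)


# Toy stand-ins (assumptions for illustration only, no real models involved).
def toy_language_processor(instruction: str) -> ParsedInstruction:
    # A real system would query a large language model here.
    return ParsedInstruction("dog", "a photo of a dog", "a photo of a cat")


def toy_segmenter(image: Image, prompt: str) -> Mask:
    # Pretend the prompted object occupies the first column of every row.
    return [[1 if col == 0 else 0 for col in range(len(row))] for row in image]


def toy_image_editor(image: Image, mask: Mask, parsed: ParsedInstruction) -> Image:
    # Pretend the generator proposes an all-white image, then keep that
    # proposal only inside the mask so the rest of the image is untouched.
    candidate = [[1.0 for _ in row] for row in image]
    return blend_with_mask(image, candidate, mask)


image = [[0.0, 0.5], [0.2, 0.3]]
result = instruct_edit(
    image,
    "change the dog to a cat",
    toy_language_processor,
    toy_segmenter,
    toy_image_editor,
)
```

Passing the components as callables keeps the sketch independent of any particular model; swapping a stub for a real model would only require matching the assumed call signature. Note that the pixel-level blend is a simplification chosen for readability, whereas the abstract refers to the mask-guided generation from DiffEdit.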
