The image inpainting task refers to erasing unwanted pixels from an image and filling them in a semantically consistent and realistic way. Traditionally, the pixels to be erased are specified with binary masks. From an application point of view, this means the user must generate a mask for each object they want to remove, which can be time-consuming and error-prone. In this work, we are interested in an image inpainting algorithm that estimates which object to remove from natural language input and removes it simultaneously. For this purpose, we first construct a dataset for this task, named GQA-Inpaint, which will be released soon. Second, we present a novel inpainting framework, Inst-Inpaint, that removes objects from images based on instructions given as text prompts. We set up various GAN- and diffusion-based baselines and run experiments on synthetic and real image datasets. We compare the methods using several evaluation metrics that measure the quality and accuracy of the models, and show significant quantitative and qualitative improvements.