The combination of language processing and image processing continues to attract growing interest, driven by recent advances that leverage the combined strengths of both research domains. Among these advances, editing an image solely on the basis of a natural language instruction stands out as particularly challenging. While recent approaches to this task rely, in one way or another, on some form of preliminary preparation, training, or fine-tuning, this paper explores a novel approach: we propose a preparation-free method that permits instruction-guided image editing on the fly. The approach is organized in three orchestrated steps: image captioning and DDIM inversion, followed by obtaining the edit direction embedding, followed by the image editing proper. Although it dispenses with preliminary preparation, our approach proves effective and competitive, outperforming recent state-of-the-art models for this task when evaluated on the MAGICBRUSH dataset.
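To make two of the ingredients named above concrete, the following is a minimal toy sketch, not the paper's implementation. It shows the standard deterministic DDIM update (eta = 0) run forward as an inversion and backward as sampling, plus one common way to form an edit direction, namely the difference between mean target and mean source text embeddings. That choice of edit direction is an assumption here; the abstract does not say how the direction is computed. The names `ddim_move`, `ddim_invert`, `ddim_sample`, `edit_direction`, and the constant noise predictor in the usage note are all hypothetical, and plain Python lists stand in for latents and embeddings.

```python
import math


def ddim_move(x, eps, abar_from, abar_to):
    """Deterministic DDIM update between two noise levels (eta = 0).

    abar_* are cumulative alpha products; eps is the predicted noise.
    """
    # Estimate the clean sample implied by x and the predicted noise.
    x0 = [(xi - math.sqrt(1.0 - abar_from) * ei) / math.sqrt(abar_from)
          for xi, ei in zip(x, eps)]
    # Re-noise that estimate to the destination noise level.
    return [math.sqrt(abar_to) * x0i + math.sqrt(1.0 - abar_to) * ei
            for x0i, ei in zip(x0, eps)]


def ddim_invert(x0, eps_model, abars):
    """Map a clean latent to a noisy one; abars runs from clean to noisy."""
    x = list(x0)
    for t in range(len(abars) - 1):
        x = ddim_move(x, eps_model(x, t), abars[t], abars[t + 1])
    return x


def ddim_sample(x_noisy, eps_model, abars):
    """Run the same update in the opposite direction, noisy to clean."""
    x = list(x_noisy)
    for t in range(len(abars) - 1, 0, -1):
        x = ddim_move(x, eps_model(x, t), abars[t], abars[t - 1])
    return x


def edit_direction(source_embs, target_embs):
    """Assumed edit direction: mean(target embeddings) - mean(source embeddings)."""
    def mean(vectors):
        return [sum(column) / len(vectors) for column in zip(*vectors)]
    return [t - s for s, t in zip(mean(source_embs), mean(target_embs))]
```

A usage sketch with a toy noise predictor that ignores its inputs, so the round trip is exact up to floating-point error; with a real diffusion model the inversion is only approximate:

```python
toy_eps = lambda x, t: [0.3, -0.2]      # hypothetical stand-in for a noise network
abars = [0.999, 0.9, 0.5, 0.1]          # illustrative schedule, clean to noisy
noisy = ddim_invert([1.0, -0.5], toy_eps, abars)
restored = ddim_sample(noisy, toy_eps, abars)
```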