Instruction-based image editing holds immense potential for a variety of applications, as it enables users to perform any editing operation using a natural language instruction. However, current models in this domain often struggle to execute user instructions accurately. We present Emu Edit, a multi-task image editing model that sets state-of-the-art results in instruction-based image editing. To develop Emu Edit, we train it to multi-task across an unprecedented range of tasks, such as region-based editing, free-form editing, and computer vision tasks, all of which are formulated as generative tasks. Additionally, to enhance Emu Edit's multi-task learning abilities, we provide it with learned task embeddings that guide the generation process towards the correct edit type. Both of these elements are essential for Emu Edit's outstanding performance. Furthermore, we show that Emu Edit can generalize to new tasks, such as image inpainting, super-resolution, and compositions of editing tasks, with just a few labeled examples. This capability offers a significant advantage in scenarios where high-quality samples are scarce. Lastly, to facilitate a more rigorous and informed assessment of instructable image editing models, we release a new challenging and versatile benchmark that includes seven different image editing tasks.
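To make the idea of learned task embeddings concrete, the following is a minimal sketch and not the Emu Edit implementation. It assumes one vector per task, looked up by task name and added to a conditioning vector. The task names, the dimension `EMBED_DIM`, the class `TaskEmbeddingTable`, and the function `condition` are all hypothetical and chosen for illustration; in a real model the vectors would be trained jointly with the network rather than left at their random initialization.

```python
import random

# Hypothetical task names loosely based on the task families named in the abstract.
TASKS = ["region_based_editing", "free_form_editing", "computer_vision"]
EMBED_DIM = 8  # illustrative size, not taken from the paper


class TaskEmbeddingTable:
    """Holds one vector per task; in training these would be learnable parameters."""

    def __init__(self, tasks, dim, seed=0):
        rng = random.Random(seed)
        self.dim = dim
        # Small random initialization, one vector per task.
        self.vectors = {
            task: [rng.gauss(0.0, 0.02) for _ in range(dim)] for task in tasks
        }

    def lookup(self, task):
        """Return the embedding vector for a known task name."""
        return self.vectors[task]


def condition(base, task_vector):
    """Combine a conditioning vector with a task embedding by elementwise addition.

    Addition is an assumption made for this sketch; it is one simple way to
    inject a task signal into the conditioning of a generative model.
    """
    if len(base) != len(task_vector):
        raise ValueError("conditioning vector and task embedding must match in size")
    return [b + t for b, t in zip(base, task_vector)]


table = TaskEmbeddingTable(TASKS, EMBED_DIM)
base_conditioning = [0.0] * EMBED_DIM  # stand-in for an instruction or timestep embedding
conditioned = condition(base_conditioning, table.lookup("free_form_editing"))
```

Under this sketch, the same instruction paired with a different task name yields a different conditioning vector, which is the sense in which a task embedding can steer generation towards a particular edit type.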