Text-guided image editing is widely needed in daily life, ranging from personal use to professional applications such as Photoshop. However, existing methods are either zero-shot or trained on automatically synthesized datasets, which contain a high volume of noise. Thus, they still require lots of manual tuning to produce desirable outcomes in practice. To address this issue, we introduce MagicBrush (https://osu-nlp-group.github.io/MagicBrush/), the first large-scale, manually annotated dataset for instruction-guided real image editing that covers diverse scenarios: single-turn, multi-turn, mask-provided, and mask-free editing. MagicBrush comprises over 10K manually annotated triplets (source image, instruction, target image), which supports training large-scale text-guided image editing models. We fine-tune InstructPix2Pix on MagicBrush and show that the new model can produce much better images according to human evaluation. We further conduct extensive experiments to evaluate current image editing baselines from multiple dimensions, including quantitative, qualitative, and human evaluations. The results reveal the challenging nature of our dataset and the gap between current baselines and real-world editing needs.