Text-guided image editing is in wide demand, from personal use to professional applications such as Photoshop. However, existing methods are either zero-shot or trained on automatically synthesized datasets, which contain a high volume of noise. As a result, they still require substantial manual tuning to produce desirable outcomes in practice. To address this issue, we introduce MagicBrush (https://osu-nlp-group.github.io/MagicBrush/), the first large-scale, manually annotated dataset for instruction-guided real image editing that covers diverse scenarios: single-turn, multi-turn, mask-provided, and mask-free editing. MagicBrush comprises over 10K manually annotated triplets (source image, instruction, target image), which support training large-scale text-guided image editing models. We fine-tune InstructPix2Pix on MagicBrush and show that the resulting model produces much better images according to human evaluation. We further conduct extensive experiments to evaluate current image editing baselines along multiple dimensions, including quantitative, qualitative, and human evaluations. The results reveal the challenging nature of our dataset and the gap between current baselines and real-world editing needs.