Instruction-guided image editing methods have demonstrated significant potential by training diffusion models on automatically synthesized or manually annotated image editing pairs. However, these methods remain far from practical, real-life applications. We identify three primary challenges contributing to this gap. First, existing models have limited editing skills because the data synthesis process is biased. Second, these methods are trained on datasets with a high volume of noise and artifacts, a consequence of simple filtering methods such as CLIP-score. Third, all these datasets are restricted to a single low resolution and a fixed aspect ratio, limiting the models' versatility in real-world use cases. In this paper, we present \omniedit, an omnipotent editor that seamlessly handles seven different image editing tasks at any aspect ratio. Our contribution is fourfold: (1) \omniedit is trained with supervision from seven different specialist models to ensure task coverage; (2) we use importance sampling based on scores from large multimodal models (such as GPT-4o), instead of CLIP-score, to improve data quality; (3) we propose a new editing architecture called EditNet to greatly boost the editing success rate; and (4) we provide images with different aspect ratios to ensure that our model can handle any image in the wild. We have curated a test set containing images of different aspect ratios, accompanied by diverse instructions covering different tasks. Both automatic and human evaluations demonstrate that \omniedit significantly outperforms all existing models. Our code, dataset, and model will be available at \url{https://tiger-ai-lab.github.io/OmniEdit/}.
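As a rough illustration of contribution (2), score-based data selection might be sketched as follows. The abstract does not specify the sampling rule, so this is only one plausible reading: the function name `sample_by_score`, the `min_score` threshold, and the score scale are assumptions made for this sketch, and the scores themselves would come from a large multimodal model scorer that is not shown. The sketch draws pairs without replacement with probability proportional to their score, using the Efraimidis-Spirakis weighted sampling scheme.

```python
import random


def sample_by_score(pairs, scores, k, min_score=0.0, seed=0):
    """Draw k editing pairs without replacement, favoring higher scores.

    pairs     -- candidate training pairs (any objects)
    scores    -- one quality score per pair (hypothetical LMM ratings)
    min_score -- pairs scoring at or below this value are discarded (assumed knob)
    """
    if len(pairs) != len(scores):
        raise ValueError("pairs and scores must have the same length")
    if min_score < 0:
        raise ValueError("min_score must be non-negative")

    # Hard filter first: drop pairs the scorer rated too low.
    candidates = [(p, s) for p, s in zip(pairs, scores) if s > min_score]
    if k > len(candidates):
        raise ValueError("not enough candidates above min_score")

    rng = random.Random(seed)
    # Efraimidis-Spirakis: give each item the key u ** (1 / weight) with
    # u uniform in [0, 1), then keep the k items with the largest keys.
    keyed = [(rng.random() ** (1.0 / s), p) for p, s in candidates]
    keyed.sort(key=lambda item: item[0], reverse=True)
    return [p for _, p in keyed[:k]]


if __name__ == "__main__":
    demo_pairs = ["pair_a", "pair_b", "pair_c", "pair_d"]
    demo_scores = [9.0, 0.0, 8.0, 1.0]  # made-up scores for illustration
    print(sample_by_score(demo_pairs, demo_scores, k=2, min_score=0.5))
```

Compared with a hard threshold alone, weighting by score keeps some lower-rated but acceptable pairs in the pool, which may help preserve diversity; whether that trade-off matches the paper's actual procedure is not stated in the abstract.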