OmniEdit: Building Image Editing Generalist Models Through Specialist Supervision

Instruction-guided image editing methods have demonstrated significant potential by training diffusion models on automatically synthesized or manually annotated image editing pairs. However, these methods remain far from practical, real-life applications. We identify three primary challenges contributing to this gap. First, existing models have limited editing skills because the synthesis process is biased. Second, these methods are trained on datasets with a high volume of noise and artifacts, a consequence of simple filtering methods such as CLIP-score. Third, all these datasets are restricted to a single low resolution and a fixed aspect ratio, which limits their versatility in real-world use cases. In this paper, we present OmniEdit, an omnipotent editor that handles seven different image editing tasks with any aspect ratio seamlessly. Our contributions are fourfold: (1) OmniEdit is trained with supervision from seven different specialist models to ensure task coverage; (2) we use importance sampling based on scores provided by large multimodal models (such as GPT-4o), instead of CLIP-score, to improve data quality; (3) we propose a new editing architecture called EditNet to greatly boost the editing success rate; (4) we provide images with different aspect ratios to ensure that our model can handle any image in the wild. We have curated a test set containing images of different aspect ratios, accompanied by diverse instructions that cover different tasks. Both automatic and human evaluations demonstrate that OmniEdit significantly outperforms all existing models. Our code, dataset and model will be available at https://tiger-ai-lab.github.io/OmniEdit/
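
As a rough illustration of contribution (2), the sketch below draws training pairs with probability proportional to a quality score, rather than keeping everything above a fixed threshold. It is a minimal sketch under stated assumptions, not the authors' pipeline: the function name `importance_sample`, the 0-10 score scale, and the example scores are all hypothetical, and in practice the scores would come from a large multimodal model rating each editing pair.

```python
import random


def importance_sample(candidates, scores, k, seed=0):
    """Draw up to k candidates without replacement, with probability
    proportional to score. Hypothetical helper, for illustration only."""
    if len(candidates) != len(scores):
        raise ValueError("candidates and scores must have the same length")
    rng = random.Random(seed)
    # Candidates with a non-positive score can never be selected.
    pool = [(c, s) for c, s in zip(candidates, scores) if s > 0]
    chosen = []
    for _ in range(min(k, len(pool))):
        total = sum(s for _, s in pool)
        r = rng.uniform(0, total)
        acc = 0.0
        for i, (c, s) in enumerate(pool):
            acc += s
            if r <= acc:
                chosen.append(c)
                pool.pop(i)
                break
        else:
            # Guard against floating-point round-off: fall back to the last item.
            chosen.append(pool.pop()[0])
    return chosen


# Hypothetical editing pairs and hypothetical quality scores on a 0-10 scale.
pairs = ["pair_a", "pair_b", "pair_c", "pair_d"]
scores = [9.0, 0.0, 7.5, 3.0]
print(importance_sample(pairs, scores, k=2, seed=1))
```

Compared with a hard cutoff, this kind of weighted draw keeps some lower-scored pairs in the training mix while still favoring the highly rated ones; how the actual scores are produced and turned into sampling weights is not specified in the abstract.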

