OmniGen2: Towards Instruction-Aligned Multimodal Generation

Chenyuan Wu, Pengfei Zheng, Ruiran Yan, Shitao Xiao, Xin Luo, Yueze Wang, Wanli Li, Xiyan Jiang, Yexin Liu, Junjie Zhou, Ze Liu, Ziyi Xia, Chaofan Li, Haoge Deng, Jiahao Wang, Kun Luo, Bo Zhang, Defu Lian, Xinlong Wang, Zhongyuan Wang, Tiejun Huang, Zheng Liu

In this work, we introduce OmniGen2, a versatile and open-source generative model designed to provide a unified solution for diverse generation tasks, including text-to-image, image editing, and in-context generation. Unlike OmniGen v1, OmniGen2 features two distinct decoding pathways for text and image modalities, utilizing unshared parameters and a decoupled image tokenizer. This design enables OmniGen2 to build upon existing multimodal understanding models without the need to re-adapt VAE inputs, thereby preserving the original text generation capabilities. To facilitate the training of OmniGen2, we developed comprehensive data construction pipelines, encompassing image editing and in-context generation data. Additionally, we introduce a reflection mechanism tailored for image generation tasks and curate a dedicated reflection dataset based on OmniGen2. Despite its relatively modest parameter size, OmniGen2 achieves competitive results on multiple task benchmarks, including text-to-image and image editing. To further evaluate in-context generation, also referred to as subject-driven tasks, we introduce a new benchmark named OmniContext. OmniGen2 achieves state-of-the-art performance among open-source models in terms of consistency. We will release our models, training code, datasets, and data construction pipeline to support future research in this field. Project Page: https://vectorspacelab.github.io/OmniGen2; GitHub Link: https://github.com/VectorSpaceLab/OmniGen2
