In this paper, we introduce OminiControl, a highly versatile and parameter-efficient framework that integrates image conditions into pre-trained Diffusion Transformer (DiT) models. At its core, OminiControl leverages a parameter reuse mechanism, enabling the DiT to encode image conditions with itself as a powerful backbone and to process them with its flexible multi-modal attention processors. Unlike existing methods, which rely heavily on additional encoder modules with complex architectures, OminiControl (1) effectively and efficiently incorporates injected image conditions with only ~0.1% additional parameters, and (2) addresses a wide range of image conditioning tasks in a unified manner, including subject-driven generation and spatially-aligned conditions such as edges and depth maps. Remarkably, these capabilities are achieved by training on images generated by the DiT itself, which is particularly beneficial for subject-driven generation. Extensive evaluations demonstrate that OminiControl outperforms existing UNet-based and DiT-adapted models in both subject-driven and spatially-aligned conditional generation. Additionally, we release our training dataset, Subjects200K, a diverse collection of over 200,000 identity-consistent images, along with an efficient data synthesis pipeline to advance research in subject-consistent generation.