We present OminiControl, a novel approach that rethinks how image conditions are integrated into Diffusion Transformer (DiT) architectures. Current image conditioning methods either introduce substantial parameter overhead or handle only specific control tasks effectively, limiting their practical versatility. OminiControl addresses these limitations through three key innovations: (1) a minimal architectural design that leverages the DiT's own VAE encoder and transformer blocks, requiring just 0.1% additional parameters; (2) a unified sequence processing strategy that combines condition tokens with image tokens for flexible token interactions; and (3) a dynamic position encoding mechanism that adapts to both spatially-aligned and non-aligned control tasks. Our extensive experiments show that this streamlined approach not only matches but surpasses the performance of specialized methods across multiple conditioning tasks. To overcome data limitations in subject-driven generation, we also introduce Subjects200K, a large-scale dataset of identity-consistent image pairs synthesized using DiT models themselves. This work demonstrates that effective image control can be achieved without architectural complexity, opening new possibilities for efficient and versatile image generation systems.
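The unified sequence strategy in (2) can be sketched as follows: condition tokens are concatenated with the noisy image tokens into a single sequence that the transformer's self-attention processes jointly, so condition and image tokens interact without any dedicated cross-attention module. The NumPy single-head attention below is a minimal illustration of this idea; all names, shapes, and token counts are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def scaled_dot_product_attention(q, k, v):
    # Standard attention: softmax(Q K^T / sqrt(d)) V
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

# Illustrative token counts and feature width (assumptions).
n_img, n_cond, d = 16, 16, 8
rng = np.random.default_rng(0)
img_tokens = rng.standard_normal((n_img, d))    # noisy image latents
cond_tokens = rng.standard_normal((n_cond, d))  # VAE-encoded condition image

# Unified sequence: condition and image tokens share one attention pass,
# so every condition token can attend to every image token and vice versa.
seq = np.concatenate([cond_tokens, img_tokens], axis=0)
out = scaled_dot_product_attention(seq, seq, seq)
img_out = out[n_cond:]  # image tokens, now updated with condition information
print(img_out.shape)  # (16, 8)
```

Because the condition image is encoded with the model's own VAE into tokens of the same width as the image latents, this concatenation requires no new projection layers, which is consistent with the abstract's claim of roughly 0.1% additional parameters.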