HiDream-O1-Image: A Natively Unified Image Generative Foundation Model with Pixel-level Unified Transformer

Qi Cai, Jingwen Chen, Chengmin Gao, Zijian Gong, Yehao Li, Yingwei Pan, Yi Peng, Zhaofan Qiu, Kai Yu, Yiheng Zhang, Hao Ai, Siying Bai, Yang Chen, Zhihui Chen, Fengbin Gao, Ying Guo, Dong Li, Zhen Shen, Leilei Shi, Jing Wang, Siyu Wang, Yimeng Wang, Rui Zheng, Ting Yao, Tao Mei

From arXiv. Source code and models are available on GitHub: https://github.com/HiDream-ai/HiDream-O1-Image and Hugging Face: https://huggingface.co/HiDream-ai/HiDream-O1-Image

The evolution of visual generative models has long been constrained by fragmented architectures that rely on disjoint text encoders and external VAEs. In this report, we present HiDream-O1-Image, a natively unified generative foundation model built on a pixel-space Diffusion Transformer, which shifts the paradigm from modular architectures to an end-to-end in-context visual generation engine. By mapping raw image pixels, text tokens, and task-specific conditions into a single shared token space, HiDream-O1-Image structurally unifies multimodal inputs within a Unified Transformer (UiT) architecture. This native encoding paradigm eliminates the need for separate VAEs or disjoint pre-trained text encoders, allowing the model to treat diverse generation and editing tasks as a consistent in-context reasoning process. Extensive experiments show that HiDream-O1-Image excels across various generation tasks, including text-to-image generation, instruction-based editing, and subject-driven personalization. Notably, with only 8B parameters, HiDream-O1-Image (8B) matches or even surpasses established state-of-the-art models with significantly more parameters (e.g., the 27B Qwen-Image). To validate the scalability of this paradigm, we further scale the architecture to over 200B parameters. Experimental results show that this large-scale version, HiDream-O1-Image-Pro (200B+), delivers stronger generative capabilities and superior performance, establishing new state-of-the-art results. Overall, HiDream-O1-Image highlights the potential of natively unified architectures and charts a scalable path toward next-generation multimodal AI.
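
To make the idea of a single shared token space concrete, the following is a minimal, illustrative sketch and not the authors' implementation. It assumes that raw pixel patches are flattened and linearly projected to the model width, that text ids are looked up in an embedding table, and that the two token lists are concatenated into one sequence for a single transformer. The function names (patchify, build_sequence), the patch size, the model width, and the random weights are all hypothetical; the abstract does not specify any of them.

```python
import random


def patchify(image, patch):
    """Split an H x W x C image (nested lists) into flattened patch vectors."""
    h, w = len(image), len(image[0])
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    patches = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            vec = []
            for y in range(top, top + patch):
                for x in range(left, left + patch):
                    vec.extend(image[y][x])  # append the C channel values
            patches.append(vec)
    return patches


def random_matrix(rows, cols, rng):
    """Stand-in for learned weights (hypothetical; real weights are trained)."""
    return [[rng.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]


def linear(vec, weight):
    """Project a vector with a (len(vec) x d_model) weight matrix."""
    d_model = len(weight[0])
    return [sum(v * row[j] for v, row in zip(vec, weight)) for j in range(d_model)]


def build_sequence(text_ids, image, patch, d_model, vocab_size, seed=0):
    """Map text ids and raw pixels into one token sequence of width d_model."""
    rng = random.Random(seed)
    channels = len(image[0][0])
    text_table = random_matrix(vocab_size, d_model, rng)
    pixel_proj = random_matrix(patch * patch * channels, d_model, rng)

    text_tokens = [text_table[i] for i in text_ids]  # embedding lookup
    image_tokens = [linear(p, pixel_proj) for p in patchify(image, patch)]
    # No VAE latent and no separate text encoder: both modalities become
    # rows of the same sequence that one transformer would consume.
    return text_tokens + image_tokens


# Usage: 3 text tokens plus a gray 8x8 RGB image cut into 4x4 patches.
image = [[[0.5, 0.5, 0.5] for _ in range(8)] for _ in range(8)]
seq = build_sequence([3, 1, 4], image, patch=4, d_model=16, vocab_size=10)
print(len(seq), len(seq[0]))  # 7 16
```

The sequence has 3 text tokens plus (8 / 4)^2 = 4 image tokens, each of width 16. In this sketch, task-specific conditions such as a reference image for editing would simply be patchified the same way and appended to the same sequence.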
