Sora unveils the potential of scaling Diffusion Transformers to generate photorealistic images and videos at arbitrary resolutions, aspect ratios, and durations, yet it still lacks sufficient implementation details. In this technical report, we introduce the Lumina-T2X family, a series of Flow-based Large Diffusion Transformers (Flag-DiT) equipped with zero-initialized attention, as a unified framework designed to transform noise into images, videos, multi-view 3D objects, and audio clips conditioned on text instructions. By tokenizing the latent spatial-temporal space and incorporating learnable placeholders such as the [nextline] and [nextframe] tokens, Lumina-T2X seamlessly unifies the representations of different modalities across various spatial-temporal resolutions. This unified approach enables training different modalities within a single framework and allows flexible generation of multimodal data at any resolution, aspect ratio, and length during inference. Advanced techniques such as RoPE, RMSNorm, and flow matching enhance the stability, flexibility, and scalability of Flag-DiT, enabling Lumina-T2X models to scale up to 7 billion parameters and extend the context window to 128K tokens. This is particularly beneficial for creating ultra-high-definition images with our Lumina-T2I model and long 720p videos with our Lumina-T2V model. Remarkably, Lumina-T2I, powered by a 5-billion-parameter Flag-DiT, requires only 35% of the training computational cost of a 600-million-parameter naive DiT. Our further comprehensive analysis underscores Lumina-T2X's preliminary capabilities in resolution extrapolation, high-resolution editing, generating consistent 3D views, and synthesizing videos with seamless transitions. We expect that open-sourcing Lumina-T2X will further foster creativity, transparency, and diversity in the generative AI community.
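The placeholder-token scheme described above can be sketched in a few lines. This is a minimal, hypothetical illustration (not the authors' code): 2D latent patch rows are joined with a [nextline] token and frames are closed with a [nextframe] token, so latents of any resolution, aspect ratio, or length flatten into one 1D token sequence. The function name, shapes, and the use of fixed vectors for the (in practice learnable) placeholder embeddings are all illustrative assumptions.

```python
import numpy as np

def flatten_latents(latents, nextline_tok, nextframe_tok):
    """Illustrative flattening of (frames, height, width, dim) patch
    embeddings into a (seq_len, dim) sequence: a [nextline] token follows
    each row, and a [nextframe] token closes each frame."""
    frames = []
    for frame in latents:  # frame: (H, W, D)
        rows = [np.concatenate([row, nextline_tok[None]], axis=0)
                for row in frame]  # each row: W patches + [nextline]
        frames.append(np.concatenate(rows + [nextframe_tok[None]], axis=0))
    return np.concatenate(frames, axis=0)

F, H, W, D = 2, 3, 4, 8                 # toy video: 2 frames of 3x4 patches
latents = np.random.default_rng(0).normal(size=(F, H, W, D))
nextline = np.zeros(D)                  # stand-in; learnable in practice
nextframe = np.ones(D)                  # stand-in; learnable in practice
seq = flatten_latents(latents, nextline, nextframe)
# seq length = F * (H * (W + 1) + 1) = 2 * (3*5 + 1) = 32 tokens
```

Because the row and frame boundaries are marked explicitly in the sequence itself, the same transformer can consume images and videos of any shape without architectural changes.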
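The flow-matching objective mentioned in the abstract can likewise be sketched. The following is a minimal, assumption-laden toy (a tiny random MLP stands in for the Flag-DiT transformer; all names and sizes are illustrative): the model learns to predict the constant velocity of a straight-line path from noise to data.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy 2-layer MLP with random weights as a stand-in for the transformer;
# input is the noisy latent (8 dims) concatenated with the timestep (1 dim).
W1 = rng.normal(scale=0.1, size=(9, 64))
W2 = rng.normal(scale=0.1, size=(64, 8))

def velocity_model(xt, t):
    h = np.maximum(np.concatenate([xt, t], axis=-1) @ W1, 0.0)  # ReLU
    return h @ W2

def flow_matching_loss(x1):
    """x1: (batch, 8) clean latents. Linear path x_t = (1-t)*x0 + t*x1
    with target velocity v = x1 - x0; the model regresses v from (x_t, t)."""
    x0 = rng.normal(size=x1.shape)          # noise endpoint
    t = rng.uniform(size=(x1.shape[0], 1))  # uniform timestep in [0, 1)
    xt = (1.0 - t) * x0 + t * x1
    v_target = x1 - x0
    v_pred = velocity_model(xt, t)
    return float(np.mean((v_pred - v_target) ** 2))

x1 = rng.normal(size=(16, 8))               # fake batch of clean latents
loss = flow_matching_loss(x1)
```

Compared with discrete-time diffusion objectives, this continuous formulation needs no noise schedule, which is one reason the abstract credits flow matching with improving training stability and flexibility.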