Autoregressive large language models (LLMs) have unified a vast range of language tasks, inspiring preliminary efforts in autoregressive (AR) video generation. Existing AR video generators diverge from standard LLM architectures, depend on bulky external text encoders, or incur prohibitive latency due to next-token decoding. In this paper, we introduce Lumos-1, an LLM-based unified model for AR video generation with efficient discrete diffusion. First, to fit videos to LLMs, we identify that 1D RoPE is ill-suited for modeling visual spatiotemporal correlations and that naive 3D RoPE, although demonstrated to be useful, exhibits imbalanced frequency spectra. We therefore propose MM-RoPE, which preserves the original textual RoPE while seamlessly accommodating video data with comprehensive frequency spectra and scaled 3D positions. Second, to fit the nature of video data and overcome the inefficiency of next-token decoding, we adopt parallel, mask-based discrete diffusion with an attention mask that is bidirectional within each frame and causal across frames. Based on this attention mask, we uncover a frame-wise loss imbalance caused by spatial information redundancy and propose Autoregressive Discrete Diffusion Forcing, which introduces temporal tube masking during training together with a compatible inference-time masking policy to avoid quality degradation. Despite using only 48 GPUs for pre-training and fine-tuning, limited data, and a discrete tokenizer, Lumos-1 achieves results surpassing those of Show-o2 on GenEval, COSMOS-Video2World on VBench-I2V, and OpenSoraPlan on VBench-T2V. Code and models are available at https://github.com/alibaba-damo-academy/Lumos.
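To make the attention pattern and the tube masking named in the abstract concrete, the following is a minimal sketch and not the released implementation. It assumes that video tokens are laid out frame by frame, that text tokens are omitted for simplicity, and that "temporal tube masking" means one spatial mask pattern is sampled and repeated across all frames. The function names `frame_causal_mask` and `temporal_tube_mask` and their parameters are hypothetical.

```python
import random


def frame_causal_mask(num_frames, tokens_per_frame):
    """Build a boolean attention mask over flattened video tokens.

    allowed[i][j] is True when query token i may attend to key token j.
    Tokens in the same frame see each other (bidirectional), and every
    token also sees all tokens of earlier frames (causal across frames).
    Assumption: tokens are ordered frame by frame; text tokens are omitted.
    """
    n = num_frames * tokens_per_frame
    frame_of = [i // tokens_per_frame for i in range(n)]
    return [[frame_of[j] <= frame_of[i] for j in range(n)] for i in range(n)]


def temporal_tube_mask(num_frames, tokens_per_frame, mask_ratio, rng):
    """Sample one spatial mask and repeat it along the temporal axis.

    Returns a num_frames x tokens_per_frame grid where True marks a
    masked token. Assumption: a "tube" masks the same spatial positions
    in every frame, so a masked token cannot simply be copied from the
    same position in a previous, unmasked frame.
    """
    num_masked = round(mask_ratio * tokens_per_frame)
    masked_positions = set(rng.sample(range(tokens_per_frame), num_masked))
    spatial = [p in masked_positions for p in range(tokens_per_frame)]
    return [list(spatial) for _ in range(num_frames)]


if __name__ == "__main__":
    attn = frame_causal_mask(num_frames=2, tokens_per_frame=2)
    for row in attn:
        print(["x" if ok else "." for ok in row])

    tube = temporal_tube_mask(3, 4, mask_ratio=0.5, rng=random.Random(0))
    for frame in tube:
        print(["M" if m else "_" for m in frame])
```

In this sketch, the repeated spatial pattern is what distinguishes tube masking from sampling an independent random mask per frame. Under per-frame masking, later frames could often recover a masked token from the same location in an earlier frame, which is one plausible reading of the spatial redundancy the abstract refers to.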