LinGen: Towards High-Resolution Minute-Length Text-to-Video Generation with Linear Computational Complexity

Hongjie Wang, Chih-Yao Ma, Yen-Cheng Liu, Ji Hou, Tao Xu, Jialiang Wang, Felix Juefei-Xu, Yaqiao Luo, Peizhao Zhang, Tingbo Hou, Peter Vajda, Niraj K. Jha, Xiaoliang Dai

From arXiv, 20 pages, 20 figures

Text-to-video generation enhances content creation but is highly computationally intensive: the computational cost of Diffusion Transformers (DiTs) scales quadratically in the number of pixels. This makes minute-length video generation extremely expensive, limiting most existing models to generating videos of only 10-20 seconds in length. We propose a Linear-complexity text-to-video Generation (LinGen) framework whose cost scales linearly in the number of pixels. For the first time, LinGen enables high-resolution minute-length video generation on a single GPU without compromising quality. It replaces the computationally dominant, quadratic-complexity block, self-attention, with a linear-complexity block called MATE, which consists of an MA-branch and a TE-branch. The MA-branch targets short-to-long-range correlations, combining a bidirectional Mamba2 block with our token rearrangement method, Rotary Major Scan, and our review tokens developed for long video generation. The TE-branch is a novel TEmporal Swin Attention block that focuses on temporal correlations between adjacent tokens and medium-range tokens. The MATE block addresses the adjacency preservation issue of Mamba and significantly improves the consistency of generated videos. Experimental results show that LinGen outperforms DiT in video quality (with a 75.6% win rate) while reducing FLOPs by up to 15$\times$ and latency by up to 11.5$\times$. Furthermore, both automatic metrics and human evaluation demonstrate that our LinGen-4B yields video quality comparable to state-of-the-art models (with win rates of 50.5%, 52.1%, and 49.1% with respect to Gen-3, LumaLabs, and Kling, respectively). This paves the way to hour-length movie generation and real-time interactive video generation. We provide 68s video generation results and more examples on our project website: https://lineargen.github.io/.
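To make the scaling argument concrete, the following is a minimal sketch of how a quadratic-complexity token mixer and a linear-complexity one diverge as videos get longer. It is not LinGen's cost model: the function names, the frame rate, the latent grid size, the hidden width, and the per-token factor are all hypothetical values chosen for illustration, and only the growth rates (quadratic versus linear in the number of tokens) reflect the claim in the abstract.

```python
# Illustrative cost model only; every constant here is an assumption,
# not a figure taken from the paper.


def num_tokens(seconds: int, fps: int = 16, latent_h: int = 32, latent_w: int = 32) -> int:
    """Hypothetical token count: one token per latent position per frame."""
    return seconds * fps * latent_h * latent_w


def quadratic_mixer_cost(n: int, dim: int) -> int:
    """Rough multiply-add count of self-attention.

    Covers the QK^T product and the attention-times-V product,
    each about n * n * dim operations.
    """
    return 2 * n * n * dim


def linear_mixer_cost(n: int, dim: int, per_token_factor: int = 64) -> int:
    """Rough cost of a linear-complexity mixer.

    Assumes a fixed amount of work per token and channel;
    per_token_factor is a made-up constant standing in for that work.
    """
    return n * dim * per_token_factor


if __name__ == "__main__":
    dim = 1024  # hypothetical hidden width
    for seconds in (10, 20, 60):
        n = num_tokens(seconds)
        ratio = quadratic_mixer_cost(n, dim) / linear_mixer_cost(n, dim)
        print(f"{seconds:>3}s  tokens={n:>9,}  quadratic/linear cost ratio={ratio:,.0f}")
```

Under these assumptions, doubling the video length doubles the cost of the linear mixer but quadruples the cost of the quadratic one, so the gap between them widens in proportion to the number of tokens. This is the reason longer, higher-resolution videos benefit most from a linear-complexity block.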
