Text-to-video generation enhances content creation but is highly computationally intensive: the computational cost of Diffusion Transformers (DiTs) scales quadratically in the number of pixels. This makes minute-length video generation extremely expensive, limiting most existing models to generating videos only 10-20 seconds in length. We propose a Linear-complexity text-to-video Generation (LinGen) framework whose cost scales linearly in the number of pixels. For the first time, LinGen enables high-resolution minute-length video generation on a single GPU without compromising quality. It replaces the computationally dominant, quadratic-complexity block, self-attention, with a linear-complexity block called MATE, which consists of an MA-branch and a TE-branch. The MA-branch targets short-to-long-range correlations, combining a bidirectional Mamba2 block with our token rearrangement method, Rotary Major Scan, and our review tokens developed for long video generation. The TE-branch is a novel TEmporal Swin Attention block that focuses on temporal correlations between adjacent tokens and medium-range tokens. The MATE block addresses the adjacency preservation issue of Mamba and significantly improves the consistency of generated videos. Experimental results show that LinGen outperforms DiT in video quality (with a 75.6% win rate) while reducing FLOPs by up to 15$\times$ and latency by up to 11.5$\times$. Furthermore, both automatic metrics and human evaluation demonstrate that our LinGen-4B yields video quality comparable to state-of-the-art models (with win rates of 50.5%, 52.1%, and 49.1% with respect to Gen-3, LumaLabs, and Kling, respectively). This paves the way to hour-length movie generation and real-time interactive video generation. We provide 68s video generation results and more examples on our project website: https://lineargen.github.io/.
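To make the quadratic-versus-linear scaling claim concrete, the following is a minimal sketch of a back-of-the-envelope cost model. It is not LinGen's actual FLOPs accounting: the function names, the hidden width `dim`, and the fixed `state_size` of the linear token mixer are all hypothetical values chosen only for illustration, and the token count is assumed to be proportional to the number of pixels.

```python
def attention_cost(num_tokens: int, dim: int) -> int:
    """Rough multiply-add count of full self-attention.

    Counts only the two N x N products (Q K^T and attention times V),
    so the cost grows with the square of the token count.
    """
    return 2 * num_tokens * num_tokens * dim


def linear_mixer_cost(num_tokens: int, dim: int, state_size: int = 128) -> int:
    """Rough multiply-add count of a hypothetical linear-time token mixer.

    Each token interacts with a fixed-size state instead of every other
    token, so the cost grows in proportion to the token count.
    state_size is an illustrative assumption, not a value from the paper.
    """
    return 2 * num_tokens * dim * state_size


def cost_ratio(num_tokens: int, dim: int = 1024, state_size: int = 128) -> float:
    """How many times more expensive attention is under this toy model."""
    return attention_cost(num_tokens, dim) / linear_mixer_cost(
        num_tokens, dim, state_size
    )


if __name__ == "__main__":
    # Doubling the token count (for example, doubling video length at a
    # fixed resolution) quadruples the attention cost but only doubles
    # the cost of the linear mixer.
    for n in (16_384, 32_768, 65_536):
        print(n, attention_cost(n, 1024), linear_mixer_cost(n, 1024), cost_ratio(n))
```

Under this toy model the gap between the two costs widens in proportion to the token count, which is why the savings from a linear-complexity block become larger as videos get longer or higher in resolution. The measured speedups reported above depend on the full model and hardware, not on this simplified count.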