Long video generation with Diffusion Transformers (DiTs) is bottlenecked by the quadratic scaling of full attention with sequence length. Because attention is highly redundant, its output is dominated by a small subset of query-key pairs. Existing sparse methods rely on blockwise coarse estimation, whose accuracy-efficiency trade-offs are constrained by block size. This paper introduces Mixture-of-Groups Attention (MoGA), an efficient sparse attention mechanism that uses a lightweight, learnable token router to match tokens precisely, without blockwise estimation. Through semantic-aware routing, MoGA enables effective long-range interactions. As a kernel-free method, MoGA integrates seamlessly with modern attention stacks, including FlashAttention and sequence parallelism. Building on MoGA, we develop an efficient long video generation model that produces minute-long, multi-shot, 480p videos at 24 fps end to end, with a context length of approximately 580k. Comprehensive experiments on various video generation tasks validate the effectiveness of our approach.
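To make the core idea concrete, below is a minimal sketch of group-based sparse attention with a learned token router, assuming hard (argmax) routing and a masked scaled-dot-product-attention realization. The class and parameter names (`MoGASketch`, `num_groups`) are illustrative assumptions, not the paper's implementation; in particular, a masked dense attention is still quadratic, and the practical speedup comes from gathering tokens into balanced groups and running dense (e.g., Flash) attention per group, which the mask below only emulates.

```python
# A minimal sketch, assuming: a lightweight linear router assigns each token
# to one of G groups, and tokens attend only within their own group.
import torch
import torch.nn.functional as F
from torch import nn

class MoGASketch(nn.Module):  # hypothetical name, for illustration only
    def __init__(self, dim: int, num_heads: int, num_groups: int):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.router = nn.Linear(dim, num_groups)  # lightweight token router
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, N, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Hard group assignment per token (argmax over router logits).
        # A trained router would need soft or straight-through routing for
        # gradients; argmax keeps the sketch simple.
        group = self.router(x).argmax(dim=-1)                  # (B, N)
        # Tokens attend only to tokens in the same group. The diagonal is
        # always True (every token shares a group with itself), so no
        # attention row is fully masked out.
        same_group = group.unsqueeze(2) == group.unsqueeze(1)  # (B, N, N)
        mask = same_group.unsqueeze(1)                         # (B, 1, N, N)

        def split(t):
            return t.view(B, N, self.num_heads, self.head_dim).transpose(1, 2)

        out = F.scaled_dot_product_attention(
            split(q), split(k), split(v), attn_mask=mask)
        out = out.transpose(1, 2).reshape(B, N, C)
        return self.proj(out)

x = torch.randn(2, 128, 64)
print(MoGASketch(dim=64, num_heads=4, num_groups=8)(x).shape)  # (2, 128, 64)
```

Because group membership is decided by token content rather than spatial block position, semantically related tokens far apart in the sequence can land in the same group and interact directly, which is what distinguishes this routing from blockwise estimation.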