Recent advances in video generation have been dominated by diffusion and flow-matching models, which produce high-quality results but remain computationally intensive and difficult to scale. In this work, we introduce VideoAR, the first large-scale Visual Autoregressive (VAR) framework for video generation that combines multi-scale next-frame prediction with autoregressive modeling. VideoAR disentangles spatial and temporal dependencies by integrating intra-frame VAR modeling with causal next-frame prediction, supported by a 3D multi-scale tokenizer that efficiently encodes spatio-temporal dynamics. To improve long-term consistency, we propose Multi-scale Temporal RoPE, Cross-Frame Error Correction, and Random Frame Mask, which collectively mitigate error propagation and stabilize temporal coherence. Our multi-stage pretraining pipeline progressively aligns spatial and temporal learning across increasing resolutions and durations. Empirically, VideoAR achieves new state-of-the-art results among autoregressive models, lowering FVD on UCF-101 from 99.5 to 88.6 while reducing inference steps by over 10x, and reaching a VBench score of 81.74, which is competitive with diffusion-based models an order of magnitude larger. These results demonstrate that VideoAR narrows the performance gap between autoregressive and diffusion paradigms, offering a scalable, efficient, and temporally consistent foundation for future video generation research.
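To make the generation order concrete, the following is a minimal sketch of how causal next-frame prediction can be combined with coarse-to-fine intra-frame prediction, and why this needs far fewer sequential steps than token-by-token decoding. It is an illustration under stated assumptions, not the VideoAR implementation: the function names, the scale list `[1, 2, 4, 8, 16]`, and the frame count are hypothetical, and the sketch treats each frame independently of any temporal grouping the 3D tokenizer may perform. The resulting step counts are illustrative only and are not the figures reported above.

```python
from typing import List, Tuple


def multi_scale_schedule(num_frames: int, scales: List[int]) -> List[Tuple[int, int, int]]:
    """Enumerate sequential decoding steps as (frame_index, scale_side, tokens_in_step).

    Frames are visited in causal order. Within a frame, token maps are
    predicted coarse to fine, and all tokens of one scale are assumed to be
    produced in a single parallel step (the VAR-style assumption).
    """
    schedule = []
    for frame_index in range(num_frames):      # causal next-frame order
        for side in scales:                    # coarse-to-fine within the frame
            schedule.append((frame_index, side, side * side))
    return schedule


def raster_token_steps(num_frames: int, final_side: int) -> int:
    """Sequential steps if every token of the finest map were decoded one at a time."""
    return num_frames * final_side * final_side


# Hypothetical configuration, chosen only for illustration.
scales = [1, 2, 4, 8, 16]
num_frames = 4

schedule = multi_scale_schedule(num_frames, scales)
multi_scale_steps = len(schedule)                           # 4 frames * 5 scales = 20
token_steps = raster_token_steps(num_frames, scales[-1])    # 4 * 16 * 16 = 1024
```

In this toy setting the number of sequential steps grows with the number of frames times the number of scales, rather than with the number of tokens, which is the general reason a multi-scale autoregressive scheme can decode in fewer steps than next-token prediction.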