Masked autoregressive models (MAR) have emerged as a powerful paradigm for image and video generation, combining the flexibility of masked modeling with the expressiveness of continuous tokenizers. However, when sampling individual frames, video MAR models often produce heavily distorted outputs because they lack a structured global prior, especially when only a few sampling steps are used. To address this, we propose CanvasMAR, a novel autoregressive video prediction model that predicts high-fidelity frames in few sampling steps by introducing a canvas: a blurred, global one-step prediction of the next frame that serves as a non-uniform mask during masked generation. The canvas supplies global structure early in sampling, enabling faster and more coherent frame synthesis. To further stabilize autoregressive sampling, we propose an easy-to-hard curriculum via a motion-aware sampling order that synthesizes relatively stationary regions before attending to highly dynamic ones. We also integrate compositional classifier-free guidance that jointly strengthens the canvas and temporal conditioning to improve generation fidelity. Experiments on the BAIR, UCF-101, and Kinetics-600 benchmarks demonstrate that CanvasMAR produces higher-quality videos with fewer autoregressive steps. On the challenging Kinetics-600 dataset, CanvasMAR achieves remarkable performance among autoregressive models and rivals advanced diffusion-based methods.