Masked auto-regressive diffusion models (MAR) benefit from the expressive modeling ability of diffusion models and the flexibility of masked auto-regressive ordering. However, vanilla MAR suffers from slow inference due to its hierarchical inference mechanism: an outer AR unmasking loop and an inner diffusion denoising chain. This decoupled structure not only harms generation efficiency but also hinders the practical use of MAR for reinforcement learning (RL), an increasingly important paradigm for generative model post-training. To address this fundamental issue, we introduce MARVAL (Masked Auto-regressive Variational Acceleration), a distillation-based framework that compresses the diffusion chain into a single AR generation step while preserving the flexible auto-regressive unmasking order. Distillation with MARVAL not only yields substantial inference acceleration but, crucially, makes RL post-training with verifiable rewards practical, resulting in scalable, human-preferred fast generative models. Our contributions are twofold: (1) a novel score-based variational objective for distilling masked auto-regressive diffusion models into a single generation step without sacrificing sample quality; and (2) MARVAL-RL, an efficient RL framework for masked auto-regressive models. On ImageNet 256×256, MARVAL-Huge achieves an FID of 2.00 with more than a 30× speedup over MAR-diffusion, and MARVAL-RL yields consistent improvements in CLIP and image-reward scores on ImageNet datasets with entity names. In conclusion, MARVAL demonstrates the first practical path to distillation and RL post-training of masked auto-regressive diffusion models, enabling fast sampling and better preference alignment.