While most prior work in video generation relies on bidirectional architectures, recent efforts have sought to adapt these models into autoregressive variants to support near real-time generation. However, such adaptations often depend heavily on teacher models, which can limit performance, particularly when no strong autoregressive teacher is available, so output quality typically lags behind that of bidirectional counterparts. In this paper, we explore an alternative approach that uses reward signals to guide the generation process, enabling more efficient and scalable autoregressive generation. Guiding the model with reward signals simplifies training while preserving high visual fidelity and temporal consistency. Through extensive experiments on standard benchmarks, we find that our approach performs comparably to existing autoregressive models and, in some cases, surpasses similarly sized bidirectional models by avoiding the constraints imposed by teacher architectures. For example, on VBench, our method achieves a total score of 84.92, closely matching state-of-the-art autoregressive methods that score 84.31 while requiring substantial heterogeneous distillation.