HiAR: Efficient Autoregressive Long Video Generation via Hierarchical Denoising

Autoregressive (AR) diffusion offers a promising framework for generating videos of theoretically infinite length. However, a major challenge is maintaining temporal continuity while preventing the progressive quality degradation caused by error accumulation. To ensure continuity, existing methods typically condition on highly denoised contexts; yet this practice propagates prediction errors with high certainty, thereby exacerbating degradation. In this paper, we argue that a highly clean context is unnecessary. Drawing inspiration from bidirectional diffusion models, which denoise frames at a shared noise level while maintaining coherence, we propose that conditioning on context at the same noise level as the current block provides sufficient signal for temporal consistency while effectively mitigating error propagation. Building on this insight, we propose HiAR, a hierarchical denoising framework that reverses the conventional generation order: instead of completing each block sequentially, it performs causal generation across all blocks at every denoising step, so that each block is always conditioned on context at the same noise level. This hierarchy naturally admits pipelined parallel inference, yielding a 1.8× wall-clock speedup in our 4-step setting. We further observe that self-rollout distillation under this paradigm amplifies a low-motion shortcut inherent to the mode-seeking reverse-KL objective. To counteract this, we introduce a forward-KL regulariser in bidirectional-attention mode, which preserves motion diversity for causal inference without interfering with the distillation loss. On VBench (20s generation), HiAR achieves the best overall score and the lowest temporal drift among all compared methods.
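
The reversed generation order described above can be illustrated with a small scheduling sketch. It is not the authors' implementation: the function names are hypothetical, and the dependency rule is an assumption read off the abstract, namely that the task for block b at denoising step s needs only block b-1 at the same step (its same-noise-level context) and block b at the previous step (its own earlier state). Under that assumption, tasks with the same value of s + b are mutually independent and could run in parallel, which is one way to picture the pipelined inference mentioned above.

```python
from itertools import product


def block_sequential_order(num_blocks, num_steps):
    """Conventional AR order: run every denoising step of a block before the next block."""
    return [(step, block) for block in range(num_blocks) for step in range(num_steps)]


def hierarchical_order(num_blocks, num_steps):
    """Hierarchical order: at each denoising step, sweep causally over all blocks."""
    return [(step, block) for step in range(num_steps) for block in range(num_blocks)]


def wavefront_schedule(num_blocks, num_steps):
    """Group (step, block) tasks into waves that could run concurrently.

    Assumption (not confirmed by the source): task (s, b) depends only on
    (s, b - 1) and (s - 1, b), so all tasks with equal s + b are independent.
    """
    waves = [[] for _ in range(num_blocks + num_steps - 1)]
    for step, block in product(range(num_steps), range(num_blocks)):
        waves[step + block].append((step, block))
    return waves


if __name__ == "__main__":
    # Toy sizes chosen for illustration only.
    print(block_sequential_order(3, 2))
    print(hierarchical_order(3, 2))
    for index, wave in enumerate(wavefront_schedule(5, 4)):
        print(index, wave)
```

In this toy model, 5 blocks and 4 steps give 20 tasks arranged in 8 waves. That count is only an idealised property of the assumed dependency graph; the 1.8× figure in the abstract is a measured wall-clock result and is not derived from this sketch.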
