Autoregressive video diffusion models have emerged as a scalable paradigm for long video generation. However, they often suffer from severe extrapolation failure: rapid error accumulation causes significant temporal degradation once generation extends beyond the training horizon. We identify that this failure primarily stems from the \textit{spectral bias} of 3D positional embeddings and the lack of \textit{dynamic priors} in noise sampling. To address these issues, we propose \textbf{FLEX} (\textbf{F}requency-aware \textbf{L}ength \textbf{EX}tension), a training-free, inference-time framework that bridges the gap between short-horizon training and long-horizon inference. FLEX introduces Frequency-aware RoPE Modulation, which adaptively interpolates under-trained low-frequency components while extrapolating high-frequency ones, preserving multi-scale temporal discriminability. This is integrated with Antiphase Noise Sampling (ANS), which injects high-frequency dynamic priors, and an Inference-only Attention Sink, which anchors global structure. Extensive evaluations on VBench show that FLEX significantly outperforms state-of-the-art models at $6\times$ extrapolation (30s duration) and matches long-video fine-tuned baselines at $12\times$ (60s duration). As a plug-and-play augmentation, FLEX integrates seamlessly into existing inference pipelines for horizon extension, effectively pushing the generation limits of models such as LongLive and supporting consistent, dynamic video synthesis at the 4-minute scale. The project page is available at \href{https://ga-lee.github.io/FLEX_demo}{https://ga-lee.github.io/FLEX\_demo}.
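The frequency-aware idea described above can be sketched as per-component position scaling: RoPE components whose rotation period exceeds the training horizon are under-trained, so their positions are interpolated back into the trained range, while fast-rotating components are extrapolated unchanged. The sketch below is a minimal illustration of this general scheme; the function name, the wavelength threshold, and the hard interpolate/extrapolate split are assumptions for exposition, not the paper's exact FLEX modulation.

```python
import numpy as np

def frequency_aware_positions(positions, dim, train_len, target_len,
                              rope_base=10000.0):
    """Illustrative frequency-aware RoPE scaling (assumed form, not FLEX itself):
    interpolate low-frequency components whose period exceeds the training
    horizon; extrapolate high-frequency components unchanged."""
    # RoPE rotation frequencies theta_i = base^(-2i/d), one per rotated pair
    freqs = rope_base ** (-np.arange(0, dim, 2) / dim)      # shape (dim/2,)
    scale = train_len / target_len                          # < 1 when extending
    # Period (in frames) of each rotary component
    wavelengths = 2 * np.pi / freqs
    # Components slower than the training horizon are under-trained
    under_trained = wavelengths > train_len
    per_freq_scale = np.where(under_trained, scale, 1.0)    # interp vs. extrap
    # Effective rotation angle for every (position, frequency) pair
    angles = np.outer(positions, freqs) * per_freq_scale    # shape (T, dim/2)
    return np.cos(angles), np.sin(angles)

# Extend a 30-frame training horizon to 360 frames (12x, as in the abstract)
cos, sin = frequency_aware_positions(np.arange(360), dim=64,
                                     train_len=30, target_len=360)
```

Under this split, fast components keep full per-frame discriminability while slow components never rotate past the angles seen in training, which is one simple way to realize "interpolate low frequencies, extrapolate high frequencies."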