Train Short, Inference Long: Training-free Horizon Extension for Autoregressive Video Generation

Autoregressive video diffusion models have emerged as a scalable paradigm for long video generation. However, they often suffer from severe extrapolation failure: when generation extends beyond the training horizon, errors accumulate rapidly and temporal quality degrades significantly. We identify that this failure stems primarily from the spectral bias of 3D positional embeddings and the lack of dynamic priors in noise sampling. To address these issues, we propose FLEX (Frequency-aware Length EXtension), a training-free inference-time framework that bridges the gap between short-term training and long-term inference. FLEX introduces Frequency-aware RoPE Modulation, which adaptively interpolates under-trained low-frequency components while extrapolating high-frequency ones to preserve multi-scale temporal discriminability. This modulation is combined with Antiphase Noise Sampling (ANS), which injects high-frequency dynamic priors, and an Inference-only Attention Sink, which anchors global structure. Extensive evaluations on VBench demonstrate that FLEX significantly outperforms state-of-the-art models at 6x extrapolation (30s duration) and matches the performance of long-video fine-tuned baselines at 12x scale (60s duration). As a plug-and-play augmentation, FLEX integrates seamlessly into existing inference pipelines for horizon extension. It effectively pushes the generation limits of models such as LongLive, supporting consistent and dynamic video synthesis at a 4-minute scale. The project page is available at https://ga-lee.github.io/FLEX_demo.
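
To make the idea of frequency-dependent RoPE scaling concrete, the following is a minimal sketch and not the FLEX implementation. It assumes a single temporal axis with standard RoPE frequencies and blends interpolation and extrapolation with a linear ramp over the number of rotations each frequency completes within the training horizon. The function names, the ramp, and the thresholds `low` and `high` are hypothetical choices made for illustration; the abstract does not specify the actual modulation rule.

```python
import numpy as np

def rope_frequencies(dim, base=10000.0):
    # Standard RoPE inverse frequencies: base^(-2i/dim) for i = 0 .. dim/2 - 1.
    return base ** (-np.arange(0, dim, 2) / dim)

def frequency_aware_scale(freqs, train_len, scale, low=0.5, high=2.0):
    """Hypothetical per-frequency blend (an assumption, not the paper's rule).

    freqs:     RoPE frequencies for the temporal axis
    train_len: number of temporal positions seen during training
    scale:     target length divided by training length (e.g. 6 for 6x)
    low, high: illustrative rotation-count thresholds for the ramp
    """
    # Full rotations each frequency completes within the training horizon.
    rotations = train_len * freqs / (2.0 * np.pi)
    # ramp = 0: low frequency, under-trained, so interpolate (divide by scale).
    # ramp = 1: high frequency, well covered, so extrapolate (leave unchanged).
    ramp = np.clip((rotations - low) / (high - low), 0.0, 1.0)
    return freqs * (ramp + (1.0 - ramp) / scale)

def rope_angles(positions, freqs):
    # Rotation angle for every (position, frequency) pair.
    return np.outer(positions, freqs)

# Usage: a 6x horizon extension with illustrative sizes.
freqs = rope_frequencies(dim=64)
modulated = frequency_aware_scale(freqs, train_len=21, scale=6.0)
angles = rope_angles(np.arange(21 * 6), modulated)
```

In this sketch the highest frequencies keep their original values, so neighbouring frames remain distinguishable, while the lowest frequencies are compressed so that their phases stay within the range covered during training.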
