We propose Stable Video Infinity (SVI), which generates infinite-length videos with high temporal consistency, plausible scene transitions, and controllable streaming storylines. While existing long-video methods attempt to mitigate accumulated errors via handcrafted anti-drifting techniques (e.g., modified noise schedulers, frame anchoring), they remain limited to single-prompt extrapolation, producing homogeneous scenes with repetitive motions. We identify that the fundamental challenge extends beyond error accumulation to a critical discrepancy between the training assumption (seeing clean data) and the test-time autoregressive reality (conditioning on self-generated, error-prone outputs). To bridge this gap, SVI introduces Error-Recycling Fine-Tuning, a new type of efficient training that recycles the Diffusion Transformer (DiT)'s self-generated errors into supervisory signals, thereby encouraging the DiT to actively identify and correct its own errors. This is achieved by injecting, collecting, and banking errors through closed-loop recycling, autoregressively learning from error-injected feedback. Specifically, we (i) inject historical errors made by the DiT to perturb clean inputs, simulating error-accumulated trajectories in flow matching; (ii) efficiently approximate predictions with one-step bidirectional integration and compute errors from residuals; (iii) dynamically bank errors into a replay memory across discretized timesteps, from which errors are resampled for new inputs. SVI scales videos from seconds to infinite durations at no additional inference cost, while remaining compatible with diverse conditioning streams (e.g., audio, skeleton, and text). We evaluate SVI on three benchmarks spanning consistent, creative, and conditional settings, thoroughly verifying its versatility and state-of-the-art performance.
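To make steps (i)-(iii) concrete, the following is a minimal PyTorch sketch of one error-recycling training step under a linear flow-matching interpolant. All names here (`ErrorBank`, `error_recycling_step`, the `model(x_t, t)` signature, bucket counts, and capacities) are illustrative assumptions, not the paper's implementation: the abstract specifies only error injection, one-step bidirectional integration with residual errors, and timestep-bucketed replay banking.

```python
# Hypothetical sketch of Error-Recycling Fine-Tuning; names and hyperparameters
# are assumptions, as the abstract does not specify implementation details.
import random
import torch

class ErrorBank:
    """Replay memory of self-generated errors, bucketed by discretized timestep."""
    def __init__(self, num_buckets=10, capacity_per_bucket=256):
        self.num_buckets = num_buckets
        self.capacity = capacity_per_bucket
        self.buckets = [[] for _ in range(num_buckets)]

    def _bucket(self, t):  # t is a float in [0, 1]
        return min(int(t * self.num_buckets), self.num_buckets - 1)

    def push(self, t, err):
        bucket = self.buckets[self._bucket(t)]
        if len(bucket) >= self.capacity:
            bucket.pop(random.randrange(len(bucket)))  # evict to keep the bank fresh
        bucket.append(err.detach().cpu())

    def sample(self, t, like):
        bucket = self.buckets[self._bucket(t)]
        if not bucket:
            return torch.zeros_like(like)  # no injection before errors accumulate
        return random.choice(bucket).to(like.device)

def error_recycling_step(model, bank, x0, x1, optimizer):
    """One closed-loop step: inject a banked error, train, then recycle residuals."""
    t = torch.rand(())                          # random flow-matching time in [0, 1]
    x_t = (1 - t) * x0 + t * x1                 # clean linear interpolant
    x_t_err = x_t + bank.sample(t.item(), x_t)  # (i) inject a historical error

    v_pred = model(x_t_err, t)                  # predicted velocity field
    v_target = x1 - x0                          # flow-matching velocity target
    loss = torch.mean((v_pred - v_target) ** 2)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    with torch.no_grad():
        # (ii) one-step bidirectional integration approximates both endpoints,
        # and residuals against the clean data become the new error signal.
        x1_hat = x_t_err + (1 - t) * v_pred     # integrate forward to t = 1
        x0_hat = x_t_err - t * v_pred           # integrate backward to t = 0
        bank.push(t.item(), x1_hat - x1)        # (iii) bank residuals for resampling
        bank.push(t.item(), x0_hat - x0)
    return loss.item()
```

Under these assumptions, the loop is closed because each step both consumes a previously banked error (simulating the error-accumulated inputs the model will face at test time) and deposits fresh residuals, so the training distribution tracks the model's own drift rather than staying on clean data.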