Steady-Forcing: Balancing Spatial Persistence and Motion Continuity in Long-Horizon Nature Video Diffusion

Autoregressive video diffusion models enable streaming generation but often degrade over long rollouts: static scene layouts drift, while mechanisms that improve spatial stability tend to suppress motion, causing natural flows such as water, fire, or smoke to stagnate. We study this stability-motion trade-off in fixed-camera long-horizon nature video generation, where the two failure modes can be separated more clearly than in moving-camera settings. We propose Steady-Forcing, a memory and training framework combining a persistent visual anchor (V-Sink), an exponential moving-average motion memory (EMA-Sink), block-relative temporal encoding, periodic cache purification, and distillation from a Wan2.1-14B teacher with motion-rewarded priors under task-focused configurations. Together, these components are designed to preserve background identity while sustaining visually plausible fluid dynamics over multi-minute autoregressive rollouts. Evaluations against seven baselines show that Steady-Forcing improves long-horizon background consistency and imaging quality, while a blind user study indicates stronger perceived stability and motion continuity. The benchmark evaluation further suggests that generic VBench aggregate scores under-penalize fixed-camera artifacts: they reward drift-induced optical flow as Dynamic Degree while not directly penalizing texture hardening or flow stagnation. This motivates future task-specific benchmarks for static-camera nature-flow evaluation. Project page: https://minar09.github.io/steadyforcing/
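
To make the idea of an exponential moving-average memory concrete, the following is a minimal sketch of a generic EMA update over per-block feature vectors. It is not the EMA-Sink implementation, whose details are not given here; the function name `ema_update`, the `decay` value, and the toy two-dimensional features are all assumptions chosen for illustration.

```python
from typing import List, Optional, Sequence


def ema_update(
    memory: Optional[List[float]],
    new_feature: Sequence[float],
    decay: float = 0.9,  # hypothetical value, not taken from the paper
) -> List[float]:
    """Blend a new per-block feature vector into a running EMA memory."""
    if memory is None:
        # The first block initializes the memory directly.
        return list(new_feature)
    # Older content is down-weighted by `decay`; the new block gets (1 - decay).
    return [decay * m + (1.0 - decay) * x for m, x in zip(memory, new_feature)]


# Toy rollout over three generated blocks (illustrative features only).
memory: Optional[List[float]] = None
for block_feature in ([1.0, 0.0], [0.0, 1.0], [0.0, 1.0]):
    memory = ema_update(memory, block_feature, decay=0.5)

print(memory)
```

In a sketch like this, a larger `decay` keeps the memory closer to older blocks, while a smaller one tracks recent blocks more closely.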
