Interactive real-time autoregressive video generation is essential for applications such as content creation and world modeling, where visual content must adapt to dynamically evolving event conditions. A fundamental challenge lies in balancing reactivity and stability: models must respond promptly to new events while maintaining temporal coherence over long horizons. Existing approaches distill bidirectional models into autoregressive generators and further adapt them via streaming long tuning, yet they often exhibit persistent drift after condition changes. We identify the cause as conditional bias: the teacher may provide condition-aligned but trajectory-agnostic guidance, biasing generation toward locally valid yet globally inconsistent modes. Inspired by Trust Region Policy Optimization, we propose Delta Forcing, a simple yet effective framework that constrains unreliable teacher supervision within an adaptive trust region. Specifically, Delta Forcing estimates transition consistency from the latent delta between teacher and generator trajectories, and uses it to balance teacher supervision with a monotonic continuity objective. This suppresses unreliable teacher-induced shifts while preserving responsiveness to new events. Extensive experiments demonstrate that Delta Forcing significantly improves consistency while maintaining event reactivity.
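As an illustration of the balancing idea described in the abstract, the following is a minimal sketch rather than the authors' formulation. The abstract gives no equations, so everything here is an assumption: the function names `trust_weight` and `blended_loss`, the temperature `tau`, the exponential mapping from latent delta to weight, and the use of a plain mean-squared distance to the previous latent as a stand-in for the monotonic continuity objective.

```python
import math
from typing import Sequence


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared difference between two equal-length latent vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def trust_weight(teacher_latent: Sequence[float],
                 generator_latent: Sequence[float],
                 tau: float = 1.0) -> float:
    """Hypothetical consistency score in (0, 1].

    Assumption: a small teacher-generator latent delta means the teacher
    target is consistent with the current trajectory, so it is trusted more.
    """
    return math.exp(-mse(teacher_latent, generator_latent) / tau)


def blended_loss(generator_latent: Sequence[float],
                 teacher_latent: Sequence[float],
                 previous_latent: Sequence[float],
                 tau: float = 1.0) -> float:
    """Blend teacher supervision with a continuity term (illustrative only)."""
    w = trust_weight(teacher_latent, generator_latent, tau)
    teacher_term = mse(generator_latent, teacher_latent)
    # Stand-in for the continuity objective: stay close to the previous latent.
    continuity_term = mse(generator_latent, previous_latent)
    return w * teacher_term + (1.0 - w) * continuity_term
```

Under these assumptions, a teacher target far from the generator's current latent receives a weight near zero, so the continuity term dominates; a nearby target receives a weight near one, so teacher supervision dominates.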