Interactive real-time autoregressive video generation is essential for applications such as content creation and world modeling, where visual content must adapt to dynamically evolving event conditions. A fundamental challenge lies in balancing reactivity and stability: models must respond promptly to new events while maintaining temporal coherence over long horizons. Existing approaches distill bidirectional models into autoregressive generators and further adapt them via streaming long tuning, yet they often exhibit persistent drift after condition changes. We attribute this drift to conditional bias: the teacher may provide condition-aligned but trajectory-agnostic guidance, biasing generation toward locally valid yet globally inconsistent modes. Inspired by Trust Region Policy Optimization, we propose Delta Forcing, a simple yet effective framework that constrains unreliable teacher supervision within an adaptive trust region. Specifically, Delta Forcing estimates transition consistency from the latent delta between teacher and generator trajectories, and uses this estimate to balance teacher supervision with a monotonic continuity objective. This suppresses unreliable teacher-induced shifts while preserving responsiveness to new events. Extensive experiments demonstrate that Delta Forcing significantly improves consistency while maintaining event reactivity.
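
The balancing idea described above can be illustrated with a toy sketch. This is not the paper's formulation: the abstract does not specify how transition consistency is computed or what the monotonic continuity objective looks like. The sketch assumes, purely for illustration, that consistency is the cosine similarity between the teacher's and the generator's latent deltas, that both supervision terms are mean squared errors, and that the continuity term simply penalizes change from the previous latent. All function names (`latent_delta`, `trust_weight`, `blended_loss`) are hypothetical.

```python
import math


def latent_delta(current, previous):
    """Element-wise change between two latent vectors (assumed representation)."""
    return [c - p for c, p in zip(current, previous)]


def trust_weight(teacher_delta, generator_delta, eps=1e-8):
    """Map the cosine similarity of two deltas to a weight in [0, 1].

    Assumption: aligned deltas mean the teacher's guidance is consistent
    with the generator's trajectory and can be trusted more.
    """
    dot = sum(t * g for t, g in zip(teacher_delta, generator_delta))
    norm_t = math.sqrt(sum(t * t for t in teacher_delta))
    norm_g = math.sqrt(sum(g * g for g in generator_delta))
    cosine = dot / (norm_t * norm_g + eps)
    return min(1.0, max(0.0, 0.5 * (1.0 + cosine)))


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def blended_loss(prev_latent, gen_latent, teacher_latent):
    """Blend a teacher term and a placeholder continuity term by the trust weight."""
    w = trust_weight(
        latent_delta(teacher_latent, prev_latent),
        latent_delta(gen_latent, prev_latent),
    )
    teacher_term = mse(gen_latent, teacher_latent)    # follow the teacher
    continuity_term = mse(gen_latent, prev_latent)    # stay near the previous latent
    return w * teacher_term + (1.0 - w) * continuity_term, w


if __name__ == "__main__":
    prev = [0.0, 0.0]
    gen = [1.0, 0.0]
    # Teacher moves in the same direction as the generator: weight near 1.
    print(blended_loss(prev, gen, [2.0, 0.0]))
    # Teacher moves in the opposite direction: weight near 0.
    print(blended_loss(prev, gen, [-1.0, 0.0]))
```

In this toy setting, a teacher whose delta opposes the generator's contributes almost nothing to the loss, so the continuity term dominates. That mirrors, in a very simplified form, the stated goal of suppressing unreliable teacher-induced shifts while still following the teacher when its guidance agrees with the current trajectory.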