Recent approaches to real-time long video generation typically employ streaming tuning strategies that train a long-context student with a short-context (memoryless) teacher. In these frameworks, the student performs long rollouts but receives supervision from a teacher limited to short 5-second windows. This structural discrepancy creates a critical \textbf{student-teacher mismatch}: because the teacher cannot access long-term history, it cannot guide the student on global temporal dependencies, effectively capping the student's usable context length. To resolve this, we propose \textbf{Context Forcing}, a novel framework that trains a long-context student via a long-context teacher. By ensuring the teacher is aware of the full generation history, we eliminate the supervision mismatch, enabling robust training of models capable of long-term consistency. To make this computationally feasible at extreme durations (e.g., 2 minutes), we introduce a context management system that transforms the linearly growing context into a \textbf{Slow-Fast Memory} architecture, significantly reducing visual redundancy. Extensive experiments demonstrate that our method achieves effective context lengths exceeding 20 seconds -- 2 to 10 times longer than state-of-the-art methods such as LongLive and Infinite-RoPE. By leveraging this extended context, Context Forcing preserves superior consistency across long durations, surpassing state-of-the-art baselines on various long video evaluation metrics.
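To make the Slow-Fast Memory idea concrete, the following is a minimal toy sketch of a bounded two-tier buffer: recent frames are kept at full rate in a "fast" queue, while frames evicted from it are subsampled into a "slow" queue, so the retained context stays constant-size instead of growing linearly. The class name, capacities, and stride-based subsampling policy here are illustrative assumptions, not the paper's actual context management system (which operates on model context, not raw frames).

```python
from collections import deque

class SlowFastMemory:
    """Toy two-tier memory: full-rate recent history plus subsampled
    long-term history. Illustrative sketch only; names and the
    stride-based eviction policy are assumptions, not the paper's method."""

    def __init__(self, fast_capacity=8, slow_stride=4, slow_capacity=16):
        self.fast = deque(maxlen=fast_capacity)   # recent frames, full rate
        self.slow = deque(maxlen=slow_capacity)   # distant history, subsampled
        self.slow_stride = slow_stride
        self._evicted = 0                          # count of frames evicted from fast

    def append(self, frame):
        # When the fast queue is full, the oldest frame is evicted; only
        # every slow_stride-th evicted frame is promoted to slow memory,
        # so total context size is bounded by fast + slow capacity.
        if len(self.fast) == self.fast.maxlen:
            oldest = self.fast[0]
            if self._evicted % self.slow_stride == 0:
                self.slow.append(oldest)
            self._evicted += 1
        self.fast.append(frame)

    def context(self):
        # Slow (distant, sparse) history followed by fast (recent, dense) history.
        return list(self.slow) + list(self.fast)
```

After appending 40 frames (indices 0-39), the fast queue holds frames 32-39 and the slow queue holds the subsampled frames 0, 4, 8, ..., 28, so the total context stays at 16 entries regardless of how long generation runs.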