Autoregressive video synthesis offers a promising pathway for infinite-horizon generation but is fundamentally hindered by three intertwined challenges: semantic forgetting from context limitations, visual drift due to positional extrapolation, and controllability loss during interactive instruction switching. Current methods often tackle these issues in isolation, limiting long-term coherence. We introduce Grounded Forcing, a novel framework that bridges time-independent semantics and proximal dynamics through three interlocking mechanisms. First, to address semantic forgetting, we propose a Dual Memory KV Cache that decouples local temporal dynamics from global semantic anchors, ensuring long-term semantic coherence and identity stability. Second, to suppress visual drift, we design Dual-Reference RoPE Injection, which confines positional embeddings within the training manifold while rendering global semantics time-invariant. Third, to resolve controllability issues, we develop Asymmetric Proximity Recache, which facilitates smooth semantic inheritance during prompt transitions via proximity-weighted cache updates. These components operate synergistically to tether the generative process to stable semantic cores while accommodating flexible local dynamics. Extensive experiments demonstrate that Grounded Forcing significantly enhances long-range consistency and visual stability, establishing a robust foundation for interactive long-form video synthesis.
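To make the first two mechanisms concrete, the following is a minimal sketch of how a two-part cache with bounded position indices could be organized. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify how anchors are selected, how large each memory is, or how positions are assigned. The class name `DualMemoryKVCache`, the parameters `num_anchor` and `local_window`, and the rule "pin the first few entries as anchors" are all hypothetical.

```python
from collections import deque


class DualMemoryKVCache:
    """Toy two-part cache: pinned global anchors plus a sliding local window.

    Assumption (not from the paper): the first `num_anchor` entries are pinned
    as global anchors, and every later entry goes into a FIFO local window.
    """

    def __init__(self, num_anchor: int, local_window: int):
        self.num_anchor = num_anchor
        self.anchors = []                        # never evicted
        self.local = deque(maxlen=local_window)  # oldest entry evicted first

    def append(self, kv):
        """Store one key/value entry (any object stands in for real tensors)."""
        if len(self.anchors) < self.num_anchor:
            self.anchors.append(kv)
        else:
            self.local.append(kv)

    def context(self):
        """Entries visible to attention: anchors first, then recent entries."""
        return self.anchors + list(self.local)

    def positions(self):
        """Position indices assigned by slot in the cache, not by absolute time.

        Anchors always keep the same indices, and the largest index never
        exceeds num_anchor + local_window - 1, however long generation runs.
        """
        return list(range(len(self.anchors) + len(self.local)))


# Usage: ten generated steps, two anchors, a local window of three.
cache = DualMemoryKVCache(num_anchor=2, local_window=3)
for t in range(10):
    cache.append(f"kv{t}")

visible = cache.context()
pos = cache.positions()
```

In this toy setup the anchors stay visible and keep fixed position indices throughout generation, while the local window tracks only recent steps. This mirrors the abstract's description of time-invariant global semantics and of positional embeddings confined to a bounded range. The Asymmetric Proximity Recache is not sketched here because the abstract does not describe its update rule.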