Autoregressive long video generation often adopts bounded-memory streaming for efficiency, typically combining local windows for short-term continuity with static early-frame sinks as long-range anchors. However, this fixed allocation keeps early frames cached even when the current visual state has substantially diverged from them, while discarding potentially more relevant intermediate history. As a result, the retained long-range context may become less adaptive and bias generation toward outdated cues; in severe cases, RoPE-induced phase re-alignment can homogenize inter-head attention and cause sink collapse, where content regresses toward sink frames. We propose DySink, a retrieval-based framework that maintains a compact memory bank and selects visually relevant historical frames as dynamic frame sinks. DySink couples adaptive retrieval with a sink anomaly gate, which detects excessive inter-head consensus over retrieved context and suppresses collapse-prone context. Experiments on minute-long videos show that DySink consistently improves dynamic degree over strong baselines while also achieving higher temporal quality. The code and model weights will be released at https://github.com/yebo0216best/DySink.
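The two mechanisms described above, retrieval of dynamic frame sinks and the sink anomaly gate, can be illustrated with a toy sketch. This is not the DySink implementation: the abstract does not specify the similarity measure, the consensus statistic, or the suppression rule, so cosine similarity, mean pairwise inter-head similarity, a hard threshold of 0.95, and every function name below (`retrieve_dynamic_sinks`, `inter_head_consensus`, `gate_sinks`) are assumptions chosen only for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def retrieve_dynamic_sinks(memory_bank, query, k):
    """Pick the k stored frame features most similar to the current query.

    Assumption: relevance is cosine similarity between per-frame feature
    vectors. Indices are returned in temporal order.
    """
    ranked = sorted(
        range(len(memory_bank)),
        key=lambda i: cosine(memory_bank[i], query),
        reverse=True,
    )
    return sorted(ranked[:k])


def inter_head_consensus(head_attn):
    """Mean pairwise cosine similarity between per-head attention rows.

    Assumption: each row is one head's attention distribution over the
    retrieved frames; values near 1.0 mean the heads attend identically.
    """
    n = len(head_attn)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    if not pairs:
        return 0.0
    return sum(cosine(head_attn[i], head_attn[j]) for i, j in pairs) / len(pairs)


def gate_sinks(sink_ids, head_attn, threshold=0.95):
    """Hypothetical gate: drop the retrieved context when consensus is excessive."""
    if inter_head_consensus(head_attn) > threshold:
        return []
    return sink_ids


# Toy memory bank of four 2-D frame features and a query for the current state.
memory_bank = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [0.9, 0.1]]
query = [1.0, 0.2]
sinks = retrieve_dynamic_sinks(memory_bank, query, k=2)

diverse_heads = [[0.9, 0.1], [0.1, 0.9]]    # heads disagree: context is kept
collapsed_heads = [[0.5, 0.5], [0.5, 0.5]]  # heads agree: context is suppressed

kept = gate_sinks(sinks, diverse_heads)
suppressed = gate_sinks(sinks, collapsed_heads)
```

In this toy setup, frames 0 and 3 are retrieved because they lie closest to the query direction. The gate passes them through when the two heads attend differently and returns an empty list when the heads agree exactly. A practical system would more plausibly down-weight collapse-prone context than drop it outright; the hard threshold is used here only to keep the sketch short.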