Autoregressive video diffusion models have progressed rapidly, yet long-video generation remains bottlenecked by linear KV-cache growth, temporal repetition, and compounding errors. To address these challenges, we present PackForcing, a unified framework that manages the generation history through a three-partition KV-cache strategy. Specifically, we divide the historical context into three types: (1) Sink tokens, which preserve early anchor frames at full resolution to maintain global semantics; (2) Mid tokens, which are compressed spatiotemporally (32x token reduction) by a dual-branch network that fuses progressive 3D convolutions with low-resolution VAE re-encoding; and (3) Recent tokens, which are kept at full resolution to ensure local temporal coherence. To strictly bound the memory footprint without sacrificing quality, we introduce a dynamic top-$k$ context selection mechanism for the mid tokens, coupled with a continuous Temporal RoPE Adjustment that re-aligns the position gaps left by dropped tokens at negligible overhead. With this hierarchical context compression, PackForcing generates coherent 2-minute, 832x480 videos at 16 FPS on a single H200 GPU. It keeps the KV cache bounded at just 4 GB and achieves 24x temporal extrapolation (5s to 120s), operating either zero-shot or after training on only 5-second clips. Extensive results on VBench demonstrate state-of-the-art temporal consistency (26.07) and dynamic degree (56.25), indicating that short-video supervision is sufficient for high-quality long-video synthesis. https://github.com/ShandaAI/PackForcing
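The interplay between the three cache partitions, top-$k$ selection, and position re-alignment can be illustrated with a minimal sketch. This is not the PackForcing implementation: the function name `select_context`, the per-block `scores`, and the block counts are hypothetical, the compression of mid tokens is omitted, and the contiguous integer re-indexing is only a simplified stand-in for the continuous Temporal RoPE Adjustment described above.

```python
from typing import List, Sequence, Tuple


def select_context(
    scores: Sequence[float],
    num_sink: int,
    num_recent: int,
    top_k: int,
) -> Tuple[List[int], List[int]]:
    """Choose which history blocks to keep and assign gap-free positions.

    scores[i] is a hypothetical relevance score for history block i
    (oldest first). Returns (kept block indices, contiguous positions).
    """
    n = len(scores)

    # Sink partition: the earliest blocks are always kept.
    sink = list(range(min(num_sink, n)))

    # Recent partition: the latest blocks are always kept.
    recent_start = max(len(sink), n - num_recent)
    recent = list(range(recent_start, n))

    # Mid partition: everything between sink and recent.
    mid = list(range(len(sink), recent_start))

    # Keep the top_k highest-scoring mid blocks, then restore temporal order.
    ranked = sorted(mid, key=lambda i: scores[i], reverse=True)
    chosen_mid = sorted(ranked[:top_k])

    kept = sink + chosen_mid + recent

    # Re-index positions contiguously so dropped blocks leave no gaps
    # (assumed simplification of the temporal position re-alignment).
    positions = list(range(len(kept)))
    return kept, positions


if __name__ == "__main__":
    example_scores = [0.1, 0.9, 0.2, 0.8, 0.3, 0.7, 0.4, 0.5]
    kept, positions = select_context(example_scores, num_sink=1, num_recent=2, top_k=2)
    print(kept, positions)
```

Under these assumptions the number of kept blocks never exceeds `num_sink + top_k + num_recent`, however long the history grows, which is the sense in which the cache is bounded.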