Chunk-wise autoregressive video diffusion models rely on a KV cache of previously generated chunks to avoid redundant computation, but this cache quickly becomes a memory bottleneck as videos grow longer. Methods that quantize the KV cache to low bitwidths reduce memory pressure but degrade video quality. We show that a key driver of this degradation is a systematic bias in attention weights: due to the convexity of the exponential in softmax attention, quantization noise inflates the contribution of cached keys, a phenomenon we call the Jensen bias. This effect causes quantized keys to steal attention mass from the unquantized current chunk. We derive a per-attention-score correction that removes this bias in expectation, computed on the fly from the quantization step sizes of the cached keys and the query norm. Using a second-order Taylor approximation, the additional computational overhead is negligible, and no additional memory is needed alongside the cache. Evaluated on MAGI-1, SkyReels-V2, and HY-WorldPlay at INT2 quantization, our correction recovers most of the quality lost to aggressive quantization, reaching near-BF16 video quality, and can outperform INT4 quantization while using 50% less memory.
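The Jensen bias and its second-order correction can be illustrated numerically. The sketch below is not the paper's exact formula, which the abstract does not give. It assumes an additive noise model in which each cached key coordinate is perturbed by independent uniform noise in [-step/2, step/2], with one quantization step size shared across the key's coordinates. Under that assumption the logit noise has variance scale^2 * ||q||^2 * step^2 / 12, so subtracting half of that variance from the logit cancels the inflation of exp(logit) to second order. The names `jensen_correction`, `step`, and `scale` are hypothetical.

```python
import numpy as np


def jensen_correction(q, step, scale):
    """Second-order estimate of log E[exp(scale * q.eps)].

    Assumes eps_i ~ Uniform(-step/2, step/2), independent across coordinates
    (an illustrative noise model, not necessarily the one used in the paper).
    """
    # Var(scale * q.eps) = scale^2 * ||q||^2 * step^2 / 12
    logit_noise_var = (scale ** 2) * np.dot(q, q) * (step ** 2) / 12.0
    # E[exp(X)] ~= 1 + Var(X)/2 for zero-mean X, so log E[exp(X)] ~= Var(X)/2
    return 0.5 * logit_noise_var


rng = np.random.default_rng(0)
d = 64                      # head dimension (illustrative)
scale = 1.0 / np.sqrt(d)    # standard attention scaling
q = rng.normal(size=d)      # one query vector
k = rng.normal(size=d)      # one full-precision key vector
step = 0.5                  # hypothetical quantization step size for this key

# Simulate many independent draws of quantization noise on the key.
n_trials = 50_000
eps = rng.uniform(-step / 2, step / 2, size=(n_trials, d))
clean_logit = scale * np.dot(k, q)
noisy_logits = scale * ((k + eps) @ q)

# Average inflation of the unnormalized attention weight exp(logit).
inflation = np.mean(np.exp(noisy_logits - clean_logit))

# Subtracting the correction from each noisy logit removes the inflation.
correction = jensen_correction(q, step, scale)
residual = np.mean(np.exp(noisy_logits - correction - clean_logit))

print(f"mean inflation without correction: {inflation:.4f}")
print(f"mean inflation with correction:    {residual:.4f}")
```

Because the correction depends only on the query norm and the key's step size, it can be computed when the logits are formed, without storing anything beyond what the quantized cache already holds. That is consistent with the no-extra-memory claim above.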