Autoregressive image generation models such as Janus-Pro produce high-quality images, but at the cost of high memory usage and ever-growing computational demands stemming from the large number of visual tokens. While KV cache compression has been extensively studied for language modeling, it remains largely unexplored in the image generation domain. In this work, we begin by identifying a distinct and prominent attention phenomenon, which we term the spatial locality and emergent semantic sink. Building on this key insight, we introduce a novel KV cache compression framework. Specifically, we compress the KV cache of all visual tokens by adaptively decoupling attention heads into two types: for spatial-locality heads, our method maintains a short window of recent tokens; for semantic-sink heads, it strategically preserves a compact set of highly attended tokens. Extensive experiments demonstrate that the proposed method achieves a 5$\times$ reduction in memory usage and a 6.6$\times$ speedup in overall throughput with only minimal loss of visual quality, enabling efficient native autoregressive image generation on resource-constrained hardware.
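To make the head-adaptive compression concrete, below is a minimal PyTorch sketch of the two-type scheme the abstract describes. All names (`classify_heads`, `compress_kv`), the thresholding heuristic, and the budget parameters (`window`, `sink_budget`) are illustrative assumptions, not the paper's exact procedure.

```python
import torch

def classify_heads(attn: torch.Tensor, recent: int = 64, thresh: float = 0.7) -> torch.Tensor:
    """attn: [heads, q_len, kv_len] attention weights.
    Heuristic (assumed, not from the paper): a head is 'spatial-locality'
    if most of its attention mass falls on the most recent `recent` keys;
    otherwise treat it as a 'semantic-sink' head."""
    recent_mass = attn[..., -recent:].sum(-1).mean(-1)  # [heads]
    return recent_mass > thresh                          # True = locality head

def compress_kv(k: torch.Tensor, v: torch.Tensor, attn: torch.Tensor,
                window: int = 64, sink_budget: int = 32):
    """k, v: [heads, kv_len, dim]. Returns per-head compressed K/V lists
    (lists, since heads may keep different numbers of tokens)."""
    is_local = classify_heads(attn, recent=window)
    out_k, out_v = [], []
    for h in range(k.shape[0]):
        if is_local[h]:
            # Spatial-locality head: keep only a short recent-token window.
            out_k.append(k[h, -window:])
            out_v.append(v[h, -window:])
        else:
            # Semantic-sink head: keep a compact set of highly attended tokens.
            scores = attn[h].mean(0)                     # [kv_len] mean attention received
            idx = scores.topk(sink_budget).indices.sort().values
            out_k.append(k[h, idx])
            out_v.append(v[h, idx])
    return out_k, out_v
```

Under this sketch, a locality head's cache is bounded by `window` and a sink head's by `sink_budget`, so total KV memory stops growing with the number of generated visual tokens; the actual classification rule and budgets in the paper may differ.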