Transformer decoders have achieved strong results across tasks, but the memory required for the KV cache becomes prohibitive at long sequence lengths. Although cross-layer KV cache sharing (e.g., YOCO, CLA) offers a path to mitigating the KV cache bottleneck, it typically underperforms within-layer methods such as GQA. To understand the root cause, we investigate the information flow into the keys and values of the top layers. Our preliminary analysis reveals a clear pattern: values are predominantly derived from the bottom layer, while keys draw information from both the bottom and middle layers. Building on this, we propose FusedKV, in which the top-layer KV caches are a learnable fusion of the most informative caches from the bottom and middle layers. The fusion operates directly on post-RoPE keys, preserving relative positional information without the computational cost of re-applying rotary embeddings. To further improve efficiency, we propose FusedKV-Lite, a cross-layer sharing approach in which top-layer KV caches are derived directly from the bottom-layer values and the middle-layer keys. Compared to FusedKV, FusedKV-Lite reduces I/O overhead at the cost of a slight increase in perplexity. In experiments on LLMs ranging from 332M to 4B parameters, our proposed methods reduce KV cache memory by 50\% while achieving lower validation perplexity than the standard Transformer decoder, establishing them as memory-efficient, high-performance architectural alternatives.
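The fusion above might be sketched as a gated, per-dimension convex combination of the cached bottom- and middle-layer keys and values; the sigmoid gate, the shapes, and the function name below are illustrative assumptions, not the paper's exact parameterization.

```python
import numpy as np

def fuse_kv(k_bottom, k_middle, v_bottom, v_middle, gate_k, gate_v):
    """Hypothetical FusedKV-style fusion: the top layers reuse a learnable
    combination of bottom- and middle-layer KV caches instead of storing
    their own. Gates are learnable logits, squashed to (0, 1) by a sigmoid.
    """
    gk = 1.0 / (1.0 + np.exp(-gate_k))  # sigmoid -> per-dim weight in (0, 1)
    gv = 1.0 / (1.0 + np.exp(-gate_v))
    # The combination acts on post-RoPE keys, so relative positional
    # information is preserved without re-applying rotary embeddings.
    k_top = gk * k_bottom + (1.0 - gk) * k_middle
    v_top = gv * v_bottom + (1.0 - gv) * v_middle
    return k_top, v_top
```

FusedKV-Lite would correspond to the degenerate case where the gates are fixed rather than learned (keys taken from the middle layer, values from the bottom layer), trading a small perplexity increase for lower I/O.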