Key-value (KV) caching plays an essential role in accelerating decoding for transformer-based autoregressive large language models (LLMs). However, the amount of memory required to store the KV cache can become prohibitive at long sequence lengths and large batch sizes. Since the invention of the transformer, two of the most effective interventions discovered for reducing the size of the KV cache have been Multi-Query Attention (MQA) and its generalization, Grouped-Query Attention (GQA). MQA and GQA both modify the design of the attention block so that multiple query heads can share a single key/value head, reducing the number of distinct key/value heads by a large factor while only minimally degrading accuracy. In this paper, we show that it is possible to take Multi-Query Attention a step further by also sharing key and value heads between adjacent layers, yielding a new attention design we call Cross-Layer Attention (CLA). With CLA, we find that it is possible to reduce the size of the KV cache by another 2x while maintaining nearly the same accuracy as unmodified MQA. In experiments training 1B- and 3B-parameter models from scratch, we demonstrate that CLA provides a Pareto improvement over the memory/accuracy tradeoffs which are possible with traditional MQA, enabling inference with longer sequence lengths and larger batch sizes than would otherwise be possible.