Large Language Models (LLMs) with hundreds of billions of parameters have transformed the field of machine learning. However, serving these models at inference time is both compute and memory intensive: a single request can require multiple GPUs and tens of gigabytes of memory. Multi-Head Attention is one of the key components of LLMs and can account for over 50% of an LLM's memory and compute requirements. We observe a high degree of redundancy across heads in which tokens they attend to. Based on this insight, we propose Clustered Head Attention (CHAI). CHAI combines highly correlated heads for self-attention at runtime, thus reducing both memory and compute. In our experiments, we show that CHAI reduces the memory required for storing the K,V cache by up to 21.4% and inference-time latency by up to 1.73x, without any fine-tuning. CHAI achieves this with a maximum deviation in accuracy of 3.2% across 3 different models (i.e., OPT-66B, LLAMA-7B, LLAMA-33B) and 5 different evaluation datasets.
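The idea of sharing attention across correlated heads can be illustrated with a small NumPy sketch. This is not the CHAI implementation: the greedy threshold grouping, the `threshold` value, the function names, and the choice to keep each head's own values while reusing a representative head's attention scores are all assumptions made for illustration, and causal masking is omitted for brevity.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_scores(q, k):
    # q, k: (heads, tokens, dim) -> scores: (heads, tokens, tokens)
    d = q.shape[-1]
    return softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d), axis=-1)

def cluster_heads(scores, threshold=0.9):
    # Hypothetical greedy grouping (an assumption, not the paper's procedure):
    # a head joins the first representative whose flattened attention map
    # has Pearson correlation >= threshold with its own; otherwise it
    # becomes a new representative.
    flat = scores.reshape(scores.shape[0], -1)
    corr = np.corrcoef(flat)
    reps = []
    assignment = np.empty(len(flat), dtype=int)
    for h in range(len(flat)):
        for r in reps:
            if corr[h, r] >= threshold:
                assignment[h] = r
                break
        else:
            reps.append(h)
            assignment[h] = h
    return assignment, reps

def clustered_attention(q, k, v, assignment, reps):
    # Compute scores only for representative heads, so the queries and keys
    # of the remaining heads are never used. Each head still applies the
    # shared scores to its own values (an assumption of this sketch).
    rep_scores = attention_scores(q[reps], k[reps])
    position = {r: i for i, r in enumerate(reps)}
    idx = np.array([position[r] for r in assignment])
    return rep_scores[idx] @ v

# Toy input: heads 0 and 1 are identical, head 2 has negated queries.
rng = np.random.default_rng(0)
q0 = rng.normal(size=(6, 8))
k0 = rng.normal(size=(6, 8))
q = np.stack([q0, q0, -q0])
k = np.stack([k0, k0, k0])
v = rng.normal(size=(3, 6, 8))

full_scores = attention_scores(q, k)
assignment, reps = cluster_heads(full_scores)
output = clustered_attention(q, k, v, assignment, reps)
```

In this toy setup the grouping is computed from the full attention maps, which a real system would avoid doing on every step; the sketch only shows where the savings come from once heads have been grouped, namely that non-representative heads no longer need their own queries, keys, or score computation.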