The computational cost of large language model (LLM) inference remains a significant obstacle to widespread deployment. Many applications need to support long input sequences and process them in large batches, which typically causes token generation to be bottlenecked by data transfer rather than compute. To address this, we introduce SparQ Attention, a technique for increasing the inference throughput of LLMs by utilising memory bandwidth more efficiently within the attention layers, through selective fetching of the cached history. Our proposed technique can be applied directly to off-the-shelf LLMs during inference, without requiring any modification to the pre-training setup or additional fine-tuning. By evaluating Llama 2 and 3, Mistral, Gemma and Pythia models on a wide range of downstream tasks, we show that SparQ Attention brings up to 8x savings in attention data transfers without substantial drops in accuracy.
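The selective-fetching idea can be illustrated with a minimal single-query sketch: approximate the attention scores cheaply using only a few components of the query, then transfer and attend over only the most promising cached keys and values. This is an illustrative assumption about the mechanism, not the paper's exact algorithm; the function name and the parameters `r` and `k` are hypothetical.

```python
import numpy as np

def sparq_attention(q, K, V, r=8, k=16):
    """Approximate single-query attention while fetching only part of the KV cache.

    q: (d,) query vector; K, V: (S, d) cached keys and values.
    Sketch (hypothetical, for illustration):
      1. Score all keys using only the r largest-magnitude components of q,
         so only r columns of K need to be transferred.
      2. Run exact attention over the k keys with the highest approximate
         scores, transferring just those k rows of K and V.
    """
    d = q.shape[-1]
    # Step 1: indices of the r largest-|q_i| query components.
    idx = np.argsort(-np.abs(q))[:r]
    # Cheap approximate scores from an r-column slice of the key cache.
    approx_scores = K[:, idx] @ q[idx] / np.sqrt(d)
    # Step 2: gather the k most promising cache entries and attend exactly.
    top = np.argsort(-approx_scores)[:k]
    scores = K[top] @ q / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V[top]
```

With `r = d` and `k = S` the sketch transfers the whole cache and reduces to dense softmax attention; shrinking `r` and `k` trades a little accuracy for proportionally less data movement, which is the bandwidth saving the abstract refers to.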