Generative large language models (LLMs) have opened up numerous novel possibilities, but their significant computational requirements make ubiquitous use challenging. Some of the most useful applications require processing large numbers of samples at a time and using long contexts, both of which significantly increase the memory communication load of the models. We introduce SparQ Attention, a technique for increasing the inference throughput of LLMs by reducing the memory bandwidth requirements within the attention blocks through selective fetching of the cached history. The technique can be applied directly to off-the-shelf LLMs during inference, without any modification to the pre-training setup or additional fine-tuning. Evaluating Llama 2 and Pythia models on a wide range of downstream tasks, we show that SparQ Attention can reduce the attention memory bandwidth requirements by up to a factor of eight without any loss in accuracy.
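To make "selective fetching of the cached history" concrete, the following is a minimal sketch for a single query and a single attention head. The abstract does not specify how the fetched entries are chosen, so the selection rule here is an assumption made for illustration: approximate scores are computed from only the `r` largest-magnitude query components, and the full key and value rows are then fetched for the `k` best-scoring positions. The function names and the scalar-count bandwidth proxy are hypothetical and are not taken from the paper.

```python
import math


def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dense_attention(q, K, V):
    # Standard attention for one query: reads every row of K and V.
    d = len(q)
    scores = [sum(qi * ki for qi, ki in zip(q, row)) / math.sqrt(d) for row in K]
    weights = softmax(scores)
    return [sum(w * v[c] for w, v in zip(weights, V)) for c in range(len(V[0]))]


def selective_fetch_attention(q, K, V, r, k):
    # Illustrative sketch only (assumed selection rule, see lead-in).
    d, n, d_v = len(q), len(K), len(V[0])

    # Step 1: approximate scores, reading only r of the d components per key.
    top_dims = sorted(range(d), key=lambda i: abs(q[i]), reverse=True)[:r]
    approx = [sum(q[i] * K[j][i] for i in top_dims) for j in range(n)]

    # Step 2: keep the k positions with the largest approximate score.
    top_pos = sorted(range(n), key=lambda j: approx[j], reverse=True)[:k]
    top_pos.sort()  # restore cache order

    # Step 3: fetch full rows only for the selected positions and attend.
    K_sel = [K[j] for j in top_pos]
    V_sel = [V[j] for j in top_pos]
    out = dense_attention(q, K_sel, V_sel)

    # Scalars read from the cache, used as a rough proxy for memory traffic.
    fetched = n * r + k * d + k * d_v
    return out, fetched


def dense_fetch_count(K, V):
    # Scalars read from the cache by dense attention.
    return len(K) * len(K[0]) + len(V) * len(V[0])
```

With `r = d` and `k = n` the sketch reduces to dense attention; smaller `r` and `k` trade approximation error for fewer cache reads. The scalar count ignores everything except cache traffic, so it should be read as a rough indicator rather than a measurement of throughput.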