Generative large language models (LLMs) have opened up numerous novel possibilities, but their significant computational requirements make ubiquitous use challenging. Some of the most useful applications require processing large numbers of samples at a time and using long contexts, both of which significantly increase the memory communication load of the models. We introduce SparQ Attention, a technique for increasing the inference throughput of LLMs by reducing the memory bandwidth requirements within the attention blocks through selective fetching of the cached history. Our proposed technique can be applied directly to off-the-shelf LLMs during inference, without requiring any modification to the pre-training setup or additional fine-tuning. By evaluating Llama 2 and Pythia models on a wide range of downstream tasks, we show that SparQ Attention can decrease the attention memory bandwidth requirements by up to eight times without any loss in accuracy.
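To make "selective fetching of the cached history" concrete, the following is a minimal NumPy sketch for a single query vector and a single attention head. It is an illustration under stated assumptions, not the exact SparQ Attention algorithm: the selection rule (rank cached positions using only the `r` largest-magnitude query components, then fetch the top `k` rows), the function names, and the parameters `r` and `k` are all hypothetical choices made for this example.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D array.
    x = x - np.max(x)
    e = np.exp(x)
    return e / e.sum()

def dense_attention(q, K, V):
    # Reference attention: reads every row of the cached K and V.
    scores = K @ q / np.sqrt(q.shape[0])
    return softmax(scores) @ V

def selective_attention(q, K, V, r, k):
    # Assumed selection rule (hypothetical, for illustration only).
    d = q.shape[0]
    # 1. Keep the r largest-magnitude query components and compute
    #    approximate scores from the matching r columns of K.
    #    The scores are only used for ranking, so they are left unscaled.
    cols = np.argsort(-np.abs(q))[:r]
    approx_scores = K[:, cols] @ q[cols]
    # 2. Pick the k cached positions with the highest approximate score.
    rows = np.argsort(-approx_scores)[:k]
    # 3. Fetch only those rows of K and V and run exact attention on them.
    scores = K[rows] @ q / np.sqrt(d)
    return softmax(scores) @ V[rows]

def transferred_elements(seq_len, d, r, k):
    # Cache elements read per query: dense reads all of K and V;
    # the sketch reads r columns of K, then k full rows of K and V.
    dense = 2 * seq_len * d
    sparse = seq_len * r + 2 * k * d
    return dense, sparse
```

With `r` equal to the head dimension and `k` equal to the sequence length, `selective_attention` reads the whole cache and reproduces `dense_attention`. Smaller values trade accuracy for fewer cache reads, and `transferred_elements` gives a rough count of that saving under this sketch's access pattern.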