Grouped-query attention (GQA) has been widely adopted in LLMs to mitigate the complexity of multi-head attention (MHA). To convert an MHA into a GQA, neighbouring query heads in the MHA are evenly split into groups, and each group shares a single key and value head. In this work, we propose AsymGQA, an activation-informed approach that asymmetrically groups an MHA into a GQA for better model performance. AsymGQA outperforms standard GQA within the same model size budget. For example, AsymGQA LLaMA-2-7B achieves a 7.5% accuracy increase on MMLU compared to neighbour grouping. Our approach addresses GQA's trade-off between model performance and hardware efficiency.
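To make the baseline conversion concrete, below is a minimal sketch of neighbour grouping: per-head key/value projection weights are mean-pooled within each group of adjacent heads, so every group ends up sharing one key and one value head. The weight layout `(num_heads, head_dim, d_model)` and the function name are assumptions for illustration, not the paper's implementation; AsymGQA would differ by assigning heads to groups based on activation similarity rather than adjacency.

```python
import numpy as np

def mha_to_gqa_neighbour(Wk, Wv, num_heads, num_groups):
    """Neighbour-grouping conversion from MHA to GQA (illustrative sketch).

    Wk, Wv: per-head K/V projection weights, shape (num_heads, head_dim, d_model).
    Returns grouped weights of shape (num_groups, head_dim, d_model), where each
    group's shared K/V head is the mean of its constituent heads.
    """
    assert num_heads % num_groups == 0, "heads must divide evenly into groups"
    heads_per_group = num_heads // num_groups
    # Neighbouring heads are contiguous, so a reshape puts each group's
    # heads on one axis; averaging that axis merges them into one head.
    Wk_g = Wk.reshape(num_groups, heads_per_group, *Wk.shape[1:]).mean(axis=1)
    Wv_g = Wv.reshape(num_groups, heads_per_group, *Wv.shape[1:]).mean(axis=1)
    return Wk_g, Wv_g

# Example: merge 8 MHA heads into 2 GQA groups of 4 neighbouring heads each.
rng = np.random.default_rng(0)
Wk = rng.standard_normal((8, 4, 16))
Wv = rng.standard_normal((8, 4, 16))
Wk_g, Wv_g = mha_to_gqa_neighbour(Wk, Wv, num_heads=8, num_groups=2)
```

An asymmetric variant would replace the fixed `reshape` with a learned or activation-driven assignment of heads to groups, which is the degree of freedom AsymGQA exploits.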