The Transformer architecture has revolutionized deep learning through its Self-Attention mechanism, which effectively captures contextual information. However, the memory footprint of Self-Attention poses significant challenges for long-sequence tasks. Grouped Query Attention (GQA) addresses this issue by grouping queries and mean-pooling the corresponding key-value heads, reducing the overall parameter count and memory requirements in a flexible manner without compromising model accuracy. In this work, we introduce enhancements to GQA, focusing on two novel approaches that depart from its static grouping: Key-Distributed GQA (KDGQA) and Dynamic Key-Distributed GQA (DGQA), both of which leverage the norms of the key heads to inform query allocation. Specifically, KDGQA uses the ratios of the key-head norms at each forward pass, while DGQA tracks how these ratios evolve over the course of training. Additionally, we present Perturbed GQA (PGQA) as a case study, which introduces variability into the (static) group formation by subtracting noise from the attention maps. Our experiments with up-trained Vision Transformers, on image classification over CIFAR-10, CIFAR-100, Food101, and Tiny ImageNet, demonstrate the promise of these variants in improving upon the original GQA through more informed and adaptive grouping mechanisms: in particular, ViT-L achieves accuracy gains of up to 8% with DGQA compared to GQA and the other variants. We further analyze the impact of the number of key-value heads on performance, underscoring the importance of exploiting query-key affinities.
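To make the norm-driven grouping idea concrete, the following is a minimal sketch in plain Python of a KDGQA-style allocation step, assuming each key head is represented by a flat weight vector; the function name and the proportional-rounding scheme are illustrative assumptions, not the paper's exact implementation:

```python
import math

def allocate_queries_by_key_norms(key_heads, num_query_heads):
    """Sketch of norm-based query allocation (hypothetical helper):
    assign query heads to key-value heads in proportion to the
    L2 norm of each key head, as KDGQA does per forward pass."""
    # L2 norm of each key head's (flattened) weights
    norms = [math.sqrt(sum(w * w for w in head)) for head in key_heads]
    total = sum(norms)
    ratios = [n / total for n in norms]

    # Proportional allocation, floored to integers
    alloc = [int(r * num_query_heads) for r in ratios]

    # Hand out any leftover query heads to the largest fractional remainders
    frac = [r * num_query_heads - a for r, a in zip(ratios, alloc)]
    leftover = num_query_heads - sum(alloc)
    for i in sorted(range(len(alloc)), key=lambda i: -frac[i])[:leftover]:
        alloc[i] += 1
    return alloc
```

For example, with two key heads of norms 5 and 1 and twelve query heads, the head with the larger norm receives ten queries and the other receives two. DGQA would differ only in when the norms are read: across training rather than at a single forward pass.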