Recent studies on Transformer-based large language models have shown that inter-head interaction can enhance the performance of attention. Motivated by this, we propose Multi-head Explicit Attention (MEA), a simple yet effective attention variant that explicitly models cross-head interaction. MEA consists of two key components: a Head-level Linear Composition (HLC) module, which applies separate learnable linear combinations to the key and value vectors across heads, enabling rich inter-head communication; and a head-level Group Normalization layer, which aligns the statistical properties of the recombined heads. MEA is robust during pretraining, permitting larger learning rates that accelerate convergence and ultimately yield lower validation loss and improved performance across a range of tasks. Furthermore, we explore the parameter efficiency of MEA by reducing the number of attention heads and leveraging HLC to reconstruct them as low-rank "virtual heads". This enables a practical key-value cache compression strategy that reduces KV-cache memory usage by 50% with negligible performance loss on knowledge-intensive and scientific reasoning tasks, and only a 3.59% accuracy drop on Olympiad-level mathematical benchmarks.
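The mechanism described above can be sketched numerically. The following is a minimal, illustrative sketch (not the paper's implementation): HLC is modeled as learnable head-mixing matrices applied to the key and value tensors, the head-level Group Normalization normalizes each recombined head's statistics, and the virtual-head idea is shown as reconstructing a full head count from a smaller cached set. All tensor shapes, matrix names (`W_k`, `W_v`, `W_up`), and the random initialization are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
B, H, T, D = 2, 8, 16, 32  # batch, heads, sequence length, head dim (illustrative sizes)

# Per-head key/value tensors, as produced by the usual attention projections.
K = rng.standard_normal((B, H, T, D))
V = rng.standard_normal((B, H, T, D))

# --- HLC sketch: learnable H x H mixing matrices recombine heads for K and V ---
W_k = rng.standard_normal((H, H)) / np.sqrt(H)  # hypothetical learnable parameters
W_v = rng.standard_normal((H, H)) / np.sqrt(H)
K_mix = np.einsum('gh,bgtd->bhtd', W_k, K)  # new head h = linear combo of all heads g
V_mix = np.einsum('gh,bgtd->bhtd', W_v, V)

# --- Head-level group normalization: align each recombined head's statistics ---
def head_group_norm(x, eps=1e-5):
    mean = x.mean(axis=(-2, -1), keepdims=True)  # per (batch, head) statistics
    var = x.var(axis=(-2, -1), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

K_out = head_group_norm(K_mix)
V_out = head_group_norm(V_mix)

# --- Virtual heads: cache only Hc physical heads, reconstruct H heads via HLC ---
Hc = H // 2                                        # 50% KV-cache reduction
K_cache = rng.standard_normal((B, Hc, T, D))       # stand-in for the compressed cache
W_up = rng.standard_normal((Hc, H)) / np.sqrt(Hc)  # low-rank reconstruction matrix
K_virtual = np.einsum('ch,bctd->bhtd', W_up, K_cache)

assert K_out.shape == (B, H, T, D)
assert K_virtual.shape == (B, H, T, D)
```

Note the design implication sketched here: because the cache stores `Hc` heads but attention still operates over `H` reconstructed heads, memory scales with `Hc` while head-level capacity is partially recovered through the mixing matrix.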