Multi-head attention (MHA) has become the cornerstone of modern large language models, enhancing representational capacity through parallel attention heads. However, increasing the number of heads under a fixed model dimension inherently weakens the capacity of each individual head, and existing attention mechanisms - whether standard MHA or variants such as grouped-query attention (GQA) and grouped-tied attention (GTA) - simply concatenate the outputs of isolated heads with little interaction among them. To address this limitation, we propose knocking-heads attention (KHA), which enables attention heads to "knock" on each other, facilitating cross-head feature-level interactions before the scaled dot-product attention. This is achieved by applying a shared, diagonally initialized projection matrix across all heads. The diagonal initialization preserves head-specific specialization at the start of training while allowing the model to progressively learn integrated cross-head representations. KHA adds only a minimal number of parameters and FLOPs and can be seamlessly integrated into MHA, GQA, GTA, and other attention variants. We validate KHA by training a 6.1B-parameter MoE model (1.01B activated) on 1T high-quality tokens. Compared to baseline attention mechanisms, KHA yields superior and more stable training dynamics and achieves better performance across downstream tasks.
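To make the described mechanism concrete, the following is a minimal PyTorch sketch of one plausible reading of the abstract: a single shared projection over the flattened (num_heads * head_dim) axis, initialized to the identity (diagonal) so every head is initially left unchanged, applied to the per-head tensors before scaled dot-product attention. The class and function names, and the choice of separate "knocking" projections for queries, keys, and values, are assumptions for illustration rather than the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class KnockingProjection(nn.Module):
    """Hypothetical sketch: a projection shared across all heads that acts
    on the flattened (num_heads * head_dim) axis. The identity (diagonal)
    initialization leaves each head untouched at the start of training,
    and cross-head mixing is learned progressively."""

    def __init__(self, num_heads: int, head_dim: int):
        super().__init__()
        d = num_heads * head_dim
        self.proj = nn.Linear(d, d, bias=False)
        with torch.no_grad():
            self.proj.weight.copy_(torch.eye(d))  # diagonal initialization

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, num_heads, head_dim)
        b, s, h, dh = x.shape
        x = self.proj(x.reshape(b, s, h * dh))  # feature-level mixing across heads
        return x.reshape(b, s, h, dh)


def kha_sdpa(q, k, v, knock_q, knock_k, knock_v):
    """Assumed usage: let the heads 'knock' on each other via the shared
    projections, then run the usual scaled dot-product attention.
    q, k, v: (batch, num_heads, seq, head_dim)."""
    q = knock_q(q.transpose(1, 2)).transpose(1, 2)
    k = knock_k(k.transpose(1, 2)).transpose(1, 2)
    v = knock_v(v.transpose(1, 2)).transpose(1, 2)
    return F.scaled_dot_product_attention(q, k, v, is_causal=True)
```

Because the projection is a single d x d matrix over the concatenated head dimension, the added parameter count and FLOPs are small relative to the existing Q/K/V projections, and the same module could in principle be dropped into MHA, GQA, or GTA (with d adjusted to the number of key/value heads where they differ).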