Multi-Head Attention (MHA) is a key component of the Transformer. In MHA, attention heads work independently, causing problems such as a low-rank bottleneck in attention score matrices and head redundancy. We propose Dynamically Composable Multi-Head Attention (DCMHA), a parameter- and computation-efficient attention architecture that tackles the shortcomings of MHA and increases the expressive power of the model by dynamically composing attention heads. At the core of DCMHA is a $\it{Compose}$ function that transforms the attention score and weight matrices in an input-dependent way. DCMHA can be used as a drop-in replacement for MHA in any Transformer architecture to obtain the corresponding DCFormer. DCFormer significantly outperforms the Transformer across architectures and model scales in language modeling, matching the performance of models with ~1.7x-2.0x the compute. For example, DCPythia-6.9B outperforms the open-source Pythia-12B on both pretraining perplexity and downstream task evaluation. The code and models are available at https://github.com/Caiyun-AI/DCFormer.
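To make the idea of the $\it{Compose}$ function concrete, below is a minimal NumPy sketch of composing per-head attention scores with a static head-mixing matrix plus an input-dependent low-rank term. The names (`compose`, `W_static`, `W1`, `W2`) and the exact parameterization are illustrative assumptions, not the paper's implementation; the same transform can be applied to post-softmax attention weights as well.

```python
import numpy as np

def compose(scores, query, W_static, W1, W2):
    """Sketch of cross-head composition of attention scores.

    scores:   (H, T, T) per-head attention scores (pre-softmax)
    query:    (T, D) token representations driving the dynamic weights
    W_static: (H, H) input-independent head-mixing matrix
    W1: (D, R), W2: (R, H*H) low-rank projection producing
        per-token dynamic head-mixing weights (R << H*H)
    All names/shapes here are illustrative assumptions.
    """
    H, T, _ = scores.shape
    # Static composition: each output head is a fixed mix of input heads.
    mixed = np.einsum('hij,gh->gij', scores, W_static)
    # Dynamic composition: mixing weights depend on each query token.
    dyn = ((query @ W1) @ W2).reshape(T, H, H)   # (T, H, H)
    mixed += np.einsum('hij,igh->gij', scores, dyn)
    return mixed
```

With `W_static` set to the identity and the dynamic projections zeroed, `compose` reduces to standard independent heads, which is the sense in which DCMHA strictly generalizes MHA.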