Self-attention has greatly contributed to the success of the widely used Transformer architecture by enabling models to learn long-range dependencies in data. To further improve performance, gated attention, which incorporates a gating mechanism into multi-head self-attention, has recently been proposed as a promising alternative. Gated attention has been empirically demonstrated to increase the expressiveness of the low-rank mappings in standard attention and even to eliminate the attention sink phenomenon. Despite its efficacy, a clear theoretical understanding of gated attention's benefits remains lacking in the literature. To close this gap, we rigorously show that each entry of a gated attention matrix or a multi-head self-attention matrix can be written as a hierarchical mixture of experts. By recasting learning as an expert estimation problem, we demonstrate that gated attention is more sample-efficient than multi-head self-attention: while the former needs only polynomially many data points to estimate an expert, the latter requires exponentially many data points to achieve the same estimation error. Furthermore, our analysis provides a theoretical justification for why gated attention yields higher performance when the gate is placed at the output of the scaled dot-product attention or of the value map, rather than at other positions in the multi-head self-attention architecture.
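To make the gate placement discussed above concrete, the following is a minimal NumPy sketch of a single gated attention head, not the paper's exact formulation: an input-dependent sigmoid gate multiplies the output of scaled dot-product attention (SDPA) elementwise, which is the placement the abstract identifies as most effective. The function name and weight-matrix arguments (`Wq`, `Wk`, `Wv`, `Wg`) are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_attention_head(X, Wq, Wk, Wv, Wg):
    """One gated attention head (illustrative sketch).

    X: (seq_len, d_model) input; Wq, Wk, Wv: (d_model, d_head) projections;
    Wg: (d_model, d_head) gate projection. The sigmoid gate computed from X
    multiplies the SDPA output elementwise (gate at the SDPA output).
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_head = Q.shape[-1]
    A = softmax(Q @ K.T / np.sqrt(d_head), axis=-1)  # attention matrix
    sdpa_out = A @ V                                 # standard SDPA output
    G = sigmoid(X @ Wg)                              # input-dependent gate in (0, 1)
    return G * sdpa_out                              # elementwise gating
```

Placing the gate after the value map instead would amount to computing `A @ (G * V)`; both placements modulate the head's output token-wise rather than altering the attention weights themselves.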