The training and generalization dynamics of the attention mechanism, the core component of the Transformer, remain under-explored, and existing analyses focus primarily on single-head attention. Inspired by the demonstrated benefits of overparameterization in training fully-connected networks, we investigate the potential optimization and generalization advantages of using multiple attention heads. To this end, we derive convergence and generalization guarantees for gradient-descent training of a single-layer multi-head self-attention model, under a suitable realizability condition on the data. We then establish primitive conditions on the initialization that ensure realizability holds. Finally, we demonstrate that these conditions are satisfied for a simple tokenized-mixture model. We expect that the analysis can be extended to various data-model and architecture variations.