Despite the remarkable success of transformer-based models in various real-world tasks, their underlying mechanisms remain poorly understood. Recent studies have suggested that transformers can implement gradient descent as an in-context learner for linear regression problems and have developed various theoretical analyses accordingly. However, these works mostly focus on the expressive power of transformers by designing specific parameter constructions, and so offer little insight into the mechanisms that transformers actually acquire through training. In this study, we consider a sparse linear regression problem and investigate how a trained multi-head transformer performs in-context learning. We experimentally discover that the use of multiple heads follows different patterns across layers: multiple heads are utilized and essential in the first layer, while a single head is usually sufficient for subsequent layers. We provide a theoretical explanation for this observation: the first layer preprocesses the context data, and the following layers execute simple optimization steps based on the preprocessed context. Moreover, we demonstrate that such a preprocess-then-optimize algorithm can significantly outperform naive gradient descent and ridge regression algorithms. Further experimental results support our explanations. Our findings offer insights into the benefits of multi-head attention and contribute to understanding the more intricate mechanisms hidden within trained transformers.