We present a theoretical analysis of the performance of transformers with softmax attention for in-context learning on linear regression tasks. While the existing literature predominantly focuses on the convergence of transformers with single- or multi-head attention, our research centers on comparing their performance. We conduct an exact theoretical analysis to demonstrate that multi-head attention with a sufficiently large embedding dimension performs better than single-head attention. As the number of in-context examples D increases, the prediction loss of both single- and multi-head attention is O(1/D), with multi-head attention attaining a smaller multiplicative constant. Beyond the simplest data distribution setting, we consider additional scenarios, e.g., noisy labels, local examples, correlated features, and prior knowledge. We observe that, in general, multi-head attention is preferred over single-head attention. Our results verify the effectiveness of the multi-head attention design in the transformer architecture.
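To make the setting concrete, the following is a minimal sketch of in-context linear regression with softmax attention, not the construction analyzed in the paper. It assumes standard Gaussian features and weights, and it reduces each attention head to two hypothetical scalars: `temperature` stands in for the trained key/query parameters and `out_scale` for the value/output parameters. All function names and the two-head parameter choice are illustrative assumptions.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def sample_prompt(num_examples, dim, rng, noise_std=0.0):
    """Draw one prompt: D labeled examples plus a query and its true label.

    Assumption: features and regression weights are standard Gaussian.
    """
    w = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    xs = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_examples)]
    ys = [dot(x, w) + noise_std * rng.gauss(0.0, 1.0) for x in xs]
    x_query = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    return xs, ys, x_query, dot(x_query, w)


def single_head_predict(xs, ys, x_query, temperature=1.0, out_scale=1.0):
    """One softmax head: attend from the query to the D examples and
    read out a scaled, attention-weighted average of their labels."""
    weights = softmax([temperature * dot(x, x_query) for x in xs])
    return out_scale * dot(weights, ys)


def multi_head_predict(xs, ys, x_query, temperatures, out_scales):
    """Sum of several softmax heads, each with its own hypothetical parameters."""
    return sum(
        single_head_predict(xs, ys, x_query, t, s)
        for t, s in zip(temperatures, out_scales)
    )


rng = random.Random(0)
xs, ys, x_query, y_true = sample_prompt(num_examples=64, dim=4, rng=rng)
pred_single = single_head_predict(xs, ys, x_query, temperature=1.0)
# Hypothetical two-head choice: one head attends to examples aligned with
# the query, the other to anti-aligned examples, and their outputs are subtracted.
pred_multi = multi_head_predict(
    xs, ys, x_query, temperatures=[1.0, -1.0], out_scales=[0.5, -0.5]
)
print(pred_single, pred_multi, y_true)
```

With `out_scale=1.0`, a single head returns a convex combination of the observed labels, so its prediction always lies between the smallest and largest label in the prompt. Summing heads with signed output scales removes that restriction. This is one informal way to see why extra heads add flexibility, although the sketch makes no claim about the loss constants derived in the paper.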