The Closeness of In-Context Learning and Weight Shifting for Softmax Regression

Large language models (LLMs) are known for their exceptional performance in natural language processing, which makes them highly effective in many tasks related to daily life and work. The attention mechanism in the Transformer architecture is a critical component of LLMs, as it allows the model to selectively focus on specific parts of the input. The softmax unit, a key part of the attention mechanism, normalizes the attention scores. Hence, the performance of LLMs on various NLP tasks depends significantly on the attention mechanism with its softmax unit. In-context learning, one of the celebrated abilities of recent LLMs, is an important concept in querying LLMs such as ChatGPT: without further parameter updates, Transformers can learn to predict based on a few in-context examples. However, the reason why Transformers become in-context learners is not well understood. Recently, several works [ASA+22,GTLV22,ONR+22] have studied in-context learning from a mathematical perspective based on a linear regression formulation $\min_x\| Ax - b \|_2$, showing Transformers' capability of learning linear functions in context. In this work, we study in-context learning based on a softmax regression formulation $\min_{x} \| \langle \exp(Ax), {\bf 1}_n \rangle^{-1} \exp(Ax) - b \|_2$ of the Transformer's attention mechanism. We show upper bounds on the data transformations induced by a single self-attention layer and by gradient descent on an $\ell_2$ regression loss for the softmax prediction function, which imply that when training self-attention-only Transformers for fundamental regression tasks, the models learned by gradient descent and by Transformers show great similarity.
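To make the softmax regression formulation concrete, the following is a minimal sketch of the prediction function $\langle \exp(Ax), {\bf 1}_n \rangle^{-1} \exp(Ax)$ and of one gradient-descent step on it. Several choices here are assumptions made for illustration and are not taken from the abstract: the loss is written in the squared form $\frac{1}{2}\| \cdot \|_2^2$, the step size is fixed at 0.1, the data are a toy example, and all function names are hypothetical. It is not the construction analyzed in the paper.

```python
import math

def softmax_prediction(A, x):
    """Return <exp(Ax), 1_n>^{-1} exp(Ax) for A (n rows of length d) and x (length d)."""
    z = [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in A]
    z_max = max(z)  # subtracting the max leaves the ratio unchanged and avoids overflow
    e = [math.exp(z_i - z_max) for z_i in z]
    total = sum(e)
    return [e_i / total for e_i in e]

def loss(A, b, x):
    """Assumed squared form of the objective: 0.5 * ||f(x) - b||_2^2."""
    f = softmax_prediction(A, x)
    return 0.5 * sum((f_i - b_i) ** 2 for f_i, b_i in zip(f, b))

def gradient(A, b, x):
    """Gradient of the loss in x: A^T (diag(f) - f f^T) (f - b)."""
    f = softmax_prediction(A, x)
    r = [f_i - b_i for f_i, b_i in zip(f, b)]
    f_dot_r = sum(f_i * r_i for f_i, r_i in zip(f, r))
    # Entry i of (diag(f) - f f^T) r is f_i * (r_i - <f, r>).
    g_z = [f_i * (r_i - f_dot_r) for f_i, r_i in zip(f, r)]
    return [sum(A[i][j] * g_z[i] for i in range(len(A))) for j in range(len(x))]

def gradient_step(A, b, x, lr=0.1):
    """One plain gradient-descent update on x (step size is an assumption)."""
    g = gradient(A, b, x)
    return [x_j - lr * g_j for x_j, g_j in zip(x, g)]

# Toy data, chosen only for illustration.
A = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b = [0.2, 0.3, 0.5]
x0 = [0.0, 0.0]
x1 = gradient_step(A, b, x0)
```

Here `x1` is the weight vector after a single update from `x0`; comparing `loss(A, b, x0)` with `loss(A, b, x1)` shows how one step changes the fit on this toy instance.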
