Transformers have demonstrated remarkable in-context learning (ICL) capabilities. Their strong ICL performance is commonly believed to arise from an ability to implicitly execute certain algorithms on the context, thereby improving prediction and generation. In this work, we investigate how transformers with softmax attention perform in-context learning on linear classification data. We first construct a class of multi-layer transformers that can perform in-context logistic regression, with each layer exactly performing one step of normalized gradient descent on an in-context loss. We then show that the constructed transformer can be obtained by (i) training a single self-attention layer supervised by one-step gradient descent, and (ii) recurrently applying the trained layer to obtain a looped model. We provide training convergence guarantees for the self-attention layer and out-of-distribution generalization guarantees for the looped model. Our results advance the theoretical understanding of the ICL mechanism by showing how softmax transformers can effectively act as in-context learners.
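To make the algorithm named in the abstract concrete, the following is a minimal NumPy sketch of normalized gradient descent on an in-context logistic loss, applied repeatedly in the manner of a looped model. It is not the paper's transformer construction: the function names, the step size `lr`, the number of loops, and the choice to normalize by the Euclidean norm of the gradient are all assumptions made for illustration, and the paper's exact normalization may differ.

```python
import numpy as np


def logistic_loss(w, X, y):
    """Mean logistic loss over the in-context examples (labels in {-1, +1})."""
    margins = y * (X @ w)
    return float(np.mean(np.logaddexp(0.0, -margins)))


def normalized_gd_step(w, X, y, lr=0.5, eps=1e-12):
    """One normalized gradient descent step on the in-context logistic loss.

    Assumption: the gradient is normalized by its Euclidean norm.
    """
    margins = y * (X @ w)
    # sigmoid(-margin), computed stably as exp(-log(1 + exp(margin)))
    weights = np.exp(-np.logaddexp(0.0, margins))
    grad = -(X * (weights * y)[:, None]).mean(axis=0)
    return w - lr * grad / (np.linalg.norm(grad) + eps)


def looped_predict(X, y, x_query, num_loops=20, lr=0.5):
    """Apply the same update num_loops times, then classify the query point."""
    w = np.zeros(X.shape[1])
    for _ in range(num_loops):
        w = normalized_gd_step(w, X, y, lr=lr)
    return float(np.sign(w @ x_query)), w


# Hypothetical in-context task: linearly separable data from a random direction.
rng = np.random.default_rng(0)
w_star = rng.normal(size=5)
X_ctx = rng.normal(size=(40, 5))
y_ctx = np.sign(X_ctx @ w_star)
x_query = rng.normal(size=5)

label, w_hat = looped_predict(X_ctx, y_ctx, x_query)
print("predicted label:", label)
print("loss at zero weights:", logistic_loss(np.zeros(5), X_ctx, y_ctx))
print("loss after looping:", logistic_loss(w_hat, X_ctx, y_ctx))
```

Each call to `normalized_gd_step` plays the role that a single layer plays in the abstract's description, and the loop in `looped_predict` mirrors applying one trained layer recurrently. Because the update is normalized, every step moves the weights by a distance of approximately `lr`, regardless of how small the gradient becomes.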