In-context learning (ICL) refers to a remarkable capability of pretrained large language models: they can learn a new task from only a few examples provided at inference time. However, the theoretical understanding of ICL remains largely under-explored, particularly whether transformers can be trained to generalize to unseen examples in a prompt, which requires the model to acquire contextual knowledge of the prompt. This paper investigates the training dynamics of transformers trained by gradient descent through the lens of non-linear regression tasks. Contextual generalization here can be attained by learning the template function of each task in context, where all template functions lie in a linear space spanned by $m$ basis functions. We analyze the training dynamics of one-layer multi-head transformers that predict unlabeled inputs in context given partially labeled prompts, where the labels contain Gaussian noise and the number of examples in each prompt is not sufficient to determine the template. Under mild assumptions, we show that the training loss of a one-layer multi-head transformer converges linearly to a global minimum. Moreover, the transformer effectively learns to perform ridge regression over the basis functions. To our knowledge, this study provides the first provable demonstration that transformers can learn contextual (i.e., template) information to generalize to both unseen examples and tasks when prompts contain only a small number of query-answer pairs.
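The mechanism the abstract attributes to the trained transformer can be illustrated directly: given a short, noisy prompt whose template lies in the span of $m$ basis functions, ridge regression over the basis coefficients yields a prediction for an unlabeled query. The sketch below is a minimal illustration under assumed choices (a polynomial basis as the $m$ basis functions, and a regularization strength tied to the noise level); these specifics are not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed basis: polynomial features [1, x, x^2, x^3] as an
# illustrative choice of the m basis functions spanning the templates.
m = 4
def phi(x):
    return np.vander(x, m, increasing=True)  # shape (len(x), m)

# One task: draw template coefficients w, then a short prompt of
# noisy labeled examples. With n < m the prompt alone does not
# determine the template, mirroring the under-determined regime.
w_true = rng.normal(size=m)
n = 3
x_prompt = rng.uniform(-1, 1, size=n)
noise_std = 0.1
y_prompt = phi(x_prompt) @ w_true + noise_std * rng.normal(size=n)

# Ridge regression over the basis functions (closed form):
#   w_hat = (Phi^T Phi + lam * I)^{-1} Phi^T y
# lam is set heuristically here; it is a hyperparameter in practice.
Phi = phi(x_prompt)
lam = noise_std**2
w_hat = np.linalg.solve(Phi.T @ Phi + lam * np.eye(m), Phi.T @ y_prompt)

# Predict an unlabeled query input using the inferred template.
x_query = np.array([0.5])
y_pred = phi(x_query) @ w_hat
```

Even though `n < m` leaves the template under-determined, the ridge penalty selects a unique coefficient vector, which is what allows a sensible prediction from a prompt that is too short to pin down the template exactly.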