In-context learning is one of the surprising and useful features of large language models, and how it works is an active area of research. Recent work has devised stylized, meta-learning-like setups in which these models are trained with the language modeling loss on sequences of input-output pairs $(x, f(x))$ drawn from a function class and are then evaluated on unseen functions from the same class. One of the main findings in this line of research is that, for several problems such as linear regression, trained transformers learn algorithms for learning functions in context. However, the inductive biases that lead to this behavior are not well understood. A model with unlimited training data and compute is a Bayesian predictor: it learns the pretraining distribution. It has been shown that high-capacity transformers mimic the Bayesian predictor for linear regression. In this paper, we present empirical evidence that transformers exhibit the behavior of this ideal learner across different linear and non-linear function classes. We also extend the previous setups to the multitask setting, verify that transformers can perform in-context learning there as well, and find that the Bayesian perspective sheds light on this setting too. Finally, using the example of learning Fourier series, we study the inductive bias of in-context learning. We find that in-context learning may or may not exhibit a simplicity bias, depending on the pretraining data distribution.
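To make the notion of a Bayesian predictor for in-context linear regression concrete, the following is a minimal sketch and not the paper's implementation. It assumes a deliberately simplified scalar function class $f(x) = w x$ with a Gaussian prior on $w$ and Gaussian observation noise; the names `sample_task` and `bayes_predict`, the noise level, and the prior scale are all illustrative assumptions. Under these assumptions the posterior mean of $w$ given the in-context examples has a closed form (a ridge-regression estimate), which is the quantity an ideal learner would use to predict at a query point.

```python
import random


def sample_task(rng, n_points, noise_std=0.1):
    """Draw one function f(x) = w * x with w ~ N(0, 1) and noisy examples of it.

    The scalar setup, prior, and noise level are illustrative assumptions.
    """
    w = rng.gauss(0.0, 1.0)
    xs = [rng.gauss(0.0, 1.0) for _ in range(n_points)]
    ys = [w * x + rng.gauss(0.0, noise_std) for x in xs]
    return w, xs, ys


def bayes_predict(xs, ys, x_query, noise_std=0.1, prior_std=1.0):
    """Posterior-mean prediction at x_query given in-context pairs (xs, ys).

    With w ~ N(0, prior_std^2) and y = w * x + N(0, noise_std^2), the posterior
    mean of w is sum(x * y) / (sum(x^2) + noise_std^2 / prior_std^2).
    """
    ridge = (noise_std / prior_std) ** 2
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    w_mean = sxy / (sxx + ridge)
    return w_mean * x_query


if __name__ == "__main__":
    rng = random.Random(0)
    w_true, xs, ys = sample_task(rng, n_points=40)
    # Prediction error at x_query = 1.0 as the number of in-context examples grows.
    for k in (0, 1, 5, 40):
        pred = bayes_predict(xs[:k], ys[:k], x_query=1.0)
        print(k, abs(pred - w_true))
```

With an empty context the predictor falls back to the prior mean (zero), and as more pairs are supplied its prediction moves toward the true function. Comparing a trained transformer's in-context predictions against a reference of this kind is one way to read the claim that the model "mimics the Bayesian predictor".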