Understanding In-Context Learning in Transformers and LLMs by Learning to Learn Discrete Functions

To understand the in-context learning phenomenon, recent works have adopted a stylized experimental framework and demonstrated that Transformers can learn gradient-based learning algorithms for various classes of real-valued functions. However, the limitations of Transformers in implementing learning algorithms, and their ability to learn other forms of algorithms, are not well understood. Additionally, the degree to which these capabilities are confined to attention-based models is unclear. Furthermore, it remains to be seen whether the insights derived from these stylized settings can be extrapolated to pretrained Large Language Models (LLMs). In this work, we take a step towards answering these questions by demonstrating the following: (a) On a test-bed with a variety of Boolean function classes, we find that Transformers can nearly match the optimal learning algorithm for 'simpler' tasks, while their performance deteriorates on more 'complex' tasks. Additionally, we find that certain attention-free models perform (almost) identically to Transformers on a range of tasks. (b) When provided a teaching sequence, i.e. a set of examples that uniquely identifies a function in a class, we show that Transformers learn more sample-efficiently. Interestingly, our results show that Transformers can learn to implement two distinct algorithms to solve a single task, and can adaptively select the more sample-efficient algorithm depending on the sequence of in-context examples. (c) Lastly, we show that existing LLMs, e.g. LLaMA-2 and GPT-4, can compete with nearest-neighbor baselines on prediction tasks that are guaranteed not to be in their training set.
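
As a rough illustration of the stylized setting described above, the following sketch samples a Boolean function, builds a sequence of in-context examples labeled by it, and predicts a query label with a nearest-neighbor baseline. The choice of conjunctions as the function class, the uniform input distribution, the Hamming-distance 1-nearest-neighbor rule, and all function names are assumptions made for illustration only; they are not taken from the abstract and may differ from the actual experimental setup.

```python
import random


def sample_conjunction(n_vars, rng):
    """Sample a hypothetical conjunction over n_vars Boolean variables.

    Each entry is 1 (variable must be 1), 0 (variable must be 0),
    or None (variable is ignored).
    """
    return [rng.choice([1, 0, None]) for _ in range(n_vars)]


def evaluate(literals, x):
    """Return 1 if input x satisfies every non-ignored literal, else 0."""
    return int(all(lit is None or x[i] == lit for i, lit in enumerate(literals)))


def make_context(literals, n_examples, rng):
    """Build a list of (input, label) in-context examples for one function.

    Inputs are drawn uniformly at random (an assumption of this sketch).
    """
    n_vars = len(literals)
    context = []
    for _ in range(n_examples):
        x = [rng.randint(0, 1) for _ in range(n_vars)]
        context.append((x, evaluate(literals, x)))
    return context


def nearest_neighbor_predict(context, query):
    """Predict the label of the context example closest in Hamming distance.

    Ties are broken in favor of the earliest example in the context.
    """
    def hamming(example):
        x, _ = example
        return sum(a != b for a, b in zip(x, query))

    _, label = min(context, key=hamming)
    return label


if __name__ == "__main__":
    rng = random.Random(0)
    target = sample_conjunction(n_vars=8, rng=rng)
    context = make_context(target, n_examples=20, rng=rng)
    query = [rng.randint(0, 1) for _ in range(8)]
    print("prediction:", nearest_neighbor_predict(context, query))
    print("true label:", evaluate(target, query))
```

In a setup of this kind, a sequence model would receive the same context and query and be scored on the query label, so the nearest-neighbor rule serves only as a simple reference point for comparison.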
