Transformer models, notably large language models (LLMs), have the remarkable ability to perform in-context learning (ICL): they can carry out new tasks when prompted with unseen input-output examples, without any explicit model training. In this work, we study how effectively transformers can draw on their pretraining data mixture, composed of multiple distinct task families, to identify and learn new tasks in-context, both inside and outside the pretraining distribution. Building on previous work, we investigate this question in a controlled setting, where we study transformer models trained on sequences of $(x, f(x))$ pairs rather than natural language. Our empirical results show that transformers demonstrate near-optimal unsupervised model selection capabilities: when the task families are well-represented in their pretraining data, they can first identify different task families in-context and then learn within them in-context. However, when presented with tasks or functions that lie outside the domain of their pretraining data, transformers exhibit various failure modes, and their generalization degrades even on simple extrapolation tasks. Together, our results highlight that the impressive ICL abilities of high-capacity sequence models may be more closely tied to the coverage of their pretraining data mixtures than to inductive biases that create fundamental generalization capabilities.