Pretraining task diversity and the emergence of non-Bayesian in-context learning for regression

Pretrained transformers exhibit the remarkable ability of in-context learning (ICL): they can learn tasks from just a few examples provided in the prompt without updating any weights. This raises a foundational question: can ICL solve fundamentally $\textit{new}$ tasks that are very different from those seen during pretraining? To probe this question, we examine ICL's performance on linear regression while varying the diversity of tasks in the pretraining dataset. We empirically demonstrate a $\textit{task diversity threshold}$ for the emergence of ICL. Below this threshold, the pretrained transformer cannot solve unseen regression tasks, instead behaving like a Bayesian estimator with the $\textit{non-diverse pretraining task distribution}$ as the prior. Beyond this threshold, the transformer significantly outperforms this estimator; its behavior aligns with that of ridge regression, corresponding to a Gaussian prior over $\textit{all tasks}$, including those not seen during pretraining. Thus, when pretrained on data with task diversity greater than the threshold, transformers $\textit{can}$ optimally solve fundamentally new tasks in-context. Importantly, this capability hinges on the transformer deviating from the Bayes optimal estimator with the pretraining distribution as the prior. This study also explores the effects of regularization, model capacity, and task structure, and underscores, in a concrete example, the critical role of task diversity, alongside data and model scale, in the emergence of ICL. Code is available at https://github.com/mansheej/icl-task-diversity.
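
To make the two reference estimators in the abstract concrete, the following is a minimal NumPy sketch, not the authors' implementation (see the linked repository for that). It contrasts the posterior-mean predictor under a uniform prior over a finite set of pretraining tasks with ridge regression, which is the posterior mean under a standard Gaussian prior over all task vectors. The function names, the dimensions, the number of pretraining tasks, and the noise variance are all illustrative assumptions chosen for this sketch.

```python
import numpy as np


def ridge_predict(X, y, x_query, noise_var=0.25):
    """Posterior-mean prediction under a N(0, I) prior on the task vector.

    With Gaussian label noise of variance `noise_var`, this is ridge
    regression with regularization strength equal to `noise_var`.
    """
    dim = X.shape[1]
    w_hat = np.linalg.solve(X.T @ X + noise_var * np.eye(dim), X.T @ y)
    return x_query @ w_hat


def discrete_prior_predict(X, y, x_query, tasks, noise_var=0.25):
    """Posterior-mean prediction under a uniform prior on a finite task set.

    `tasks` has shape (num_tasks, dim); each row is one pretraining task.
    """
    # Residuals of the context labels under each candidate task: (num_tasks, k).
    residuals = y[None, :] - tasks @ X.T
    # Gaussian log-likelihood of the context under each task (up to a constant).
    log_w = -0.5 * np.sum(residuals**2, axis=1) / noise_var
    log_w -= log_w.max()  # stabilize the softmax
    weights = np.exp(log_w)
    weights /= weights.sum()
    # The posterior mean is a convex combination of the pretraining tasks.
    w_hat = weights @ tasks
    return x_query @ w_hat


def mean_sq_errors(num_tasks=4, dim=8, context_len=16, trials=200,
                   noise_var=0.25, seed=0):
    """Average squared error of both estimators on unseen Gaussian tasks.

    All default values are hypothetical settings for illustration only.
    Returns (ridge_mse, discrete_prior_mse).
    """
    rng = np.random.default_rng(seed)
    tasks = rng.standard_normal((num_tasks, dim))  # low-diversity "pretraining" set
    errs = np.zeros(2)
    for _ in range(trials):
        w = rng.standard_normal(dim)  # a new task, not in `tasks`
        X = rng.standard_normal((context_len, dim))
        y = X @ w + np.sqrt(noise_var) * rng.standard_normal(context_len)
        x_query = rng.standard_normal(dim)
        target = x_query @ w
        errs[0] += (ridge_predict(X, y, x_query, noise_var) - target) ** 2
        errs[1] += (discrete_prior_predict(X, y, x_query, tasks, noise_var) - target) ** 2
    return errs / trials


if __name__ == "__main__":
    ridge_mse, discrete_mse = mean_sq_errors()
    print(f"ridge MSE on unseen tasks:          {ridge_mse:.3f}")
    print(f"discrete-prior MSE on unseen tasks: {discrete_mse:.3f}")
```

Because the discrete-prior estimator can only return a convex combination of the pretraining task vectors, it cannot fit a task far from that set, whereas ridge regression can. This is the gap the abstract refers to when it says the transformer must deviate from the Bayes optimal estimator for the pretraining distribution in order to solve new tasks.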
