Large autoregressive models like Transformers can solve tasks through in-context learning (ICL) without learning new weights, suggesting avenues for efficiently solving new tasks. For many tasks, e.g., linear regression, the data factorizes: examples are independent given a task latent that generates the data, e.g., linear coefficients. While an optimal predictor leverages this factorization by inferring task latents, it is unclear if Transformers implicitly do so or if they instead exploit heuristics and statistical shortcuts enabled by attention layers. Both scenarios have inspired active ongoing work. In this paper, we systematically investigate the effect of explicitly inferring task latents. We minimally modify the Transformer architecture with a bottleneck designed to prevent shortcuts in favor of more structured solutions, and then compare performance against standard Transformers across various ICL tasks. Contrary to intuition and some recent works, we find little discernible difference between the two; biasing towards task-relevant latent variables does not lead to better out-of-distribution performance, in general. Curiously, we find that while the bottleneck effectively learns to extract latent task variables from context, downstream processing struggles to utilize them for robust prediction. Our study highlights the intrinsic limitations of Transformers in achieving structured ICL solutions that generalize, and shows that while inferring the right latents aids interpretability, it is not sufficient to alleviate this problem.
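As a sketch of the factorization referred to above, the structure can be written out explicitly. The notation is assumed here for illustration and is not taken from the paper: $z$ denotes the task latent, $(x_i, y_i)$ the context examples, and $(x_\ast, y_\ast)$ the query. The second and third lines further assume that the inputs carry no information about $z$, as in the linear regression example where $z$ is the coefficient vector.

```latex
% Context examples are conditionally independent given the task latent z
p(x_{1:n}, y_{1:n}) = \int p(z) \prod_{i=1}^{n} p(x_i, y_i \mid z) \, dz

% A predictor that exploits this structure first infers z, then predicts through it
% (assumes the inputs x are drawn independently of z)
p(y_\ast \mid x_\ast, x_{1:n}, y_{1:n})
  = \int p(y_\ast \mid x_\ast, z) \, p(z \mid x_{1:n}, y_{1:n}) \, dz

% Linear regression instance: z holds the coefficients, with noise scale \sigma assumed
y_i = z^\top x_i + \varepsilon_i, \qquad \varepsilon_i \sim \mathcal{N}(0, \sigma^2)
```

Under this reading, a bottleneck that explicitly infers task latents corresponds to approximating $p(z \mid x_{1:n}, y_{1:n})$ from the context, while the downstream prediction corresponds to $p(y_\ast \mid x_\ast, z)$.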