Investigation into In-Context Learning Capabilities of Transformers

Transformers have demonstrated a strong ability for in-context learning (ICL), enabling models to solve previously unseen tasks using only example input-output pairs provided at inference time. While prior theoretical work has established conditions under which transformers can perform linear classification in-context, the empirical scaling behavior that governs when this mechanism succeeds remains insufficiently characterized. In this paper, we conduct a systematic empirical study of in-context learning for Gaussian-mixture binary classification tasks. Building on the theoretical framework of Frei and Vardi (2024), we analyze how in-context test accuracy depends on three factors: the input dimension, the number of in-context examples, and the number of pre-training tasks. Using a controlled synthetic setup and a linear in-context classifier formulation, we isolate the geometric conditions under which models successfully infer task structure from context alone. We additionally investigate the emergence of benign overfitting, where models memorize noisy in-context labels while still generalizing well to clean test data. Through extensive sweeps across dimensionality, sequence length, task diversity, and signal-to-noise regimes, we identify the parameter regions in which this phenomenon arises and characterize how it depends on data geometry and training exposure. Our results provide an empirical map of scaling behavior in in-context classification, highlighting the role of dimensionality, signal strength, and contextual information in determining when in-context learning succeeds and when it fails.
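To make the setup concrete, the following is a minimal, dependency-free sketch of a Gaussian-mixture in-context classification task with noisy context labels and a clean test label. It is an illustration under stated assumptions, not the paper's implementation: the data model x = y * mu + z with z ~ N(0, I_d), the parameters `signal` (the norm of mu) and `flip_prob` (the label-flip rate), and all function names are hypothetical, and the learned weight matrix of the linear in-context classifier is replaced by the identity, so no pre-training is modeled.

```python
import math
import random


def sample_task(rng, d, n_context, signal, flip_prob):
    """Draw one task: a random class mean mu and labeled points x = y * mu + z."""
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(v * v for v in g))
    mu = [signal * v / norm for v in g]  # class mean with norm `signal`

    def draw():
        y = rng.choice((-1.0, 1.0))
        x = [y * m + rng.gauss(0.0, 1.0) for m in mu]
        return x, y

    context = []
    for _ in range(n_context):
        x, y = draw()
        if rng.random() < flip_prob:
            y = -y  # noisy in-context label
        context.append((x, y))
    x_query, y_query = draw()  # the test label stays clean
    return context, x_query, y_query


def predict(context, x_query):
    """Linear in-context rule: sign of <mean_i(y_i * x_i), x_query>.

    Assumption: the trained weight matrix is replaced by the identity.
    """
    d = len(x_query)
    mu_hat = [sum(y * x[j] for x, y in context) / len(context) for j in range(d)]
    score = sum(m * q for m, q in zip(mu_hat, x_query))
    return 1.0 if score >= 0 else -1.0


def in_context_accuracy(d, n_context, signal, flip_prob, n_tasks=500, seed=0):
    """Fraction of freshly sampled tasks whose clean query label is predicted correctly."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_tasks):
        context, x_query, y_query = sample_task(rng, d, n_context, signal, flip_prob)
        hits += predict(context, x_query) == y_query
    return hits / n_tasks
```

Calling `in_context_accuracy` over a grid of `d`, `n_context`, `signal`, and `flip_prob` values gives a toy version of the sweeps described above. Because the weight matrix is fixed here, the dependence on the number of pre-training tasks is outside the scope of this sketch.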
