Large language models (LLMs) exploit in-context learning (ICL) to solve tasks with only a few demonstrations, but the mechanisms behind ICL are not yet well understood. Some studies suggest that LLMs only recall concepts already learned during pre-training, while others hint that ICL performs implicit learning over demonstrations. We characterize two ways through which ICL leverages demonstrations. Task recognition (TR) captures the extent to which LLMs can recognize a task through demonstrations, even without ground-truth labels, and apply their pre-trained priors, whereas task learning (TL) is the ability to capture new input-label mappings unseen in pre-training. Using a wide range of classification datasets and three LLM families (GPT-3, LLaMA and OPT), we design controlled experiments to disentangle the roles of TR and TL in ICL. We show that (1) models can achieve non-trivial performance with only TR, and TR does not further improve with larger models or more demonstrations; (2) LLMs acquire TL as the model scales, and TL performance consistently improves with more demonstrations in context. Our findings reveal two different forces behind ICL, and we advocate distinguishing between them in future ICL research because of their distinct nature.