In recent years, pre-trained large language models (LLMs) have proven remarkably efficient at in-context learning, a few-shot learning capability that operates at inference time. However, prior work has shown that this capability is sensitive to the choice of few-shot demonstrations. Current accounts of how this capability arises from standard language model pretraining objectives remain disconnected from real-world LLMs. This study examines in-context learning through a Bayesian lens, viewing real-world LLMs as latent variable models. On this premise, we propose an algorithm that selects optimal demonstrations from a set of annotated data using a small LM and then transfers the selected demonstrations directly to larger LMs. We demonstrate significant improvement over baselines, averaged over eight GPT models on eight real-world text classification datasets. We also demonstrate the practical usefulness of our algorithm on GSM8K, a math word problem dataset. Our empirical findings support our hypothesis that LLMs implicitly infer a latent variable containing task information.
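As an illustrative sketch of the latent-variable view (the notation is assumed here for illustration and is not taken from the abstract): let $\theta$ denote a latent variable carrying task information, $D$ the few-shot demonstrations in the prompt, $x$ the test input, and $y$ its label. Assuming that $y$ depends on $D$ only through $\theta$ once $x$ is given, the in-context prediction can be written as a marginal over $\theta$:

```latex
P(y \mid x, D) \;=\; \int_{\Theta} P(y \mid x, \theta)\, P(\theta \mid x, D)\, d\theta
```

Read this way, demonstrations would be helpful to the extent that they concentrate $P(\theta \mid x, D)$ on the value of $\theta$ that corresponds to the intended task, which is one way to interpret the sensitivity to demonstration choice noted above.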