In-Context Learning (ICL) is a critical capability of Large Language Models (LLMs), as it empowers them to comprehend and reason across interconnected inputs. Evaluating the ICL ability of LLMs can improve how they are applied and deepen our understanding of how this ability is acquired during training. However, existing evaluation frameworks primarily focus on language abilities and knowledge, often overlooking the assessment of ICL ability. In this work, we introduce the ICLEval benchmark to evaluate the ICL abilities of LLMs, which encompasses two key sub-abilities: exact copying and rule learning. Through the ICLEval benchmark, we demonstrate that ICL ability is universally present in different LLMs, and that model size is not the sole determinant of ICL efficacy. Surprisingly, we observe that ICL abilities, particularly copying, develop early in the pretraining process and stabilize afterward. Our source code and benchmark are released at https://github.com/yiye3/ICLEval.
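To make the two sub-abilities concrete, here is a minimal sketch of what few-shot prompts probing them might look like. The specific task formats, helper names, and the uppercase rule are illustrative assumptions, not the actual ICLEval task definitions.

```python
# Illustrative (hypothetical) prompt builders for the two ICL sub-abilities.
# These formats are assumptions for exposition, not ICLEval's actual tasks.

def make_copy_prompt(items, query):
    """Exact copying: the correct output appears verbatim in the context,
    so the model only needs to copy it."""
    shots = "\n".join(f"Input: {x} -> Output: {x}" for x in items)
    return f"{shots}\nInput: {query} -> Output:"

def make_rule_prompt(examples, query):
    """Rule learning: the model must infer the input-output mapping
    (here, an assumed uppercase rule) from the demonstrations."""
    shots = "\n".join(f"Input: {x} -> Output: {x.upper()}" for x in examples)
    return f"{shots}\nInput: {query} -> Output:"

copy_prompt = make_copy_prompt(["alpha", "bravo"], "charlie")
rule_prompt = make_rule_prompt(["cat", "dog"], "fox")
```

A model that scores well on the first style of prompt but poorly on the second would show strong copying but weak rule learning, which is the kind of distinction the benchmark is designed to surface.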