In-context learning (ICL) performs tasks by prompting a large language model (LLM) with an instruction and a small set of annotated examples called demonstrations. Recent work has shown that the precise details of the inputs used in the prompt significantly impact ICL, which has motivated the development of instruction selection algorithms. The effect of instruction choice, however, is severely underexplored: existing analyses are restricted to shallow subsets of models and tasks, which limits the generalizability of their insights. We develop an ICL evaluation suite to conduct a thorough assessment of these techniques. The suite includes 13 open-source LLMs of varying scales from 4 distinct model families and covers 9 different tasks, representing a range of task types across 3 categories. In this work, we evaluate the relative performance of 7 popular instruction selection methods using our benchmark over 5 desiderata relevant to ICL. We discover that curated manually written instructions and simple instructions without any task-specific descriptions often elicit better ICL performance than automatic instruction-induction methods, pointing to a lack of generalizability among the latter. We release our evaluation suite for benchmarking instruction selection approaches, and call for more rigorous and generalizable methods in this space.