In-context learning (ICL) performs tasks by prompting a large language model (LLM) with an instruction and a small set of annotated examples called demonstrations. Recent work has shown that the precise details of the inputs used in the ICL prompt significantly affect performance, which has motivated the development of instruction selection algorithms. The effect of instruction choice, however, is severely underexplored: existing analyses are restricted to narrow subsets of models and tasks, which limits the generalizability of their insights. We develop InstructEval, an ICL evaluation suite, to conduct a thorough assessment of these techniques. The suite includes 13 open-source LLMs of varying scales from four model families and covers nine tasks across three categories. Using the suite, we evaluate the relative performance of seven popular instruction selection methods over five metrics relevant to ICL. Our experiments reveal that curated manually written instructions, or simple instructions without any task-specific descriptions, often elicit better overall ICL performance than automatic instruction-induction methods, pointing to a lack of generalizability among the latter. We release our evaluation suite for benchmarking instruction selection approaches and to enable more generalizable methods in this space.