With LLMs shifting their role from statistical modeling of language to serving as general-purpose AI agents, how should LLM evaluations change? Arguably, a key ability of an AI agent is to flexibly combine, as needed, the basic skills it has learned. The capability to combine skills plays an important role in (human) pedagogy and also in a paper on emergence phenomena (Arora & Goyal, 2023). This work introduces Skill-Mix, a new evaluation to measure the ability to combine skills. Using a list of $N$ skills, the evaluator repeatedly picks random subsets of $k$ skills and asks the LLM to produce text combining that subset of skills. Since the number of subsets grows like $N^k$, for even modest $k$ this evaluation will, with high probability, require the LLM to produce text significantly different from any text in the training set. The paper develops a methodology for (a) designing and administering such an evaluation, and (b) automatic grading (plus spot-checking by humans) of the results using GPT-4 as well as the open LLaMA-2 70B model. Administering a version of Skill-Mix to popular chatbots gave results that, while generally in line with prior expectations, contained surprises. Sizeable differences exist among model capabilities that are not captured by their ranking on popular LLM leaderboards ("cramming for the leaderboard"). Furthermore, simple probability calculations indicate that GPT-4's reasonable performance on $k=5$ is suggestive of going beyond "stochastic parrot" behavior (Bender et al., 2021), i.e., it combines skills in ways that it had not seen during training. We sketch how the methodology can lead to a Skill-Mix based ecosystem of open evaluations for AI capabilities of future models.
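The sampling step described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the skill names, the value $N = 100$, the prompt wording, and the helper names `sample_skill_subsets`, `num_subsets`, and `build_prompt` are all hypothetical. It also shows the counting argument: the number of distinct $k$-subsets is the binomial coefficient $\binom{N}{k}$, which grows like $N^k$ for fixed $k$.

```python
import math
import random


def sample_skill_subsets(skills, k, num_prompts, seed=0):
    """Draw `num_prompts` random k-subsets of `skills` (no repeats within a subset)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return [tuple(rng.sample(skills, k)) for _ in range(num_prompts)]


def num_subsets(n, k):
    """Number of distinct k-subsets of n skills: the binomial coefficient C(n, k)."""
    return math.comb(n, k)


def build_prompt(subset):
    """Hypothetical prompt template; the actual wording used in the paper is not given here."""
    return "Write a short piece of text that illustrates all of these skills: " + ", ".join(subset)


# Hypothetical skill list; N = 100 is an assumption chosen only for illustration.
skills = [f"skill_{i}" for i in range(100)]

subsets = sample_skill_subsets(skills, k=5, num_prompts=3)
prompts = [build_prompt(s) for s in subsets]

# Even for k = 5 the space of combinations is large relative to any fixed corpus.
total_k5 = num_subsets(len(skills), 5)
print(total_k5)
```

With these assumed values the count for $k=5$ already exceeds 75 million subsets, which is the intuition behind the claim that a randomly chosen combination is unlikely to have appeared verbatim in training data.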