Skill-Mix: a Flexible and Expandable Family of Evaluations for AI models

With LLMs shifting their role from statistical modeling of language to serving as general-purpose AI agents, how should LLM evaluations change? Arguably, a key ability of an AI agent is to flexibly combine, as needed, the basic skills it has learned. The capability to combine skills plays an important role in (human) pedagogy and also in a paper on emergence phenomena (Arora & Goyal, 2023). This work introduces Skill-Mix, a new evaluation to measure the ability to combine skills. Using a list of $N$ skills, the evaluator repeatedly picks random subsets of $k$ skills and asks the LLM to produce text combining that subset of skills. Since the number of subsets grows like $N^k$, for even modest $k$ this evaluation will, with high probability, require the LLM to produce text significantly different from any text in the training set. The paper develops a methodology for (a) designing and administering such an evaluation, and (b) automatic grading (plus spot-checking by humans) of the results using GPT-4 as well as the open LLaMA-2 70B model. Administering a version of Skill-Mix to popular chatbots gave results that, while generally in line with prior expectations, contained surprises. Sizeable differences exist among model capabilities that are not captured by their ranking on popular LLM leaderboards ("cramming for the leaderboard"). Furthermore, simple probability calculations indicate that GPT-4's reasonable performance on $k=5$ is suggestive of going beyond "stochastic parrot" behavior (Bender et al., 2021), i.e., it combines skills in ways that it had not seen during training. We sketch how the methodology can lead to a Skill-Mix based ecosystem of open evaluations for AI capabilities of future models.
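The sampling step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the skill names, the prompt wording, and the function names `num_skill_subsets`, `sample_skills`, and `build_prompt` are all assumptions made for illustration. It shows why the space of queries is large: the number of $k$-subsets of $N$ skills is $\binom{N}{k}$, which for fixed $k$ grows like $N^k / k!$.

```python
import math
import random

# Hypothetical skill list for illustration only; not the list used in the paper.
SKILLS = ["metaphor", "modus ponens", "red herring",
          "self-serving bias", "spatial reasoning", "irony"]


def num_skill_subsets(n_skills: int, k: int) -> int:
    """Number of distinct k-subsets of n_skills skills, i.e. C(n_skills, k)."""
    return math.comb(n_skills, k)


def sample_skills(skills: list[str], k: int, rng: random.Random) -> list[str]:
    """Pick k distinct skills uniformly at random (sampling without replacement)."""
    return rng.sample(skills, k)


def build_prompt(chosen: list[str]) -> str:
    """Assemble an illustrative query asking for text that combines the skills.

    The wording is an assumption, not the prompt used in the paper.
    """
    listed = ", ".join(chosen)
    return f"Write a short piece of text that illustrates all of these skills: {listed}."


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the sketch is reproducible
    chosen = sample_skills(SKILLS, k=3, rng=rng)
    print(build_prompt(chosen))
    # With an assumed N = 100 skills and k = 5, the number of subsets is already large.
    print(num_skill_subsets(100, 5))
```

Under the assumed values $N = 100$ and $k = 5$, there are 75,287,520 possible subsets, which gives a sense of why a randomly drawn combination is unlikely to match any specific training text.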

