The development and evaluation of Large Language Models (LLMs) have largely focused on individual capabilities. However, this overlooks the intersection of multiple abilities across different types of expertise that real-world tasks often require, which we term cross capabilities. To explore this concept systematically, we first define seven core individual capabilities and then pair them to form seven common cross capabilities, each supported by a manually constructed taxonomy. Building on these definitions, we introduce CrossEval, a benchmark comprising 1,400 human-annotated prompts, with 100 prompts for each individual and cross capability. To ensure reliable evaluation, we involve expert annotators to assess 4,200 model responses, gathering 8,400 human ratings with detailed explanations that serve as reference examples. Our findings reveal that, in both static evaluations and attempts to enhance specific abilities, current LLMs consistently exhibit the "Law of the Weakest Link": cross-capability performance is significantly constrained by the weakest component capability. Specifically, across 58 cross-capability scores from 17 models, 38 fall below both of the individual capabilities involved, while the remaining 20 fall between the stronger and weaker capability, though closer to the weaker one. These results highlight the persistent underperformance of LLMs on cross-capability tasks, making the identification and improvement of the weakest capabilities a critical priority for future work on complex, multi-dimensional scenarios.
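To make the "Law of the Weakest Link" statistic concrete, the following minimal Python sketch classifies a cross-capability score relative to its two component capability scores, in the spirit of the analysis summarized above. The function `classify`, the midpoint convention, and all example scores are illustrative assumptions, not part of CrossEval itself.

```python
# Minimal sketch (illustrative, not the paper's code): relate one
# cross-capability score to the scores of its two component capabilities.

def classify(cross: float, cap_a: float, cap_b: float) -> str:
    """Classify a cross-capability score against its two components."""
    weak, strong = sorted((cap_a, cap_b))
    if cross < weak:
        return "below both components"      # 38 of the 58 scores in the paper
    if cross > strong:
        return "above both components"
    # Between the two components: compare against the midpoint
    # (an assumed convention for "closer to") to decide which side.
    midpoint = (weak + strong) / 2
    return ("between, closer to weaker" if cross <= midpoint
            else "between, closer to stronger")  # the remaining 20 were closer to the weaker one

# Made-up scores for illustration only:
print(classify(cross=62.0, cap_a=60.5, cap_b=75.0))  # -> between, closer to weaker
```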