Unearthing Skill-Level Insights for Understanding Trade-Offs of Foundation Models

With models getting stronger, evaluations have grown more complex, testing multiple skills in one benchmark and even in the same instance at once. However, skill-wise performance is obscured when inspecting aggregate accuracy, under-utilizing the rich signal modern benchmarks contain. We propose an automatic approach to recover the underlying skills relevant for any evaluation instance, by way of inspecting model-generated rationales. After validating the relevance of rationale-parsed skills and inferring skills for $46$k instances over $12$ benchmarks, we observe many skills to be common across benchmarks, resulting in the curation of hundreds of skill-slices (i.e., sets of instances testing a common skill). Inspecting accuracy over these slices yields novel insights on model trade-offs: e.g., compared to GPT-4o and Claude 3.5 Sonnet, on average, Gemini 1.5 Pro is $18\%$ more accurate in "computing molar mass", but $19\%$ less accurate in "applying constitutional law", despite the overall accuracies of the three models differing by a mere $0.4\%$. Furthermore, we demonstrate the practical utility of our approach by showing that insights derived from skill-slice analysis can generalize to held-out instances: when routing each instance to the model strongest on the relevant skills, we see a $3\%$ accuracy improvement over our $12$-dataset corpus. Our skill-slices and framework open a new avenue in model evaluation, leveraging skill-specific analyses to unlock a more granular and actionable understanding of model capabilities.
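The two ideas in the abstract, per-skill slice accuracy and skill-based routing, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the model names `model_a` and `model_b`, the toy records, the function names `slice_accuracy` and `route`, and the scoring rule (mean slice accuracy over an instance's skills). The abstract does not specify how the authors score or route, so this should not be read as their procedure.

```python
from collections import defaultdict


def slice_accuracy(records):
    """Compute per-model accuracy on each skill-slice.

    records: iterable of (model, skills, correct) tuples, where `skills`
    is the list of skills annotated for one evaluation instance.
    Returns {model: {skill: accuracy}}.
    """
    hits = defaultdict(lambda: defaultdict(int))
    totals = defaultdict(lambda: defaultdict(int))
    for model, skills, correct in records:
        for skill in skills:
            totals[model][skill] += 1
            hits[model][skill] += int(correct)
    return {
        model: {skill: hits[model][skill] / n for skill, n in per_skill.items()}
        for model, per_skill in totals.items()
    }


def route(skills, acc, default_model):
    """Pick the model with the highest mean slice accuracy on `skills`.

    Assumed rule (not taken from the paper): average the slice accuracies
    of the skills we have data for; fall back to `default_model` when no
    listed skill has been seen for any model.
    """
    best_model, best_score = default_model, -1.0
    for model in sorted(acc):  # sorted for deterministic tie-breaking
        known = [acc[model][s] for s in skills if s in acc[model]]
        if not known:
            continue
        score = sum(known) / len(known)
        if score > best_score:
            best_model, best_score = model, score
    return best_model


# Hypothetical toy data: both models have the same overall accuracy (6/8
# across the two of them, 3/4 each) but opposite strengths per skill.
records = [
    ("model_a", ["computing molar mass"], True),
    ("model_a", ["computing molar mass"], True),
    ("model_a", ["applying constitutional law"], False),
    ("model_a", ["applying constitutional law"], True),
    ("model_b", ["computing molar mass"], False),
    ("model_b", ["computing molar mass"], True),
    ("model_b", ["applying constitutional law"], True),
    ("model_b", ["applying constitutional law"], True),
]

acc = slice_accuracy(records)
chosen_chem = route(["computing molar mass"], acc, default_model="model_a")
chosen_law = route(["applying constitutional law"], acc, default_model="model_a")
```

In this toy setup the two hypothetical models are indistinguishable by aggregate accuracy, yet the slice table separates them, which is the kind of trade-off the abstract describes. In practice the slice accuracies would be estimated on one split and routing evaluated on held-out instances.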

