As models grow stronger, evaluations have grown more complex, testing multiple skills within one benchmark and even within a single instance. Aggregate accuracy obscures skill-wise performance, however, leaving the rich signal in modern benchmarks under-utilized. We propose an automatic approach that recovers the underlying skills relevant to any evaluation instance by inspecting model-generated rationales. After validating the relevance of rationale-parsed skills and inferring skills for $46$k instances across $12$ benchmarks, we observe that many skills are common across benchmarks, which lets us curate hundreds of skill-slices (i.e., sets of instances testing a common skill). Inspecting accuracy over these slices yields novel insights into model trade-offs: e.g., compared to GPT-4o and Claude 3.5 Sonnet, Gemini 1.5 Pro is on average $18\%$ more accurate in "computing molar mass" but $19\%$ less accurate in "applying constitutional law", even though the overall accuracies of the three models differ by a mere $0.4\%$. Furthermore, we demonstrate the practical utility of our approach by showing that insights derived from skill-slice analysis generalize to held-out instances: routing each instance to the model strongest on the relevant skills yields a $3\%$ accuracy improvement over our $12$-dataset corpus. Our skill-slices and framework open a new avenue in model evaluation, leveraging skill-specific analyses to unlock a more granular and actionable understanding of model capabilities.
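The slice-and-route idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the record layout, the function names `skill_slice_accuracy` and `route`, the rule of averaging slice accuracies over an instance's skills, the fallback behavior, and the toy data with placeholder models `model_a` and `model_b` are all assumptions made for illustration.

```python
from collections import defaultdict


def skill_slice_accuracy(records):
    """Compute accuracy per (model, skill) slice.

    `records` is an iterable of (model, skills, correct) tuples, where
    `skills` lists the skills inferred for one instance (hypothetical layout).
    """
    counts = defaultdict(lambda: [0, 0])  # (model, skill) -> [num_correct, num_total]
    for model, skills, correct in records:
        for skill in skills:
            cell = counts[(model, skill)]
            cell[0] += int(correct)
            cell[1] += 1
    return {key: hits / total for key, (hits, total) in counts.items()}


def route(skills, models, slice_acc, fallback):
    """Pick the model with the highest mean slice accuracy on `skills`.

    Averaging over skills is an assumed scoring rule. Skills with no
    recorded slice for a model are ignored; if no model has any matching
    slice, `fallback` is returned.
    """
    scored = []
    for model in models:
        known = [slice_acc[(model, s)] for s in skills if (model, s) in slice_acc]
        if known:
            scored.append((sum(known) / len(known), model))
    if not scored:
        return fallback
    return max(scored, key=lambda pair: pair[0])[1]


# Toy records with placeholder models; not data from the paper.
TOY_RECORDS = [
    ("model_a", ["computing molar mass"], True),
    ("model_a", ["computing molar mass"], True),
    ("model_a", ["applying constitutional law"], False),
    ("model_b", ["computing molar mass"], False),
    ("model_b", ["computing molar mass"], True),
    ("model_b", ["applying constitutional law"], True),
]
MODELS = ["model_a", "model_b"]
acc = skill_slice_accuracy(TOY_RECORDS)
```

With these toy records, `route(["computing molar mass"], MODELS, acc, fallback="model_a")` selects `model_a`, while an instance tagged with "applying constitutional law" is sent to `model_b`. In practice the slice accuracies would be estimated on one split and the routing evaluated on held-out instances, as the abstract describes.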