Agent skills (structured, reusable knowledge artifacts that augment LLM agent capabilities) have been rapidly adopted in industry, yet their impact across domains and their use across commercial and open-source models remain under-studied, and no reusable methodology exists for evaluating an individual skill. In this work, we present an evaluation framework that lets a skill author construct realistic tasks targeting the aspects of a skill that matter most to them, and that estimates the skill's utility from how well agents solve those tasks. We then apply this approach at scale to 500 real-world skills, generating 1,000 tasks derived from the skills' content, along with instruction-following and goal-completion scoring rubrics. Using these rubrics, we evaluate how 19 agent-model configurations, spanning proprietary and open-source models, perform on the tasks. Our results show that models vary widely in how closely they adhere to the instructions encoded in skills, leading to substantial differences in the performance gains they obtain. Furthermore, we show that access to a skill significantly changes model behavior relative to the no-skill setup, which makes skills an essential mechanism for encoding opinionated workflows into LLM agents. We release our evaluation dataset to support future work on agent skills.