Agent Skills are structured packages of procedural knowledge that augment large language model (LLM) agents at inference time. Despite their rapid adoption, there is no standard way to measure whether they actually help. We present SkillsBench, a benchmark that currently comprises 87 tasks across 8 domains, paired with curated Skills and deterministic verifiers. Our latest aggregate evaluation runs the 87-task benchmark under matched no-Skills and curated-Skills conditions for 18 model-harness configurations. Curated Skills raise the average pass rate from 33.9% to 50.5% (+16.6 percentage points; 25.5% normalized gain), with configuration-level gains ranging from +4.1 to +25.7 pp. Focused Skills with at most three modules outperform larger or exhaustive bundles, and smaller models with Skills can match larger models without them. SkillsBench establishes paired evaluation as the foundation for rigorous measurement of Skill efficacy on agentic, expertise-heavy work.
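To make the two headline metrics concrete, the sketch below computes the absolute gain in percentage points and a normalized gain from the aggregate pass rates quoted above. The normalized-gain formula used here, gain divided by the remaining headroom to 100%, is an assumption, since the abstract does not define the metric, and the function names are hypothetical. Applied to the aggregate averages, this formula gives about 25.1% rather than the reported 25.5%; one possible explanation, also an assumption, is that the reported figure averages per-configuration normalized gains instead of normalizing the averaged pass rates.

```python
def absolute_gain_pp(baseline_pct: float, skills_pct: float) -> float:
    """Absolute improvement in percentage points (pp)."""
    return skills_pct - baseline_pct


def normalized_gain(baseline_pct: float, skills_pct: float) -> float:
    """Fraction of the remaining headroom (100% - baseline) that was closed.

    ASSUMPTION: this headroom-based definition is illustrative; the source
    does not state how its normalized gain is computed.
    """
    headroom = 100.0 - baseline_pct
    if headroom <= 0:
        raise ValueError("baseline pass rate must be below 100%")
    return (skills_pct - baseline_pct) / headroom


# Aggregate pass rates quoted in the abstract.
baseline, with_skills = 33.9, 50.5

gain = absolute_gain_pp(baseline, with_skills)
norm = normalized_gain(baseline, with_skills)

print(f"absolute gain: {gain:.1f} pp")    # absolute gain: 16.6 pp
print(f"normalized gain: {norm:.1%}")     # normalized gain: 25.1%
```

Normalizing by headroom keeps configurations with a high no-Skills baseline from being penalized for having less room to improve, which is why this kind of metric is often reported alongside the raw percentage-point difference.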