Skill documents, structured natural-language instructions that guide Large Language Model (LLM) agents, are critical to modern agent frameworks, yet LLMs struggle to write skills that actually work. On SkillsBench, human-authored skills improve pass rates by 16.2 percentage points, while LLM-authored skills provide no measurable gain. We introduce SkillAxe, a fully unsupervised framework that enables LLMs to iteratively diagnose and refine their own skills. SkillAxe decomposes skill quality into four interpretable dimensions (quality impact, trigger precision, instruction compliance with fault attribution, and solution-path coverage) and produces structured improvement briefs that require no ground-truth labels, test suites, or environment rewards. On SkillsBench, SkillAxe raises pass rates by a relative 28\% over unrefined LLM-authored skills and closes 47--67\% of the gap to human-authored skills. We further validate SkillAxe as a continuous-improvement engine in the wild on SpreadsheetBench, where a SkillAxe-built skill library learns from past agent trajectories and raises the pass rate from 16.0\% to 52.0\% using only 22 skills.
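As a reading aid for the headline numbers, the two SkillsBench ratios can be written out as follows. The symbols and definitions are assumptions made here for illustration, since the abstract does not state them: $p_{\mathrm{LLM}}$, $p_{\mathrm{ref}}$, and $p_{\mathrm{hum}}$ denote the pass rates obtained with unrefined LLM-authored skills, SkillAxe-refined skills, and human-authored skills, respectively.
\[
  \text{relative gain} = \frac{p_{\mathrm{ref}} - p_{\mathrm{LLM}}}{p_{\mathrm{LLM}}},
  \qquad
  \text{gap closed} = \frac{p_{\mathrm{ref}} - p_{\mathrm{LLM}}}{p_{\mathrm{hum}} - p_{\mathrm{LLM}}}.
\]
Under this reading, the reported 28\% is a relative gain, and the 47--67\% range is the fraction of the gap closed. For SpreadsheetBench, the reported pass rates correspond to an absolute gain of $52.0 - 16.0 = 36.0$ percentage points, or a factor of $52.0 / 16.0 = 3.25$.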