SkillAudit: Ground-Truth-Free Skill Evolution via Paired Trajectory Auditing

Agent skills are structured procedural packages that guide frozen LLM agents in specialized workflows. Skills rarely remain sufficient after deployment: edge cases, API changes, and deployment constraints become visible only through use, making skill evolution a practical necessity. Existing methods depend on privileged feedback such as held-out validation scores, hidden test outcomes, or environment rewards, signals that are often unavailable when a practitioner has only a task description and workspace data. We introduce SkillAudit, a framework for evolving agent skills without ground-truth feedback. The key idea is paired trajectory auditing: at each iteration, the same task is executed with and without the candidate skill, isolating how the skill changes agent behavior without external labels. To turn behavioral differences into edit guidance, SkillAudit uses Process-Aligned Contrastive Evaluation (PACE), a cluster of evaluators that maps trajectory divergences to diagnostic signals linked to specific passages in the skill document. A structural verifier, compiled once from the task specification and then fixed, checks task constraints and rolls back harmful updates. SkillAudit routes edits through two pipelines: Refine removes noisy or irrelevant guidance from broadly useful skills, while Repair replaces passages that conflict with the task. Across 89 containerized tasks spanning 8 professional domains, SkillAudit achieves 73.9% average task reward, outperforming both an agent without skills (40.9%) and the static expert skill (56.7%). These gains are obtained without accessing hidden tests, reference solutions, or external scoring functions during evolution.
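The loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: every name here (`Diagnostic`, `refine`, `repair`, `evolve`, and the `toy_*` helpers) is hypothetical, a skill is modeled as a list of passages, a trajectory as a list of action strings, and the toy audit uses keyword rules as a stand-in for the PACE evaluators, whose internals the abstract does not specify.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

# All names are hypothetical; the abstract does not define an API.
Skill = List[str]        # a skill document modeled as a list of passages
Trajectory = List[str]   # one agent run modeled as a list of action strings


@dataclass
class Diagnostic:
    passage: str   # skill passage the divergence is attributed to
    kind: str      # "noise" routes to Refine, "conflict" routes to Repair
    fix: str = ""  # replacement text, used by Repair only


def refine(skill: Skill, d: Diagnostic) -> Skill:
    """Refine: remove a noisy or irrelevant passage."""
    return [p for p in skill if p != d.passage]


def repair(skill: Skill, d: Diagnostic) -> Skill:
    """Repair: replace a passage that conflicts with the task."""
    return [d.fix if p == d.passage else p for p in skill]


def evolve(
    skill: Skill,
    run_agent: Callable[[Optional[Skill]], Trajectory],
    audit: Callable[[Skill, Trajectory, Trajectory], List[Diagnostic]],
    verify: Callable[[Trajectory], bool],
    iterations: int = 3,
) -> Skill:
    for _ in range(iterations):
        # Paired execution: same task, with and without the candidate skill.
        with_skill = run_agent(skill)
        without_skill = run_agent(None)
        diagnostics = audit(skill, with_skill, without_skill)
        if not diagnostics:
            break
        candidate = skill
        for d in diagnostics:
            edit = refine if d.kind == "noise" else repair
            candidate = edit(candidate, d)
        # The verifier is fixed across iterations; a candidate that fails
        # it is discarded, which keeps the previous skill (rollback).
        if verify(run_agent(candidate)):
            skill = candidate
    return skill


# Toy stand-ins so the sketch runs without an LLM or a container.
def toy_run_agent(skill: Optional[Skill]) -> Trajectory:
    steps = [f"follow:{p}" for p in (skill or [])]
    return ["read_input"] + steps + ["write_output"]


def toy_audit(skill: Skill, with_skill: Trajectory,
              without_skill: Trajectory) -> List[Diagnostic]:
    # Actions that appear only when the skill is present.
    diverging = [a for a in with_skill if a not in without_skill]
    found = []
    for action in diverging:
        passage = action[len("follow:"):]
        if "legacy" in passage:
            found.append(Diagnostic(passage, "conflict", fix="use v2 endpoint"))
        elif "unrelated" in passage:
            found.append(Diagnostic(passage, "noise"))
    return found


def toy_verify(trajectory: Trajectory) -> bool:
    # Structural constraint: output is written, no legacy call is made.
    return trajectory[-1] == "write_output" and not any(
        "legacy" in a for a in trajectory
    )


start = ["validate schema", "use legacy v1 endpoint", "unrelated style tip"]
evolved = evolve(start, toy_run_agent, toy_audit, toy_verify)
print(evolved)
```

In this toy run, the conflicting passage is replaced and the unrelated one is dropped, while a verifier that rejects every candidate leaves the skill unchanged. No reward or test outcome is consulted at any point; the only signals are the paired trajectories and the fixed structural check.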
