Skill self-evolution methods for LLM agents aim to turn execution trajectories into reusable skill documents, but current pipelines typically learn from one trajectory per task, merge candidate skill patches before checking them, and load the full skill corpus before inference. We propose SkillCAT, a training-free framework that separates this process into three stages. Contrastive Causal Extraction (CCE) samples multiple trajectories for each task and compares same-task success/failure pairs to identify evidence that explains outcome differences. Assessment-Augmented Evolution (AAE) replays each candidate patch on source-task clones and keeps only patches that improve or preserve task outcomes before hierarchical skill patch merging. Topology-Aware Task Execution (TTE) compiles the evolved skills into a routable sub-skill topology, so inference loads only the capability nodes relevant to the task. We evaluate SkillCAT on common agent benchmarks, including SpreadsheetBench, WikiTableQuestions, and DocVQA, and further test cross-model and out-of-distribution generalization. Across these settings, SkillCAT raises the average score over baselines by up to 40.40%, demonstrating reliable skill evolution without model training.
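To make the three stages concrete, the following is a minimal sketch, not the authors' implementation. All names here (`Trajectory`, `contrastive_pairs`, `filter_patches`, `route`) are hypothetical. The acceptance rule (the replayed score must be at least the baseline score on every source-task clone) and the tag-overlap routing rule are assumptions chosen to illustrate "improve or preserve task outcomes" and "load only the relevant capability nodes"; the paper may define both differently.

```python
from dataclasses import dataclass
from itertools import product
from typing import Callable, Dict, List, Sequence, Set, Tuple


@dataclass(frozen=True)
class Trajectory:
    """One sampled execution of a task (hypothetical structure)."""
    task_id: str
    steps: Tuple[str, ...]
    success: bool


def contrastive_pairs(
    trajectories: Sequence[Trajectory],
) -> List[Tuple[Trajectory, Trajectory]]:
    """CCE-style pairing: every (success, failure) pair from the same task."""
    by_task: Dict[str, List[Trajectory]] = {}
    for traj in trajectories:
        by_task.setdefault(traj.task_id, []).append(traj)

    pairs: List[Tuple[Trajectory, Trajectory]] = []
    for group in by_task.values():
        wins = [t for t in group if t.success]
        losses = [t for t in group if not t.success]
        # Tasks with only successes or only failures contribute no pairs.
        pairs.extend(product(wins, losses))
    return pairs


def filter_patches(
    patches: Sequence[str],
    baseline: Dict[str, float],
    replay: Callable[[str, str], float],
) -> List[str]:
    """AAE-style gate (assumed rule): keep a patch only if its replayed
    score is >= the baseline score on every source-task clone."""
    return [
        patch
        for patch in patches
        if all(replay(patch, task) >= score for task, score in baseline.items())
    ]


def route(topology: Dict[str, Set[str]], task_tags: Set[str]) -> List[str]:
    """TTE-style routing (assumed rule): select only the sub-skill nodes
    whose tags overlap the task's tags."""
    return sorted(node for node, tags in topology.items() if tags & task_tags)
```

In this sketch, patches are filtered before any merging takes place, and routing returns node names rather than loading the whole skill corpus, mirroring the ordering the abstract describes.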