From Raw Experience to Skill Consumption: A Systematic Study of Model-Generated Agent Skills

Zisu Huang, Jingwen Xu, Yifan Yang, Ziyang Gong, Qihao Yang, Muzhao Tian, Xiaohua Wang, Changze Lv, Xuemei Gao, Qi Dai, Bei Liu, Kai Qiu, Xue Yang, Dongdong Chen, Xiaoqing Zheng, Chong Luo

Language agents increasingly improve by reusing \emph{skills} -- structured procedural artifacts distilled from past experience. \emph{Domain-level} and \emph{model-generated} skills are especially promising: they enable fast adaptation within a domain by encoding its recurring procedures, and they scale beyond labor-intensive hand-crafting. However, while extraction methods continue to proliferate, understanding of the resulting skills remains limited. No comprehensive study spans the full skill lifecycle -- \textbf{experience generation}, \textbf{skill extraction}, and \textbf{skill consumption} -- to ask whether such skills actually work, when they work, and what makes them succeed or fail. To close this gap, we build a utility-grounded evaluation framework and report systematic experimental results across extractors and target agents on five diverse agentic task domains. We find that model-generated skills are beneficial on average but exhibit non-trivial negative transfer, and that neither extractors nor targets behave uniformly: a model can be a strong extractor yet a weak consumer, or vice versa, and skill utility is independent of model scale or baseline task strength. To explain these patterns, we dissect each lifecycle stage in depth, analyzing how experience composition shapes skill quality, what properties characterize useful skills, and how the same skill transfers across different consumers. Finally, we translate these findings into a concrete \emph{meta-skill} that guides skill extraction toward the features tied to actual utility; it consistently improves skill quality across domains and substantially reduces negative transfer.
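As a rough illustration of what a utility-grounded comparison across extractors and targets could look like, the sketch below treats utility as the change in a target agent's success rate when a skill is supplied, and counts negative transfer as any pairing where that change is below zero. This is an assumed reading, not the paper's actual metric or protocol: the record layout, the model and domain names, and every number are hypothetical placeholders.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records (NOT results from the paper):
# (extractor, target, domain, baseline success rate, success rate with skill)
RESULTS = [
    ("model_a", "model_a", "web", 0.40, 0.55),
    ("model_a", "model_b", "web", 0.60, 0.52),
    ("model_b", "model_a", "web", 0.40, 0.46),
    ("model_b", "model_b", "web", 0.60, 0.60),
]


def utility(baseline, with_skill):
    """Assumed utility definition: success-rate change caused by the skill."""
    return with_skill - baseline


def summarize(results):
    """Aggregate utility overall, per extractor, and per target (consumer)."""
    deltas = []
    by_extractor = defaultdict(list)
    by_target = defaultdict(list)
    for extractor, target, _domain, baseline, with_skill in results:
        delta = utility(baseline, with_skill)
        deltas.append(delta)
        by_extractor[extractor].append(delta)
        by_target[target].append(delta)
    return {
        "mean_utility": mean(deltas),
        # Share of (extractor, target, domain) cells where the skill hurt.
        "negative_transfer_rate": sum(d < 0 for d in deltas) / len(deltas),
        "by_extractor": {k: mean(v) for k, v in by_extractor.items()},
        "by_target": {k: mean(v) for k, v in by_target.items()},
    }


summary = summarize(RESULTS)
```

Grouping the same deltas once by extractor and once by target is what allows a model's quality as a skill producer to be read separately from its quality as a skill consumer.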
