As large language models (LLMs) evolve into agentic problem solvers, they increasingly rely on external, reusable skills to handle tasks beyond their native parametric capabilities. In existing agent systems, the dominant strategy for incorporating skills is to explicitly enumerate the available skills within the context window. However, this strategy fails to scale: as skill corpora expand, context budgets are consumed rapidly, and the agent becomes markedly less accurate at identifying the right skill. To address this limitation, we formulate Skill Retrieval Augmentation (SRA), a new paradigm in which agents dynamically retrieve, incorporate, and apply relevant skills from large external skill corpora on demand. To make this problem measurable, we construct a large-scale skill corpus and introduce SRA-Bench, the first benchmark for decomposed evaluation of the full SRA pipeline, covering skill retrieval, skill incorporation, and end-task execution. SRA-Bench contains 5,400 capability-intensive test instances and 636 manually constructed gold skills, which are mixed with web-collected distractor skills to form a large-scale corpus of 26,262 skills. Extensive experiments show that retrieval-based skill augmentation can substantially improve agent performance, validating the promise of the paradigm. At the same time, we uncover a fundamental gap in skill incorporation: current LLM agents tend to load skills at similar rates, regardless of whether a gold skill is retrieved or whether the task actually requires external capabilities. This shows that the bottleneck in skill augmentation lies not only in retrieval but also in the base model's ability to determine which skill to load and when external loading is actually needed. These findings position SRA as a distinct research problem and establish a foundation for the scalable augmentation of capabilities in future agent systems.
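The retrieve-then-incorporate flow described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from SRA-Bench: the `Skill` record, the three-entry `TOY_CORPUS`, the Jaccard token-overlap scorer (a stand-in for whatever retriever a real system would use), and the `min_score` threshold that models the decision of whether to load any skill at all.

```python
from dataclasses import dataclass
from typing import List, Optional, Set, Tuple


@dataclass(frozen=True)
class Skill:
    name: str         # hypothetical identifier
    description: str  # short summary that the retriever matches against


# Hypothetical toy corpus; a real corpus would hold thousands of skills.
TOY_CORPUS = [
    Skill("pdf_table_extraction", "extract tables from pdf files"),
    Skill("chemistry_balancing", "balance chemical reaction equations"),
    Skill("image_resize", "resize and crop image files"),
]


def tokenize(text: str) -> Set[str]:
    """Lowercase whitespace tokenization (deliberately simplistic)."""
    return set(text.lower().split())


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Token-overlap similarity in [0, 1]; 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def retrieve_skills(task: str, corpus: List[Skill], k: int = 2) -> List[Tuple[Skill, float]]:
    """Skill retrieval: return the k highest-scoring skills for the task."""
    query = tokenize(task)
    scored = [(skill, jaccard(query, tokenize(skill.description))) for skill in corpus]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


def select_skill_to_load(candidates: List[Tuple[Skill, float]],
                         min_score: float = 0.2) -> Optional[Skill]:
    """Skill incorporation: load the top candidate only if it scores high enough.

    Returning None models the case where the task needs no external skill.
    """
    if not candidates:
        return None
    best, score = candidates[0]
    return best if score >= min_score else None
```

For a task such as "extract all tables from this pdf report", the sketch retrieves `pdf_table_extraction` and loads it, while for "what is two plus three" no candidate clears the threshold and nothing is loaded. The second function is the part that corresponds to the incorporation gap discussed above: retrieval always returns k candidates, so a separate decision is needed on whether any of them should be loaded.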