Compositional Skill Routing for LLM Agents: Decompose, Retrieve, and Compose

LLM agents increasingly rely on external skills (reusable tool specifications), but real-world tasks often require composing several skills rather than selecting a single one. We formalize this as the Compositional Skill Routing problem: given a complex user query and a large skill library, decompose the query into atomic sub-tasks, retrieve the appropriate skill for each sub-task, and compose an executable plan. We present SkillWeaver, a decompose-retrieve-compose framework that combines an LLM task decomposer, a bi-encoder skill retriever with FAISS indexing, and a dependency-aware DAG planner. To support evaluation, we introduce CompSkillBench, a benchmark of 300 compositional queries over 2,209 real MCP server skills spanning 24 functional categories, sourced from the public MCP ecosystem. Our experiments reveal that task decomposition quality is the primary bottleneck: standard LLM decomposition reaches only 34.2% step-level category recall. To address this, we propose Iterative Skill-Aware Decomposition (SAD), a retrieval-augmented feedback loop that iteratively aligns the decomposition with the available skills. In a single iteration, SAD improves decomposition accuracy (DA) from 51.0% to 67.7% (a 32.7% relative gain, Wilcoxon p < 10^-6). An analysis conditioned on DA confirms that correct granularity is a prerequisite for effective retrieval: category recall at rank 1 (CatR@1) rises from 34% to 41% when DA=1. SkillWeaver reduces context window consumption by over 99%, and transfer experiments confirm generalization (+35.6% relative DA gain even when target categories are absent from the retrieval pool).
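
To make the decompose-retrieve-compose flow concrete, the following is a minimal toy sketch and not the SkillWeaver implementation. Everything in it is an assumption made for illustration: the three-skill library, the hand-written decomposition (standing in for the LLM decomposer), the bag-of-words similarity (standing in for the bi-encoder and FAISS index), and the function names `embed`, `retrieve`, and `compose`. Only the overall shape follows the abstract: each sub-task is matched to one skill, and the steps are ordered by their dependencies.

```python
import math
from collections import Counter
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical toy skill library: skill name -> description.
SKILLS = {
    "web_search": "search the web for pages matching a query",
    "summarize_text": "summarize a long text document into key points",
    "send_email": "send an email message to a recipient",
}

def embed(text):
    """Toy bag-of-words vector; a stand-in for a learned bi-encoder."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0

def retrieve(sub_task):
    """Return the best-matching skill name (brute-force scan, not a real index)."""
    query = embed(sub_task)
    return max(SKILLS, key=lambda name: cosine(query, embed(SKILLS[name])))

def compose(sub_tasks, depends_on):
    """Order sub-tasks topologically and attach one retrieved skill to each."""
    order = TopologicalSorter(depends_on).static_order()
    return [(step, retrieve(sub_tasks[step])) for step in order]

# Hand-written decomposition standing in for the LLM decomposer's output.
sub_tasks = {
    "s1": "search the web for recent papers",
    "s2": "summarize the text document",
    "s3": "send an email with the summary",
}
# Each step maps to the set of steps it depends on (a simple chain here).
depends_on = {"s1": set(), "s2": {"s1"}, "s3": {"s2"}}

plan = compose(sub_tasks, depends_on)
for step, skill in plan:
    print(step, "->", skill)
```

In this sketch the plan is a linear chain, so the topological order is unique; with branching dependencies, `static_order()` returns one valid order among several.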
