Skill Is Not Document: A Query-Conditional Benchmark and Two-Stage Retriever for LLM Agent Skill Routing

LLM agents complete complex tasks by composing multiple skills, and skill retrieval is a front-end stage for agents. Skill retrieval differs fundamentally from traditional document retrieval at the supervision level: the joint correctness of the top-K results depends not only on the semantic relevance of each individual query-skill pair, but also on whether the skills retrieved together can collaborate to fulfill the task under the given query. Such "skill compatibility" cannot be derived from independent relevance alone. Existing LLM-based data synthesis pipelines already produce a direct supervision signal for "which skills should not be jointly retrieved under this query", namely the LLM's own rejection decisions, yet this signal is routinely discarded as low-quality data. To address this gap, we propose the Reject-as-Resource Retriever (R3) and construct R3-Skill, a bilingual (Chinese-English) skill retrieval benchmark targeting realistic agent skill routing. R3-Skill spans four language directions, features query phrasings close to real user requests, and is verified through multi-expert cross-checking. On R3-Skill, we build a two-stage retrieval system (R3-Embedding + R3-Reranker) that uses skill compatibility as an explicit training signal. Gradient analysis shows that the "push-away" signal is diluted by bilateral balancing in the bi-encoder but acts as lossless graded ranking supervision in the cross-encoder, which motivates placing it at the cross-encoder stage; ablations on two datasets confirm this choice. The R3-Embedding + R3-Reranker pipeline attains Hit@1 = 0.7714, NDCG@10 = 0.8327, and Set-Compat = 0.3525 on R3-Skill. The dataset, training code, and model weights are released as open source for agent skill routing.
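For readers unfamiliar with the reported ranking metrics, the following is a minimal sketch of how Hit@k and NDCG@k are commonly computed for a single query, assuming binary relevance labels. The function names and the skill identifiers are hypothetical and chosen only for illustration; the abstract does not specify the exact evaluation code or the definition of Set-Compat, so that metric is not sketched here.

```python
import math

def hit_at_k(ranked, relevant, k=1):
    """Return 1.0 if any of the top-k ranked skills is relevant, else 0.0."""
    return 1.0 if any(skill in relevant for skill in ranked[:k]) else 0.0

def ndcg_at_k(ranked, relevant, k=10):
    """NDCG@k with binary gains (assumption: each relevant skill has gain 1)."""
    # Discounted cumulative gain of the actual ranking.
    dcg = sum(
        1.0 / math.log2(position + 2)
        for position, skill in enumerate(ranked[:k])
        if skill in relevant
    )
    # Ideal DCG: all relevant skills placed at the top of the list.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(position + 2) for position in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0

# Hypothetical ranking returned for one query, with one gold skill.
ranked_skills = ["pdf_reader", "web_search", "calendar_api"]
gold_skills = {"web_search"}

print(hit_at_k(ranked_skills, gold_skills, k=1))
print(ndcg_at_k(ranked_skills, gold_skills, k=10))
```

Benchmark-level scores such as those quoted above are typically the mean of these per-query values over all test queries.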
