Skill Is Not Document: A Query-Conditional Benchmark and Two-Stage Retriever for LLM Agent Skill Routing

LLM agents often solve complex tasks by composing skills, making skill retrieval a front-end component of agent systems. Unlike document retrieval, top-K correctness in skill retrieval depends not only on the relevance of each query-skill pair, but also on whether the retrieved skills can work together under the query. This query-conditioned "skill compatibility" cannot be recovered from independent relevance alone. However, LLM-based synthesis pipelines already produce a useful signal for it: the LLM's own rejection decisions, which specify which skills should not be retrieved together for a given query, but are usually discarded as low-quality data. We propose Reject-as-Resource Retriever (R3) and construct R3-Skill, a bilingual (Chinese-English) benchmark for agent skill routing. R3-Skill covers four language directions and uses LLM-rewritten queries that better approximate user requests; its test-set ground truth is verified by multiple experts. It contains 10,246 skills grouped into 8 thematic super-domains, 41,592 accepted queries, and 32,828 LLM-rejected annotations, further organized into an 8-class rejection-reason taxonomy. R3-Skill keeps this normally discarded rejection signal and uses it as compatibility supervision. On R3-Skill, we train a two-stage retriever consisting of R3-Embedding and R3-Reranker. Gradient analysis explains why this query-conditional signal is weak when injected into the tested bi-encoder objective under bilateral balancing, while a cross-encoder can use it as graded ranking supervision; R3-Skill ablations support this split. The R3-Embedding + R3-Reranker pipeline reaches Hit@1 = 0.7521, NDCG@10 = 0.8173 and Set-Compat = 0.3188 on R3-Skill. The dataset, model weights, and evaluation scripts will be open-sourced.
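For readers unfamiliar with the two standard ranking metrics reported above, the following is a minimal sketch of Hit@K and NDCG@K under binary relevance. It assumes each query has a set of ground-truth skill IDs and that the retriever returns a ranked list without duplicates; the function names `hit_at_k` and `ndcg_at_k` are illustrative, and the benchmark's own evaluation scripts may define these metrics differently (for example, with graded gains). Set-Compat is not sketched because its definition is not given in this abstract.

```python
import math


def hit_at_k(ranked, relevant, k=1):
    """Return 1.0 if any of the top-k ranked skill IDs is in the relevant set."""
    return 1.0 if any(skill in relevant for skill in ranked[:k]) else 0.0


def ndcg_at_k(ranked, relevant, k=10):
    """Binary-relevance NDCG@k (assumption: gain is 1 for relevant, 0 otherwise)."""
    # Discounted gain of the relevant skills that appear in the top k.
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, skill in enumerate(ranked[:k])
        if skill in relevant
    )
    # Ideal ordering places every relevant skill at the top of the list.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical skill IDs, used only for illustration.
ranked_skills = ["pdf_reader", "web_search", "table_extractor"]
gold_skills = {"web_search"}
print(hit_at_k(ranked_skills, gold_skills, k=1))
print(ndcg_at_k(ranked_skills, gold_skills, k=10))
```

Averaging these per-query values over a test set gives corpus-level scores of the kind reported above. Note that both metrics score each retrieved skill independently, which is why a separate set-level measure is needed to capture whether the retrieved skills work together.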
