CoHyDE: Iterative Co-Training of LLM Rewriter & Dense Encoder for Tool Retrieval

Tool retrieval over large API catalogs is a core bottleneck for LLM agents: user queries arrive in colloquial, often underspecified language, while the catalog uses technical API vocabulary that no fixed encoder can bridge on its own. The two dominant approaches, contrastive encoder fine-tuning and HyDE-style query expansion with a frozen LLM, address this problem from opposite ends and fail in complementary ways: the fine-tuned encoder excels when the query's surface form already matches the catalog but collapses when it does not, while zero-shot HyDE is more robust to underspecified queries yet generates catalog-unaware hypothetical descriptions that degrade retrieval on well-formed queries. We introduce CoHyDE, an iterative procedure that trains the dense encoder and the LLM rewriter as a single co-evolving system: the encoder is retrained with InfoNCE on catalog-style hypothetical descriptions produced by the rewriter, and the rewriter is preference-aligned via DPO against the encoder's retrieval scores, with both sides warm-started on the tool catalog before the loop begins. On a ~10k-tool subset of the ToolBench catalog, three rounds of CoHyDE improve over the strongest single-component baseline by +2.5 pp NDCG@5 on standard queries and +6.3 pp on held-out vague queries, with gains as large as +8 pp on the hardest vague tier. Ablations confirm that co-training is the key ingredient: neither component used in isolation matches CoHyDE on both well-formed and vague queries, with losses of up to 8 pp on vague queries.
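
As a rough illustration of the two objectives named above, the following minimal sketch shows an in-batch InfoNCE loss for the encoder and the standard DPO loss for the rewriter, in plain Python. It is not the authors' implementation: the function names, the temperature and beta values, and in particular the rule in `build_preference_pair` (take the highest- and lowest-scoring rewrite as chosen and rejected) are assumptions made for illustration, since the abstract does not specify how preference pairs are built from the encoder's retrieval scores.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax over a list of floats."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def info_nce(desc_vecs, tool_vecs, temperature=0.05):
    """In-batch InfoNCE: desc_vecs[i] (an embedded hypothetical description)
    should score highest against tool_vecs[i]; the other tools in the batch
    act as negatives. The temperature value is an assumed default."""
    losses = []
    for i, d in enumerate(desc_vecs):
        logits = [sum(a * b for a, b in zip(d, t)) / temperature for t in tool_vecs]
        losses.append(-log_softmax(logits)[i])
    return sum(losses) / len(losses)


def build_preference_pair(rewrites, encoder_scores):
    """Hypothetical pairing rule: the rewrite with the best encoder retrieval
    score is 'chosen', the one with the worst score is 'rejected'."""
    ranked = sorted(zip(encoder_scores, rewrites), key=lambda p: p[0])
    return ranked[-1][1], ranked[0][1]


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one pair: -log sigmoid(beta * margin), where the
    margin compares policy-vs-reference log-probabilities of the chosen and
    rejected rewrites. The beta value is an assumed default."""
    margin = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    # Stable evaluation of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

In one round of such a loop, the rewriter would sample several hypothetical descriptions per query, the current encoder would score them, `build_preference_pair` would supply DPO pairs for the rewriter, and the updated rewriter's descriptions would then feed `info_nce` to retrain the encoder.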
