A semantic gap separates how users describe tasks from how tools are documented. As API ecosystems scale to tens of thousands of endpoints, static retrieval from the initial query alone cannot bridge this gap: the agent's understanding of what it needs evolves during execution, but its tool set does not. We identify this retrieval interface, not planning, as the binding constraint on end-to-end agent performance, and introduce FitText, a training-free framework that makes retrieval dynamic by embedding it directly in the agent's reasoning loop. FitText treats retrieval as test-time evolution of hypotheses: the agent generates natural-language pseudo-tool descriptions (revisable beliefs about the tool it needs), refines them iteratively using retrieval feedback, and explores diverse alternatives through stochastic generation. Memetic Retrieval adds evolutionary selection pressure over candidate descriptions, guided by a tool memory that avoids redundant search. On ToolRet (three domains), FitText's reformulation strategies improve NDCG@5 by 2.7 to 10.6 points over static query retrieval across all base models; on StableToolBench (16,464 APIs) with GPT-5.4-mini, Memetic reaches an 84.3% pooled pass rate, a 26.7-point absolute gain over static query retrieval.
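The loop described above (generate pseudo-tool descriptions, refine them with retrieval feedback, explore stochastic variants, and select among candidates with the help of a memory) can be illustrated with a minimal toy sketch. It is not the FitText implementation, and everything in it is an assumption made for illustration: the three-entry `TOOLS` catalogue, the bag-of-words cosine retriever, the `mutate` function that stands in for LLM generation by borrowing one random word from the top retrieved document, the use of the top retrieval score as fitness, and the interpretation of the tool memory as a cache of descriptions that have already been searched.

```python
import math
import random
from collections import Counter

# Hypothetical toy catalogue; names and docs are invented for illustration.
TOOLS = {
    "geo_forecast": "return hourly weather forecast temperature for coordinates",
    "fx_convert": "convert an amount between two currency codes using exchange rates",
    "flight_search": "search airline flights by origin destination and date",
}


def bow(text):
    """Bag-of-words vector; a stand-in for a real embedding model."""
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(description, k=2):
    """Return the top-k (score, tool_name) pairs for a description."""
    q = bow(description)
    scored = [(cosine(q, bow(doc)), name) for name, doc in TOOLS.items()]
    return sorted(scored, reverse=True)[:k]


def mutate(description, feedback_doc, rng):
    """Assumed stand-in for stochastic LLM rewriting: append one random
    word taken from the documentation of the top retrieved tool."""
    return f"{description} {rng.choice(feedback_doc.split())}"


def memetic_retrieve(query, generations=3, population=4, keep=2, seed=0):
    rng = random.Random(seed)
    memory = {}  # assumed memory: description -> cached retrieval results

    def search(desc):
        if desc not in memory:  # skip redundant searches
            memory[desc] = retrieve(desc)
        return memory[desc]

    def fitness(desc):
        return search(desc)[0][0]  # score of the best-matching tool

    candidates = [query]
    for _ in range(generations):
        offspring = []
        for desc in candidates:
            feedback_doc = TOOLS[search(desc)[0][1]]  # retrieval feedback
            offspring += [mutate(desc, feedback_doc, rng) for _ in range(population)]
        # Selection with elitism: parents compete with their offspring.
        pool = set(candidates + offspring)
        candidates = sorted(pool, key=lambda d: (fitness(d), d), reverse=True)[:keep]

    best = candidates[0]
    score, tool = search(best)[0]
    return best, tool, score


if __name__ == "__main__":
    print(memetic_retrieve("what will the temperature be tomorrow in Paris"))
```

Because parents stay in the selection pool, the best fitness never decreases across generations. In a real system the fitness signal would more plausibly come from the agent's own judgment of the retrieved tools than from raw similarity, which this sketch uses only to stay self-contained.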