Recent advancements in large language models (LLMs) integrated with external tools and APIs have successfully addressed complex tasks via in-context learning or fine-tuning. Despite this progress, retrieving tools at scale remains challenging due to stringent input-length constraints. In response, we propose a pre-retrieval strategy that selects candidate tools from an extensive repository, effectively framing the problem as a massive tool retrieval (MTR) task. We introduce MTRB (massive tool retrieval benchmark) to evaluate real-world tool-augmented LLM scenarios with a large number of tools. This benchmark is designed for low-resource scenarios and includes a diverse collection of tools whose descriptions have been refined for consistency and clarity. It consists of three subsets, each containing 90 test samples and 10 training samples. To handle the low-resource MTR task, we propose a novel query-tool alignment (QTA) framework that leverages LLMs to rewrite user queries, guided by ranking functions and trained with the direct preference optimization (DPO) method. This approach consistently outperforms existing state-of-the-art models on top-5 and top-10 retrieval across the MTRB benchmark, with improvements of up to 93.28% in Sufficiency@k, a metric that measures whether the tools retrieved within the first k results are adequate for the query. Furthermore, ablation studies validate the efficacy of our framework, highlighting its capacity to improve performance even with limited annotated samples: with just a single annotated sample, it achieves up to a 78.53% improvement in Sufficiency@k. Additionally, QTA exhibits strong cross-dataset generalizability, underscoring its potential for real-world applications.
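The abstract does not give a formal definition of Sufficiency@k beyond "the adequacy of tool retrieval within the first k results." A minimal sketch of one plausible reading, assuming it is the fraction of queries whose full required tool set appears in the top-k retrieved tools (the function name, data layout, and all-or-nothing scoring below are assumptions, not the paper's definition):

```python
from typing import List, Set

def sufficiency_at_k(retrieved: List[List[str]], required: List[Set[str]], k: int) -> float:
    """Fraction of queries whose entire required tool set is covered by the
    top-k retrieved tools. Assumed reading of Sufficiency@k; the paper may
    instead use a partial-credit variant."""
    assert len(retrieved) == len(required)
    hits = sum(
        1 for ranked, gold in zip(retrieved, required)
        if gold.issubset(set(ranked[:k]))
    )
    return hits / len(retrieved)

# Example: two queries, k = 5; only the first query's tool set is fully covered.
retrieved = [["t3", "t7", "t1", "t9", "t2"], ["t4", "t6", "t8", "t5", "t0"]]
required = [{"t1", "t2"}, {"t4", "t9"}]
print(sufficiency_at_k(retrieved, required, k=5))  # 0.5
```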
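The abstract describes QTA as rewriting user queries with an LLM, using ranking functions and DPO as the training signal. A minimal sketch of how such preference data could be assembled, assuming retrieval quality (e.g., Sufficiency@k of retrieving with a rewrite) serves as the ranking score; `generate_rewrites`, `ranking_score`, and the pair format are illustrative, not the paper's actual interface:

```python
from typing import Callable, Dict, List

def build_dpo_pairs(
    queries: List[str],
    generate_rewrites: Callable[[str, int], List[str]],  # LLM rewrite sampler (assumed)
    ranking_score: Callable[[str, str], float],          # retrieval quality of a rewrite (assumed)
    n_samples: int = 8,
) -> List[Dict[str, str]]:
    """Sample candidate rewrites per query, score each with the ranking
    function, and pair the best against the worst as the (chosen, rejected)
    preferences consumed by standard DPO trainers."""
    pairs: List[Dict[str, str]] = []
    for q in queries:
        candidates = generate_rewrites(q, n_samples)
        scored = sorted(candidates, key=lambda r: ranking_score(q, r), reverse=True)
        # Keep only pairs with a strict preference gap between best and worst.
        if len(scored) >= 2 and ranking_score(q, scored[0]) > ranking_score(q, scored[-1]):
            pairs.append({"prompt": q, "chosen": scored[0], "rejected": scored[-1]})
    return pairs
```

The resulting pairs could be fed to any off-the-shelf DPO trainer; per the abstract, the contribution lies in using query-tool alignment as the preference signal, not in modifying the DPO objective itself.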