Tool learning methods have enhanced the ability of large language models (LLMs) to interact with real-world applications. Many existing works fine-tune LLMs or design prompts to enable LLMs to select appropriate tools and invoke them correctly to meet user requirements. However, previous works have observed that tool learning performance varies across tasks, datasets, training settings, and algorithms. Without an understanding of these factors, their impact can lead to inconsistent results, inefficient model deployment, and suboptimal tool utilization, ultimately hindering the practical integration and scalability of LLMs in real-world scenarios. Therefore, in this paper, we explore the impact of both internal and external factors on the performance of tool learning frameworks. Through extensive experiments on two benchmark datasets, we draw several insightful conclusions for future work, including the observation that LLMs can benefit significantly from increased trial and exploration. We believe our empirical study provides a new perspective for future tool learning research.