Tools are essential for large language models (LLMs) to acquire up-to-date information and take consequential actions in external environments. Existing work on tool-augmented LLMs primarily focuses on broad tool coverage and the flexibility of adding new tools. However, a critical aspect that has been surprisingly understudied is simply how accurately an LLM uses the tools it has been trained on. We find that existing LLMs, including GPT-4 and open-source LLMs specifically fine-tuned for tool use, reach a correctness rate of only 30% to 60%, far from reliable enough for practical use. We propose a biologically inspired method for tool-augmented LLMs, simulated trial and error (STE), that orchestrates three key mechanisms behind successful tool use in biological systems: trial and error, imagination, and memory. Specifically, STE leverages an LLM's 'imagination' to simulate plausible scenarios for using a tool, after which the LLM interacts with the tool to learn from its execution feedback. Short-term and long-term memory are employed to improve the depth and breadth of the exploration, respectively. Comprehensive experiments on ToolBench show that STE substantially improves tool learning for LLMs under both in-context learning and fine-tuning settings, bringing a boost of 46.7% to Mistral-Instruct-7B and enabling it to outperform GPT-4. We also show effective continual learning of tools via a simple experience replay strategy.
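The exploration loop described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the names `explore_tool`, `Episode`, `imagine`, `act`, and `is_done` are hypothetical, the LLM and the tool are replaced by deterministic toy stubs, and the mapping of short-term memory to the trials within one episode and long-term memory to the queries explored across episodes is an assumption made for the sketch.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# One trial: (query, API call made, execution feedback from the tool).
Trial = Tuple[str, str, str]


@dataclass
class Episode:
    """One imagined scenario plus the trials made while trying to solve it."""
    query: str
    trials: List[Trial] = field(default_factory=list)  # short-term memory


def explore_tool(
    imagine: Callable[[List[str]], str],
    act: Callable[[str, List[Trial]], str],
    tool: Callable[[str], str],
    is_done: Callable[[str], bool],
    num_episodes: int = 3,
    max_trials: int = 4,
) -> List[Episode]:
    """Hypothetical STE-style loop: imagine a query, then try, observe, retry."""
    long_term: List[str] = []  # queries explored so far, across episodes
    episodes: List[Episode] = []
    for _ in range(num_episodes):
        # Imagination: propose a scenario, conditioned on past queries (breadth).
        query = imagine(long_term)
        episode = Episode(query)
        for _ in range(max_trials):
            # Trial and error: the next call sees earlier trials (depth).
            call = act(query, episode.trials)
            feedback = tool(call)
            episode.trials.append((query, call, feedback))
            if is_done(feedback):
                break
        long_term.append(query)
        episodes.append(episode)
    return episodes


# --- Toy stand-ins for the LLM and the tool (assumptions, for illustration) ---

def toy_imagine(past: List[str]) -> str:
    # Pick the first candidate query that has not been explored yet.
    candidates = ["sum of 2 and 3", "sum of 10 and 5", "sum of 7 and 1"]
    return next(q for q in candidates if q not in past)


def toy_tool(call: str) -> str:
    # Only accepts calls of the form "add <a> <b>".
    parts = call.split()
    if len(parts) == 3 and parts[0] == "add":
        return "ok: " + str(int(parts[1]) + int(parts[2]))
    return "error: expected 'add <a> <b>'"


def toy_act(query: str, trials: List[Trial]) -> str:
    # First guess uses a wrong API name; after any earlier trial, switch to "add".
    nums = [w for w in query.split() if w.isdigit()]
    name = "add" if trials else "plus"
    return f"{name} {nums[0]} {nums[1]}"


def toy_done(feedback: str) -> bool:
    return feedback.startswith("ok")


episodes = explore_tool(toy_imagine, toy_act, toy_tool, toy_done)
for ep in episodes:
    print(ep.query, "->", [fb for _, _, fb in ep.trials])
```

In a real setting the stubs would be replaced by LLM calls and actual API execution, and the collected trials could then serve as in-context examples or fine-tuning data, the two settings the abstract mentions.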