Tool learning has generated widespread interest as a vital means of interaction between Large Language Models (LLMs) and the physical world. Current research predominantly emphasizes LLMs' capacity to use tools in well-structured environments while overlooking their stability when confronted with the inevitable noise of the real world. To bridge this gap, we introduce RoTBench, a multi-level benchmark for evaluating the robustness of LLMs in tool learning. Specifically, we establish five external environments, each featuring a different level of noise (i.e., Clean, Slight, Medium, Heavy, and Union), and provide an in-depth analysis of model resilience across three critical phases: tool selection, parameter identification, and content filling. Experiments involving six widely used models underscore the urgent need to enhance the robustness of LLMs in tool learning. For instance, the performance of GPT-4 drops significantly from 80.00 to 58.10 even though manual (human) accuracy shows no substantial change. More surprisingly, the noise correction capability inherent in the GPT family paradoxically impedes its adaptability in the face of mild noise. In light of these findings, we propose RoTTuning, a strategy that enriches the diversity of training environments to bolster the robustness of LLMs in tool learning. The code and data are available at https://github.com/Junjie-Ye/RoTBench.