Tool learning has attracted widespread interest as a vital means for large language models (LLMs) to interact with the physical world. Current research predominantly emphasizes LLMs' capacity to use tools in well-structured environments while overlooking their stability when confronted with the inevitable noise of the real world. To bridge this gap, we introduce RoTBench, a multi-level benchmark for evaluating the robustness of LLMs in tool learning. Specifically, we establish five external environments, each featuring a different level of noise (i.e., Clean, Slight, Medium, Heavy, and Union), and provide an in-depth analysis of model resilience across three critical phases: tool selection, parameter identification, and content filling. Experiments involving six widely used models underscore the urgent need to enhance the robustness of LLMs in tool learning. For instance, GPT-4's performance drops from 80.00 to 58.10 (a decline of 21.90 points) even though manual (human) accuracy shows no substantial change. More surprisingly, the noise correction capability inherent in the GPT family paradoxically impedes its adaptability in the face of mild noise. In light of these findings, we propose RoTTuning, a strategy that enriches the diversity of training environments to bolster the robustness of LLMs in tool learning. The code and data are available at https://github.com/Junjie-Ye/RoTBench.