Large Language Models (LLMs) have advanced remarkably in recent years, prompting the exploration of tool learning, which integrates LLMs with external tools to address diverse real-world challenges. Assessing how well LLMs use tools requires benchmarks that are both large-scale and stable. However, previous work has relied either on hand-crafted online tools of limited scale or on large-scale real online APIs whose status is unstable. To address this problem, we introduce StableToolBench, a benchmark evolved from ToolBench that proposes a virtual API server and a stable evaluation system. The virtual API server contains a caching system and API simulators, which complement each other to mitigate changes in API status. Meanwhile, the stable evaluation system defines solvable pass and win rates, using GPT-4 as the automatic evaluator to eliminate randomness during evaluation. Experimental results demonstrate the stability of StableToolBench, and we further discuss the effectiveness of the API simulators, the caching system, and the evaluation system.
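
To illustrate how a caching system and an API simulator might complement each other, the following is a minimal sketch under stated assumptions, not the StableToolBench implementation. The class name `VirtualAPIServer`, its `call` method, and the `simulator` callable are hypothetical; in practice the simulator would presumably be backed by an LLM, and the order in which cache, real API, and simulator are consulted is an assumption here.

```python
from typing import Callable, Dict, Tuple

# Hypothetical cache key: the API name plus its arguments in a hashable form.
CacheKey = Tuple[str, Tuple[Tuple[str, str], ...]]


class VirtualAPIServer:
    """Illustrative only: answer from the cache, else fall back to a simulator."""

    def __init__(self, simulator: Callable[[str, Dict[str, str]], str]) -> None:
        # `simulator` stands in for any component that can produce a
        # plausible response when no cached response exists (assumption).
        self.simulator = simulator
        self.cache: Dict[CacheKey, str] = {}

    def call(self, api_name: str, args: Dict[str, str]) -> str:
        # Sort the arguments so that the same call always maps to the same key.
        key: CacheKey = (api_name, tuple(sorted(args.items())))
        if key not in self.cache:
            # Cache miss: simulate once and store the result, so that
            # repeating the call later returns an identical response.
            self.cache[key] = self.simulator(api_name, args)
        return self.cache[key]
```

In this sketch the cache makes repeated calls reproducible, while the simulator covers calls the cache has never seen, which is one way the two components could be complementary.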