Large Language Models (LLMs) have witnessed remarkable advancements in recent years, prompting the exploration of tool learning, which integrates LLMs with external tools to address diverse real-world challenges. Assessing the capability of LLMs to utilise tools requires large-scale and stable benchmarks. However, previous works relied either on hand-crafted online tools of limited scale, or on large-scale real online APIs that suffer from unstable API status. To address this problem, we introduce StableToolBench, a benchmark evolved from ToolBench that proposes a virtual API server and a stable evaluation system. The virtual API server combines a caching system with API simulators, which complement each other to mitigate changes in API status. Meanwhile, the stable evaluation system defines solvable pass and win rates, using GPT-4 as the automatic evaluator to eliminate randomness during evaluation. Experimental results demonstrate the stability of StableToolBench, and further analyses examine the effectiveness of the API simulators, the caching system, and the evaluation system.
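The complementary roles of the cache and the API simulators can be pictured as a simple fallback chain. The sketch below is illustrative only, not the benchmark's actual implementation: the class and the injected callables (real_api, simulator) are hypothetical stand-ins, assuming cached real responses are preferred, the live API is tried next, and an LLM-backed simulator covers the remaining calls.

```python
import hashlib
import json

# A minimal sketch of the cache-plus-simulator fallback described above.
# The injected callables are hypothetical stand-ins for the real API and
# the LLM-based simulator; they are not part of StableToolBench's code.
class VirtualAPIServer:
    def __init__(self, real_api, simulator):
        self.real_api = real_api    # callable(api_name, args) -> dict, may raise on failure
        self.simulator = simulator  # callable(api_name, args) -> dict, LLM-backed mock
        self.cache = {}             # request fingerprint -> previously observed response

    def _key(self, api_name, args):
        # Deterministic fingerprint so identical requests always map to the same entry.
        payload = json.dumps({"api": api_name, "args": args}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def call(self, api_name, args):
        key = self._key(api_name, args)
        if key in self.cache:                         # 1) serve a cached real response
            return self.cache[key]
        try:
            response = self.real_api(api_name, args)  # 2) try the live API and cache it
            self.cache[key] = response
            return response
        except Exception:
            return self.simulator(api_name, args)     # 3) fall back to simulation
```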