Large Language Models (LLMs) have advanced remarkably in recent years, prompting the exploration of tool learning, which integrates LLMs with external tools to address diverse real-world challenges. Assessing the capability of LLMs to utilise tools requires large-scale and stable benchmarks. However, previous works relied either on hand-crafted online tools of limited scale or on large-scale real online APIs whose status is unstable. To address this problem, we introduce StableToolBench, a benchmark evolving from ToolBench that proposes a virtual API server and a stable evaluation system. The virtual API server contains a caching system and API simulators, which complement each other to alleviate changes in API status. Meanwhile, the stable evaluation system defines solvable pass and win rates, using GPT-4 as the automatic evaluator to eliminate randomness during evaluation. Experimental results demonstrate the stability of StableToolBench, and we further discuss the effectiveness of the API simulators, the caching system, and the evaluator system.
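As a rough illustration of how a caching system and an API simulator could complement each other, the following minimal Python sketch serves a request from the cache when possible, otherwise tries the real API, and falls back to a simulator if the real call fails. The class name `VirtualAPIServer`, the callables `call_real_api` and `simulate`, and the lookup order are assumptions made for illustration; they are not taken from the StableToolBench implementation.

```python
import json
from typing import Callable, Dict

# Hypothetical signature: (api_name, arguments) -> response text.
APICallable = Callable[[str, dict], str]


class VirtualAPIServer:
    """Illustrative sketch: cache lookup, then real API, then simulator."""

    def __init__(self, call_real_api: APICallable, simulate: APICallable):
        self._cache: Dict[str, str] = {}
        self._call_real_api = call_real_api  # assumed to raise on failure
        self._simulate = simulate            # e.g. an LLM-backed simulator

    @staticmethod
    def _key(api_name: str, arguments: dict) -> str:
        # Serialise with sorted keys so argument order does not matter.
        return json.dumps([api_name, arguments], sort_keys=True)

    def call(self, api_name: str, arguments: dict) -> str:
        key = self._key(api_name, arguments)
        if key in self._cache:
            # Cached responses keep repeated requests consistent.
            return self._cache[key]
        try:
            response = self._call_real_api(api_name, arguments)
        except Exception:
            # The real API is unavailable: fall back to the simulator.
            response = self._simulate(api_name, arguments)
        # Store the response so later identical requests are stable.
        self._cache[key] = response
        return response
```

In this sketch, caching the simulated response as well means that a repeated request returns the same text even if the simulator itself is non-deterministic, which is the kind of stability the abstract describes.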