While LLM-based agents, which use external tools to solve complex problems, have made significant progress, benchmarking their abilities is challenging, which hinders a clear understanding of their limitations. In this paper, we propose an interactive evaluation framework, named CIBench, to comprehensively assess LLMs' ability to utilize code interpreters for data science tasks. The framework includes an evaluation dataset and two evaluation modes. The dataset is constructed via an LLM-human cooperative approach and simulates an authentic workflow through consecutive, interactive IPython sessions. The two evaluation modes assess LLMs' ability with and without human assistance. We conduct extensive experiments with 24 LLMs on CIBench and provide valuable insights for improving future LLMs' code interpreter utilization.
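To make the evaluation setup concrete, the sketch below shows one way a consecutive, interactive session could be driven and scored: each model-generated code cell executes in a shared namespace (mimicking a persistent IPython kernel), so later steps can build on earlier state. The `model.generate` interface and the exact-match check are illustrative assumptions, not CIBench's actual implementation.

```python
import contextlib
import io

def run_session(model, questions, references):
    """Drive one consecutive session and return the fraction of steps passed."""
    namespace = {}  # persistent state across cells, like an IPython kernel
    history, passed = [], 0
    for question, reference in zip(questions, references):
        # Hypothetical model API: generate a code cell given the dialogue so far.
        code = model.generate(question, history)
        buffer = io.StringIO()
        try:
            with contextlib.redirect_stdout(buffer):
                exec(code, namespace)  # run in the shared namespace
            output = buffer.getvalue().strip()
        except Exception as exc:  # execution errors count as step failures
            output = f"Error: {exc}"
        passed += output == reference  # naive exact-match scoring for illustration
        history.append((question, code, output))
    return passed / len(questions)
```

The with-human-assistance mode could be layered on top of this loop by injecting a corrected code cell whenever a step fails, before moving to the next question.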