While LLM-based agents, which use external tools to solve complex problems, have made significant progress, benchmarking their abilities remains challenging, hindering a clear understanding of their limitations. In this paper, we propose an interactive evaluation framework, named CIBench, to comprehensively assess LLMs' ability to utilize code interpreters for data science tasks. Our framework includes an evaluation dataset and two evaluation modes. The dataset is constructed through an LLM-human cooperative approach and simulates an authentic workflow by leveraging consecutive, interactive IPython sessions. The two evaluation modes assess LLMs' abilities with and without human assistance. We conduct extensive experiments analyzing 24 LLMs on CIBench and provide valuable insights for improving future LLMs' code interpreter utilization.