While LLM-based agents, which use external tools to solve complex problems, have made significant progress, benchmarking their abilities remains challenging, hindering a clear understanding of their limitations. In this paper, we propose an interactive evaluation framework, named CIBench, to comprehensively assess LLMs' ability to utilize code interpreters for data science tasks. Our framework comprises an evaluation dataset and two evaluation modes. The dataset is constructed via an LLM-human cooperative approach and simulates an authentic workflow through consecutive, interactive IPython sessions. The two evaluation modes assess LLMs' ability with and without human assistance. We conduct extensive experiments on 24 LLMs using CIBench and provide valuable insights for improving future LLMs' code interpreter utilization.