In this paper, we introduce InfiAgent-DABench, the first benchmark specifically designed to evaluate LLM-based agents on data analysis tasks. These tasks require agents to solve complex problems end-to-end by interacting with an execution environment. The benchmark contains DAEval, a dataset of 257 data analysis questions derived from 52 CSV files, and an agent framework that incorporates LLMs to serve as data analysis agents for both serving and evaluation. Since data analysis questions are often open-ended and hard to evaluate without human supervision, we adopt a format-prompting technique that converts each question into a closed-form format so that it can be evaluated automatically. Our extensive benchmarking of 34 LLMs uncovers the current challenges encountered in data analysis tasks. In addition, building on top of our agent framework, we develop a specialized agent, DAAgent, which surpasses GPT-3.5 by 3.9% on DABench. Evaluation datasets and toolkits for InfiAgent-DABench are released at https://github.com/InfiAgent/InfiAgent.
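To illustrate how format-prompting can make an open-ended question automatically checkable, the following is a minimal sketch rather than the benchmark's actual implementation. The `@name[value]` tag syntax, the example question, and the helper names `build_prompt`, `parse_answers`, and `is_correct` are all assumptions introduced here for illustration; the abstract itself does not specify them.

```python
import re

# Hypothetical format constraint appended to an open-ended question.
# The "@name[value]" syntax is an assumption made for this sketch.
FORMAT_HINT = "Answer in the form @mean_age[value], rounded to two decimal places."


def build_prompt(question: str, format_hint: str = FORMAT_HINT) -> str:
    """Attach a closed-form answer constraint to an open-ended question."""
    return f"{question}\n{format_hint}"


def parse_answers(response: str) -> dict[str, str]:
    """Extract every @name[value] pair from the agent's free-text response."""
    pairs = re.findall(r"@(\w+)\[([^\]]*)\]", response)
    return {name: value.strip() for name, value in pairs}


def is_correct(response: str, expected: dict[str, str]) -> bool:
    """Exact-match check: every expected tag must appear with the expected value."""
    parsed = parse_answers(response)
    return all(parsed.get(name) == value for name, value in expected.items())
```

With such a constraint in place, a response like `"The mean is @mean_age[29.70]."` can be scored by string comparison against a labeled answer, without a human judging free-form text.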