In this paper, we introduce "InfiAgent-DABench", the first benchmark specifically designed to evaluate LLM-based agents on data analysis tasks. The benchmark consists of DAEval, a dataset of 311 data analysis questions derived from 55 CSV files, and an agent framework for evaluating LLMs as data analysis agents. We adopt a format-prompting technique to ensure that questions are closed-form and can be evaluated automatically. Our extensive benchmarking of 23 state-of-the-art LLMs uncovers the current challenges they encounter in data analysis tasks. In addition, we have developed DAAgent, a specialized agent trained on instruction-tuning datasets. The evaluation datasets and toolkits for InfiAgent-DABench are released at https://github.com/InfiAgent/InfiAgent.
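To illustrate how closed-form answers make automatic evaluation possible, the following is a minimal sketch rather than the benchmark's actual implementation. It assumes a hypothetical convention in which the prompt asks the model to report each value as `@name[value]`; the pattern and the names `parse_answers` and `is_correct` are illustrative assumptions, not taken from the released toolkit.

```python
import re
from typing import Dict

# Assumed (hypothetical) answer convention: each value is reported as @name[value].
ANSWER_PATTERN = re.compile(r"@(\w+)\[([^\]]*)\]")


def parse_answers(response: str) -> Dict[str, str]:
    """Extract every @name[value] pair from a free-text model response."""
    return {name: value.strip() for name, value in ANSWER_PATTERN.findall(response)}


def is_correct(response: str, expected: Dict[str, str]) -> bool:
    """Return True only if every expected field is present and matches exactly."""
    parsed = parse_answers(response)
    return all(parsed.get(name) == value for name, value in expected.items())
```

Because the expected answer is a fixed set of named values, a response can be scored by exact string match without a human or model judge. A real evaluator would likely also need numeric tolerance and normalization, which this sketch omits.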