In this paper, we introduce InfiAgent-DABench, the first benchmark specifically designed to evaluate LLM-based agents on data analysis tasks. These tasks require agents to solve complex problems end-to-end by interacting with an execution environment. The benchmark contains DAEval, a dataset of 257 data analysis questions derived from 52 CSV files, and an agent framework that uses LLMs as data analysis agents for both serving and evaluation. Since data analysis questions are often open-ended and hard to evaluate without human supervision, we adopt a format-prompting technique that converts each question into a closed-form format so that it can be evaluated automatically. Our extensive benchmarking of 34 LLMs uncovers the current challenges encountered in data analysis tasks. In addition, building on top of our agent framework, we develop a specialized agent, DAAgent, which surpasses GPT-3.5 by 3.9% on DABench. Evaluation datasets and toolkits for InfiAgent-DABench are released at https://github.com/InfiAgent/InfiAgent.
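To illustrate why a closed-form format makes automatic evaluation possible, the following is a minimal sketch, not the benchmark's actual evaluation code. It assumes a hypothetical convention in which the agent is prompted to end its response with tagged answers of the form `@name[value]`. The tag syntax, the function names, and the numeric tolerance are all assumptions made for illustration.

```python
import re
from typing import Dict

# Assumed (hypothetical) answer convention: "@answer_name[answer_value]".
ANSWER_PATTERN = re.compile(r"@(\w+)\[([^\]]*)\]")


def parse_answers(response: str) -> Dict[str, str]:
    """Extract every tagged answer from a free-form model response."""
    return {name: value.strip() for name, value in ANSWER_PATTERN.findall(response)}


def _values_match(predicted: str, gold: str, tol: float) -> bool:
    """Compare numerically when both values parse as floats, else as text."""
    try:
        return abs(float(predicted) - float(gold)) <= tol
    except ValueError:
        return predicted.strip().lower() == gold.strip().lower()


def is_correct(response: str, expected: Dict[str, str], tol: float = 1e-6) -> bool:
    """Return True only if every expected answer is present and matches."""
    predicted = parse_answers(response)
    return all(
        name in predicted and _values_match(predicted[name], gold, tol)
        for name, gold in expected.items()
    )
```

Because each answer is reduced to a named value, scoring becomes a lookup and comparison rather than a human judgment. For example, `is_correct("The mean fare is @mean_fare[34.65]", {"mean_fare": "34.65"})` checks a single numeric answer without reading the surrounding prose.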