DSAEval: Evaluating Data Science Agents on a Wide Range of Real-World Data Science Problems

Recent LLM-based data agents aim to automate data science tasks ranging from data analysis to deep learning. However, the open-ended nature of real-world data science problems, which often span multiple taxonomies and lack standard answers, poses a significant challenge for evaluation. To address this, we introduce DSAEval, a benchmark comprising 641 real-world data science problems grounded in 285 diverse datasets, covering both structured and unstructured data (e.g., vision and text). DSAEval incorporates three distinctive features: (1) Multimodal Environment Perception, which enables agents to interpret observations from multiple modalities including text and vision; (2) Multi-Query Interactions, which mirror the iterative and cumulative nature of real-world data science projects; and (3) Multi-Dimensional Evaluation, which provides a holistic assessment across reasoning, code, and results. We systematically evaluate 11 advanced agentic LLMs using DSAEval. Our results show that Claude-Sonnet-4.5 achieves the strongest overall performance, GPT-5.2 is the most efficient, and MiMo-V2-Flash is the most cost-effective. We further demonstrate that multimodal perception consistently improves performance on vision-related tasks, with gains ranging from 2.04% to 11.30%. Overall, while current data science agents perform well on structured data and routine data analysis workflows, substantial challenges remain in unstructured domains. Finally, we offer critical insights and outline future research directions to advance the development of data science agents.
