DSAEval: Evaluating Data Science Agents on a Wide Range of Real-World Data Science Problems

Recent LLM-based data agents aim to automate data science tasks ranging from data analysis to deep learning. However, the open-ended nature of real-world data science problems, which often span multiple taxonomies and lack standard answers, poses a significant challenge for evaluation. To address this, we introduce DSAEval, a benchmark comprising 641 real-world data science problems grounded in 285 diverse datasets, covering both structured and unstructured data (e.g., image and text). DSAEval incorporates three distinctive features: (1) Multimodal Environment Perception, which enables agents to interpret observations from multiple modalities, including text and vision; (2) Multi-Query Interactions, which mirror the iterative and cumulative nature of real-world data science projects; and (3) Multi-Dimensional Evaluation, which provides a holistic assessment across reasoning, code, and results. We systematically evaluate 13 recent advanced agentic LLMs using DSAEval. Our results show that Claude-Sonnet-4.5 achieves the strongest overall performance, MiMo-V2-Pro and GPT-5.2 lead in duration and step efficiency, respectively, and MiMo-V2-Flash is the most cost-effective. We further demonstrate that multimodal perception consistently improves performance on vision-related tasks, with gains ranging from 2.04% to 11.30%. Overall, while current data science agents perform well on structured data and routine data analysis workflows, substantial challenges remain in unstructured domains. Finally, we offer critical insights and outline future research directions.
