AgentDS Technical Report: Benchmarking the Future of Human-AI Collaboration in Domain-Specific Data Science

An Luo, Jin Du, Xun Xian, Robert Specht, Fangqiao Tian, Ganghua Wang, Xuan Bi, Charles Fleming, Ashish Kundu, Jayanth Srinivasa, Mingyi Hong, Rui Zhang, Tianxi Li, Galin Jones, Jie Ding

Data science plays a critical role in transforming complex data into actionable insights across numerous domains. Recent developments in large language models (LLMs) and artificial intelligence (AI) agents have automated much of the data science workflow. However, it remains unclear to what extent AI agents can match the performance of human experts on domain-specific data science tasks, and in which aspects human expertise continues to provide an advantage. We introduce AgentDS, a benchmark and competition designed to evaluate both AI agents and human-AI collaboration in domain-specific data science. AgentDS consists of 17 challenges across six industries: commerce, food production, healthcare, insurance, manufacturing, and retail banking. We conducted an open competition involving 29 teams and 80 participants, enabling a systematic comparison between human-AI collaborative approaches and AI-only baselines. Our results show that current AI agents struggle with domain-specific reasoning: AI-only baselines perform below the top quartile of competition participants, while the strongest solutions arise from human-AI collaboration. These findings challenge the narrative of complete automation by AI and underscore the enduring importance of human expertise in data science, while pointing to directions for the next generation of AI. The AgentDS website is at https://agentds.org/ and the open-source datasets are at https://huggingface.co/datasets/lainmn/AgentDS .
