AgentDS Technical Report: Benchmarking the Future of Human-AI Collaboration in Domain-Specific Data Science

An Luo, Jin Du, Xun Xian, Robert Specht, Fangqiao Tian, Ganghua Wang, Xuan Bi, Charles Fleming, Ashish Kundu, Jayanth Srinivasa, Mingyi Hong, Rui Zhang, Tianxi Li, Galin Jones, Jie Ding

Data science plays a critical role in transforming complex data into actionable insights across numerous domains. Recent developments in large language models (LLMs) and artificial intelligence (AI) agents have automated substantial parts of data science workflows. However, it remains unclear to what extent AI agents can match the performance of human experts on domain-specific data science tasks, and in which aspects human expertise continues to provide advantages. We introduce AgentDS, a benchmark and competition designed to evaluate the performance of both AI agents and human-AI collaboration in domain-specific data science. AgentDS consists of 17 challenges across six industries: commerce, food production, healthcare, insurance, manufacturing, and retail banking. We conducted an open competition involving 29 teams and 80 participants, enabling systematic comparison between human-AI collaborative approaches and AI-only baselines. Our results show that current AI agents struggle with domain-specific reasoning. AI-only baselines perform near or below the median of competition participants, while the strongest solutions arise from human-AI collaboration. These findings challenge the narrative of complete automation by AI and underscore the enduring importance of human expertise in data science, while illuminating directions for the next generation of AI. The AgentDS website is at https://agentds.org/ and the open-source datasets are available at https://huggingface.co/datasets/lainmn/AgentDS .

