Data science plays a critical role in transforming complex data into actionable insights across numerous domains. Recent developments in large language models (LLMs) and artificial intelligence (AI) agents have automated significant parts of data science workflows. However, it remains unclear to what extent AI agents can match the performance of human experts on domain-specific data science tasks, and in which aspects human expertise continues to provide advantages. We introduce AgentDS, a benchmark and competition designed to evaluate the performance of both AI agents and human-AI collaboration in domain-specific data science. AgentDS consists of 17 challenges across six industries: commerce, food production, healthcare, insurance, manufacturing, and retail banking. We conducted an open competition involving 29 teams and 80 participants, enabling systematic comparison between human-AI collaborative approaches and AI-only baselines. Our results show that current AI agents struggle with domain-specific reasoning. AI-only baselines perform near or below the median of competition participants, while the strongest solutions arise from human-AI collaboration. These findings challenge the narrative of complete automation by AI and underscore the enduring importance of human expertise in data science, while illuminating directions for the next generation of AI. The AgentDS website is available at https://agentds.org/ and the open-source datasets at https://huggingface.co/datasets/lainmn/AgentDS .