Benchmarking AI Agents for Addressing Scientific Challenges Across Scales

Tianyu Liu, Allen Xin Wang, Antonia Panescu, Lisa Xinyi Chen, Wenxin Long, Xinyu Wei, Yueqian Jing, Ziyao Zeng, Jihang Chen, Sihan Jiang, Ziqing Wang, Siyi Gu, Siyu Chen, Xinyang Hu, Haoran Shao, Leqi Xu, Wangjie Zheng, Zhiyuan Cao, Ada Fang, Botao Yu, Kunyang Sun, Rex Ying, Arman Cohan, Qingyu Chen, Lingzhou Xue, Kaize Ding, Yuanqi Du, Wengong Jin, Zhuoran Yang, Marinka Zitnik, James Zou, Hua Xu, Hongyu Zhao

From arXiv; 6 figures

AI agents are increasingly being developed to accelerate scientific discovery, yet their practical capabilities in real research settings remain poorly understood. Existing benchmarks for AI agents rarely capture the complexity, heterogeneity, and extended reasoning required by scientific work, whereas benchmarks for scientific tasks often reduce research to static, direct problems and provide limited support for interactive evaluation. Here, we introduce SciAgentArena, a systematic benchmark for evaluating AI agents in real-world scientific research scenarios drawn from emerging needs across multiple domains. SciAgentArena comprises approximately 200 tasks with stepwise verification and an interactive, agent-agnostic environment for assessing diverse AI agents. Using this benchmark, we find that current agents can contribute effectively to well-specified data-analysis workflows, particularly when the task structure and evaluation criteria are clear. However, their performance remains uneven across scientific contexts: agents struggle to generate genuinely novel insights, sustain self-directed exploration, and formulate robust solutions to open-ended research questions. We further characterize common failure modes across agents and identify opportunities for improving their reliability, autonomy, and scientific reasoning. SciAgentArena thus provides a practical framework for measuring progress in AI agents for science and for guiding the design of future agents capable of addressing complex scientific challenges. The full code, tasks, and datasets are available at https://sciagentarena.github.io/.
