ScienceBoard: Evaluating Multimodal Autonomous Agents in Realistic Scientific Workflows

Qiushi Sun, Zhoumianze Liu, Chang Ma, Zichen Ding, Fangzhi Xu, Zhangyue Yin, Haiteng Zhao, Zhenyu Wu, Kanzhi Cheng, Zhaoyang Liu, Jianing Wang, Qintong Li, Xiangru Tang, Tianbao Xie, Xiachong Feng, Xiang Li, Ben Kao, Wenhai Wang, Biqing Qi, Lingpeng Kong, Zhiyong Wu

From arXiv (ICLR 2026 camera-ready version)

Large Language Models (LLMs) have extended their impact beyond Natural Language Processing, substantially fostering the development of interdisciplinary research. Recently, various LLM-based agents have been developed to assist scientific discovery across multiple aspects and domains. Among these, computer-using agents, capable of interacting with operating systems as humans do, are paving the way to automated scientific problem-solving and to handling routine tasks in researchers' workflows. Recognizing the transformative potential of these agents, we introduce ScienceBoard, which encompasses two complementary contributions: (i) a realistic, multi-domain environment featuring dynamic and visually rich scientific workflows with integrated professional software, where agents can autonomously interact via different interfaces to accelerate complex research tasks and experiments; and (ii) a challenging benchmark of 169 high-quality, rigorously validated real-world tasks curated by humans, spanning scientific-discovery workflows in domains such as biochemistry, astronomy, and geoinformatics. Extensive evaluations of agents with state-of-the-art backbones (e.g., GPT-4o, Claude 3.7, UI-TARS) show that, despite some promising results, they still fall short of reliably assisting scientists in complex workflows, achieving only a 15% overall success rate. In-depth analysis further provides valuable insights for addressing current agent limitations and for identifying more effective design principles, paving the way to build more capable agents for scientific discovery. Our code, environment, and benchmark are at https://qiushisun.github.io/ScienceBoard-Home/.
