Scientific reasoning inherently demands integrating sophisticated toolkits to navigate domain-specific knowledge. Yet current benchmarks largely overlook agents' ability to orchestrate tools for such rigorous workflows. To bridge this gap, we introduce SciAgentGym, a scalable interactive environment featuring 1,780 domain-specific tools across four natural science disciplines, supported by a robust execution infrastructure. Complementing this, we present SciAgentBench, a tiered evaluation suite designed to stress-test agentic capabilities from elementary actions to long-horizon workflows. Our evaluation identifies a critical bottleneck: state-of-the-art models struggle with complex scientific tool use. Even for a leading model like GPT-5, success rates drop sharply from 60.6% to 30.9% as interaction horizons extend, primarily due to failures in multi-step workflow execution. To address this, we propose SciForge, a data synthesis method that models the tool action space as a dependency graph to generate logic-aware training trajectories. Fine-tuned on these trajectories, our SciAgent-8B outperforms the significantly larger Qwen3-VL-235B-Instruct while exhibiting positive cross-domain transfer of scientific tool-use capabilities. These results underscore the promise of next-generation autonomous scientific agents.