The realization of autonomous scientific experimentation is currently limited by LLMs' struggle to grasp the strict procedural logic and accuracy required by biological protocols. To address this fundamental challenge, we present \textbf{BioProBench}, a comprehensive resource for procedural reasoning in biology. BioProBench is grounded in \textbf{BioProCorpus}, a foundational collection of 27,000 human-written protocols. From this corpus, we systematically constructed a dataset of over 550,000 task instances, offering both a large-scale training resource and a rigorous benchmark with novel metrics. Evaluating 10 mainstream LLMs, we find that while general comprehension is high, performance drops significantly on tasks demanding deep reasoning, quantitative precision, and safety awareness. To demonstrate the value of BioProCorpus in mitigating these issues, we developed \textbf{ProAgent}, an agent grounded in our corpus that substantially advances the state of the art. BioProBench thus provides both a rigorous diagnostic benchmark and a foundational resource for developing the next generation of reliable scientific AI. Code and data are available at: https://github.com/YuyangSunshine/bioprotocolbench and https://huggingface.co/datasets/BioProBench/BioProBench.