The rapid evolution of large language models (LLMs) has expanded their capabilities from basic dialogue to advanced scientific reasoning. However, existing benchmarks in biology often fail to assess a critical skill required of researchers: the ability to integrate experimental results with contextual knowledge to derive meaningful conclusions. To address this gap, we introduce BABE (Biology Arena BEnchmark), a comprehensive benchmark designed to evaluate the experimental reasoning capabilities of biological AI systems. BABE is uniquely constructed from peer-reviewed research papers and real-world biological studies, ensuring that its tasks reflect the complexity and interdisciplinary nature of actual scientific inquiry, and it challenges models to perform causal reasoning and cross-scale inference. Our benchmark provides a robust framework for assessing how well AI systems can reason like practicing scientists, offering a more authentic measure of their potential to contribute to biological research.