Benchmarking AI scientists for omics data-driven biological discovery

Recent advances in large language models have given rise to AI scientists, systems that aim to autonomously analyze biological data and assist scientific discovery. Despite rapid progress, it remains unclear to what extent these systems can extract meaningful biological insights from real experimental data. Existing benchmarks either evaluate reasoning in the absence of data or focus on predefined analytical outputs, and so fail to reflect realistic, data-driven biological research. Here, we introduce BAISBench (Biological AI Scientist Benchmark), a benchmark for evaluating AI scientists on real single-cell transcriptomic datasets. BAISBench comprises two tasks: cell type annotation across 15 expert-labeled datasets, and scientific discovery through 193 multiple-choice questions derived from biological conclusions reported in 41 published single-cell studies. We evaluated several representative AI scientists using BAISBench and, to provide a human performance baseline, invited six graduate-level bioinformaticians to collectively complete the same tasks. The results show that while current AI scientists fall short of fully autonomous biological discovery, they already demonstrate substantial potential in supporting data-driven biological research. These results position BAISBench as a practical benchmark for characterizing the current capabilities and limitations of AI scientists in biological research. We expect it to guide the development of more capable AI scientists and to help biologists identify AI systems that can effectively support real-world research workflows. BAISBench is available at https://github.com/EperLuo/BAISBench and https://huggingface.co/datasets/EperLuo/BaisBench.
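
The abstract does not state how answers to the multiple-choice questions are scored. As a minimal sketch only, assuming plain accuracy with exact matching on option letters, a scorer for the scientific discovery task might look like the following. The function names, question IDs, and answer format are hypothetical and are not taken from the BAISBench repository.

```python
from typing import FrozenSet, Mapping


def normalize(answer: str) -> FrozenSet[str]:
    """Reduce an answer such as 'b', 'B.' or 'A, C' to a set of option letters.

    Assumption: answers contain option letters only, not free text.
    """
    return frozenset(ch for ch in answer.upper() if ch.isalpha())


def mcq_accuracy(predictions: Mapping[str, str],
                 answer_key: Mapping[str, str]) -> float:
    """Fraction of questions in answer_key that were answered exactly right.

    Questions missing from predictions count as incorrect (an assumption).
    """
    if not answer_key:
        raise ValueError("answer_key must contain at least one question")
    correct = sum(
        1
        for qid, gold in answer_key.items()
        if qid in predictions and normalize(predictions[qid]) == normalize(gold)
    )
    return correct / len(answer_key)


if __name__ == "__main__":
    # Hypothetical question IDs and answers, for illustration only.
    key = {"q1": "A", "q2": "C", "q3": "B"}
    preds = {"q1": "a", "q2": "D"}  # q3 left unanswered
    print(f"accuracy = {mcq_accuracy(preds, key):.3f}")
```

Counting unanswered questions as incorrect keeps scores comparable between systems that abstain and systems that always answer; a benchmark could reasonably choose otherwise.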
