There is widespread optimism that frontier Large Language Models (LLMs) and LLM-augmented systems have the potential to rapidly accelerate scientific discovery across disciplines. Today, many benchmarks exist to measure LLM knowledge and reasoning on textbook-style science questions, but few, if any, are designed to evaluate language model performance on the practical tasks required for scientific research, such as literature search, protocol planning, and data analysis. As a step toward building such benchmarks, we introduce the Language Agent Biology Benchmark (LAB-Bench), a broad dataset of over 2,400 multiple-choice questions for evaluating AI systems on a range of practical biology research capabilities, including recall and reasoning over literature, interpretation of figures, access and navigation of databases, and comprehension and manipulation of DNA and protein sequences. Importantly, in contrast to previous scientific benchmarks, we expect that an AI system achieving consistently high scores on the more difficult LAB-Bench tasks would serve as a useful assistant to researchers in areas such as literature search and molecular cloning. As an initial assessment of the emergent scientific task capabilities of frontier language models, we measure the performance of several models against our benchmark and report results compared to those of human expert biology researchers. We will continue to update and expand LAB-Bench over time, and expect it to serve as a useful tool in the development of automated research systems going forward. A public subset of LAB-Bench is available for use at the following URL: https://huggingface.co/datasets/futurehouse/lab-bench