AssayBench: An Assay-Level Virtual Cell Benchmark for LLMs and Agents

Edward De Brouwer, Carl Edwards, Alexander Wu, Jenna Collier, Graham Heimberg, Xiner Li, Meena Subramaniam, Ehsan Hajiramezanali, David Richmond, Jan-Christian Hütter, Sara Mostafavi, Gabriele Scalia

From arXiv, 22 pages

Recent advances in machine learning and large-scale biological data collections have revived the prospect of building a virtual cell, a computational model of cellular behavior that could accelerate biological discovery. One of the most compelling promises of this vision is the ability to perform in silico phenotypic screens, in which a model predicts the effects of cellular perturbations in unseen biological contexts. This task combines heterogeneous textual inputs with diverse phenotypic outputs, making it particularly well-suited to LLMs and agentic systems. Yet no standard benchmark currently exists for this task, as existing efforts focus on narrower molecular readouts that are only indirectly aligned with the phenotypic endpoints driving many real-world drug discovery workflows. In this work, we present AssayBench, a benchmark for phenotypic screen prediction, built from 1,920 publicly available CRISPR screens spanning five broad classes of cellular phenotypes. We formulate screen prediction as a gene ranking task for each screen and introduce the adjusted nDCG, a continuous metric for comparing performance across heterogeneous assays. Our extensive evaluation shows that existing methods remain far from empirically estimated performance ceilings and that zero-shot generalist LLMs outperform biology-specific LLMs and trainable baselines. Optimization techniques such as fine-tuning, ensembling, and prompt optimization can further improve LLM performance on this task. Overall, AssayBench offers a practical testbed for measuring progress toward in silico phenotypic screening and, more broadly, virtual cell models.
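
To make the ranking formulation concrete, here is a minimal sketch of scoring a predicted gene ranking with standard nDCG, plus one possible chance correction. The abstract does not define the adjusted nDCG, so `chance_adjusted_ndcg` is an assumption for illustration only (it rescales nDCG against the average score of random rankings) and should not be read as the paper's metric. All function names and the gene relevance values in the usage note are hypothetical.

```python
import math
import random


def dcg(relevances):
    """Discounted cumulative gain of relevance scores listed in rank order."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))


def ndcg(ranked_genes, relevance, k=None):
    """Standard nDCG@k for a ranked gene list, given a gene -> relevance map."""
    k = len(ranked_genes) if k is None else k
    gains = [relevance.get(gene, 0.0) for gene in ranked_genes[:k]]
    ideal = sorted(relevance.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


def chance_adjusted_ndcg(ranked_genes, relevance, k=None, n_random=200, seed=0):
    """ASSUMED adjustment, not the paper's definition.

    Rescales nDCG so that a random ranking scores about 0 and a perfect
    ranking scores 1, using a Monte Carlo estimate of the random baseline.
    """
    rng = random.Random(seed)
    shuffled = list(ranked_genes)
    baseline = 0.0
    for _ in range(n_random):
        rng.shuffle(shuffled)
        baseline += ndcg(shuffled, relevance, k)
    baseline /= n_random
    raw = ndcg(ranked_genes, relevance, k)
    return (raw - baseline) / (1.0 - baseline) if baseline < 1.0 else 0.0
```

With a made-up relevance map such as `{"TP53": 3.0, "MYC": 2.0, "KRAS": 1.0, "GAPDH": 0.0, "ACTB": 0.0}`, calling `chance_adjusted_ndcg(["TP53", "MYC", "KRAS", "GAPDH", "ACTB"], relevance)` scores the ideal ordering, while the reversed list falls below the random baseline. A correction of this kind is one way to keep scores comparable when screens differ in how many hits they contain, which is the motivation the abstract gives for a metric usable across heterogeneous assays.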
