Large language models (LLMs) are increasingly applied in scientific research, offering new capabilities for knowledge discovery and reasoning. In single-cell biology, however, evaluation practices for both general-purpose and domain-specialized LLMs remain inadequate: existing benchmarks are fragmented across tasks, adopt formats such as multiple-choice classification that diverge from real-world usage, and rely on metrics that lack interpretability and biological grounding. We present SC-ARENA, a natural-language evaluation framework tailored to single-cell foundation models. SC-ARENA formalizes a virtual-cell abstraction that unifies evaluation targets by representing both intrinsic cellular attributes and gene-level interactions. Within this paradigm, we define five natural-language tasks (cell type annotation, captioning, generation, perturbation prediction, and scientific QA) that probe core reasoning capabilities in cellular biology. To overcome the limitations of brittle string-matching metrics, we introduce knowledge-augmented evaluation, which incorporates external ontologies, marker-gene databases, and scientific literature to support biologically faithful and interpretable judgments. Experiments and analysis across both general-purpose and domain-specialized LLMs demonstrate that (i) under the unified virtual-cell evaluation paradigm, current models achieve uneven performance on biologically complex tasks, particularly those demanding mechanistic or causal understanding; and (ii) our knowledge-augmented evaluation framework ensures biological correctness, provides interpretable, evidence-grounded rationales, and achieves high discriminative capacity, overcoming the brittleness and opacity of conventional metrics. SC-ARENA thus provides a unified and interpretable framework for assessing LLMs in single-cell biology, pointing toward the development of biology-aligned, generalizable foundation models.