SciHorizon-GENE: Benchmarking LLM for Life Sciences Inference from Gene Knowledge to Functional Understanding

Large language models (LLMs) have shown growing promise in biomedical research, particularly for knowledge-driven interpretation tasks. However, their ability to reliably reason from gene-level knowledge to functional understanding, a core requirement for knowledge-enhanced cell atlas interpretation, remains largely underexplored. To address this gap, we introduce SciHorizon-GENE, a large-scale gene-centric benchmark constructed from authoritative biological databases. The benchmark integrates curated knowledge for over 190K human genes and comprises more than 540K questions covering diverse gene-to-function reasoning scenarios relevant to cell type annotation, functional interpretation, and mechanism-oriented analysis. Motivated by behavioral patterns observed in preliminary examinations, SciHorizon-GENE evaluates LLMs along four biologically critical perspectives: research attention sensitivity, hallucination tendency, answer completeness, and literature influence, explicitly targeting failure modes that limit the safe adoption of LLMs in biological interpretation pipelines. We systematically evaluate a wide range of state-of-the-art general-purpose and biomedical LLMs, revealing substantial heterogeneity in gene-level reasoning capabilities and persistent challenges in generating faithful, complete, and literature-grounded functional interpretations. Our benchmark establishes a systematic foundation for analyzing LLM behavior at the gene scale and offers insights for model selection and development, with direct relevance to knowledge-enhanced biological interpretation.
