With the rapid development of large language models (LLMs), their application to cell type annotation has drawn increasing attention. However, general-purpose LLMs are often limited in this task because they lack guidance from external domain knowledge. To enable more accurate and fully automated cell type annotation, we develop a globally connected knowledge graph comprising 18,850 biological information nodes (cell types, gene markers, features, and other related entities) and 48,944 edges connecting them. LLMs use this graph to retrieve entities associated with differential genes for cell reconstruction. In addition, we design a multi-task reasoning workflow to optimise the annotation process. Compared to general-purpose LLMs, our method improves human evaluation scores by up to 0.21 and semantic similarity by 6.1% across multiple tissue types, while aligning more closely with the cognitive logic of manual annotation. It also narrows the performance gap between large and small LLMs in cell type annotation, offering a paradigm for structured knowledge integration and reasoning in bioinformatics.
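The retrieval step described above, in which entities linked to differential genes are looked up in the knowledge graph, can be illustrated with a minimal sketch. This is not the authors' implementation: the toy edge list, the `marker_of` relation name, and the functions `build_index` and `retrieve_candidates` are hypothetical and chosen for illustration only. The real graph holds 18,850 nodes and 48,944 edges and includes entity types beyond gene markers and cell types.

```python
from collections import Counter

# Toy edge list (hypothetical, for illustration only):
# (gene marker, relation, cell type)
TOY_EDGES = [
    ("CD3E", "marker_of", "T cell"),
    ("CD3D", "marker_of", "T cell"),
    ("MS4A1", "marker_of", "B cell"),
    ("CD79A", "marker_of", "B cell"),
    ("CD14", "marker_of", "Monocyte"),
    ("LYZ", "marker_of", "Monocyte"),
]


def build_index(edges):
    """Map each gene marker to the set of cell types it is linked to."""
    index = {}
    for gene, relation, cell_type in edges:
        if relation == "marker_of":
            index.setdefault(gene, set()).add(cell_type)
    return index


def retrieve_candidates(diff_genes, index):
    """Rank cell types by how many distinct differential genes point to them."""
    counts = Counter()
    # Deduplicate and normalise case so each gene is counted once.
    for gene in {g.upper() for g in diff_genes}:
        for cell_type in index.get(gene, ()):
            counts[cell_type] += 1
    # Sort by hit count (descending), then by name for a stable order.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


index = build_index(TOY_EDGES)
candidates = retrieve_candidates(["CD3E", "CD3D", "LYZ", "GAPDH"], index)
print(candidates)
```

In a workflow like the one described, the ranked candidates would be passed to the LLM as retrieved context instead of being treated as the final annotation. Genes absent from the graph, such as `GAPDH` in this toy example, simply contribute no candidates.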