Recent advances in single-cell multi-omics, particularly RNA-seq, have provided profound insights into cellular heterogeneity and gene regulation. While single-cell foundation models built on the pre-trained language model (PLM) paradigm have shown promise, they remain constrained by insufficient integration of in-depth individual profiles and by neglect of the noise present in multi-modal data. To address both issues, we propose an Open-world Language Knowledge-Aided Robust Single-Cell Foundation Model (OKR-CELL). It is built on a cross-modal Cell-Language pre-training framework with two key innovations: (1) a Large Language Model (LLM)-based workflow with retrieval-augmented generation (RAG) that enriches cell textual descriptions using open-world knowledge; and (2) a Cross-modal Robust Alignment (CRA) objective that combines sample reliability assessment, curriculum learning, and coupled momentum contrastive learning to strengthen the model's resistance to noisy data. After pre-training on 32M cell-text pairs, OKR-CELL achieves state-of-the-art results across 6 evaluation tasks. Beyond standard benchmarks such as cell clustering, cell-type annotation, batch-effect correction, and few-shot annotation, the model also delivers superior performance in broader multi-modal applications, including zero-shot cell-type annotation and bidirectional cell-text retrieval.
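To make the CRA objective concrete, the following is a minimal sketch, assuming PyTorch, of how a reliability-weighted, curriculum-gated momentum contrastive loss over cell-text pairs could look. The function names, the top-k curriculum gate, and the per-pair weighting scheme (`reliability`, `keep_frac`) are illustrative assumptions, not the paper's actual CRA implementation.

```python
# Illustrative sketch only: the weighting, curriculum gate, and names below
# are assumptions, not OKR-CELL's released code.
import torch
import torch.nn.functional as F

@torch.no_grad()
def momentum_update(encoder, momentum_encoder, m=0.999):
    # EMA update of the momentum (key) encoder, MoCo-style.
    for p, p_m in zip(encoder.parameters(), momentum_encoder.parameters()):
        p_m.data.mul_(m).add_(p.data, alpha=1 - m)

def cra_style_loss(cell_emb, text_emb_momentum, reliability, keep_frac, tau=0.07):
    """Reliability-weighted symmetric InfoNCE over a batch of cell-text pairs.

    cell_emb:          (B, D) embeddings from the cell encoder (grad flows).
    text_emb_momentum: (B, D) embeddings from the momentum text encoder.
    reliability:       (B,)   per-pair reliability scores in [0, 1]
                              (hypothetical output of a reliability assessor).
    keep_frac:         fraction of the most reliable pairs used as anchors;
                       a curriculum schedule raises it toward 1.0 over training.
    """
    cell = F.normalize(cell_emb, dim=-1)
    text = F.normalize(text_emb_momentum, dim=-1)
    logits = cell @ text.t() / tau                      # (B, B) pairwise similarities
    targets = torch.arange(logits.size(0), device=logits.device)

    # Curriculum gate: keep only the top-`keep_frac` most reliable pairs.
    k = max(1, int(keep_frac * logits.size(0)))
    keep = torch.topk(reliability, k).indices
    w = reliability[keep]                               # down-weight noisier pairs

    # Symmetric cell-to-text and text-to-cell contrastive terms.
    loss_c2t = F.cross_entropy(logits[keep], targets[keep], reduction="none")
    loss_t2c = F.cross_entropy(logits.t()[keep], targets[keep], reduction="none")
    return ((loss_c2t + loss_t2c) * w).sum() / (2 * w.sum())
```

Under these assumptions, raising `keep_frac` over epochs realizes the curriculum (easy, reliable pairs first), while the momentum text encoder supplies slowly-evolving targets that stabilize alignment against noisy descriptions.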