LangCell: Language-Cell Pre-training for Cell Identity Understanding

Cell identity encompasses various semantic aspects of a cell, including cell type, pathway information, and disease information, which are essential for biologists to gain insight into its biological characteristics. Understanding cell identity from transcriptomic data, for example by annotating cell types, has become an important task in bioinformatics. Because these semantic aspects are defined by human experts, AI models cannot effectively carry out cell identity understanding tasks without the supervision signals provided by single-cell and label pairs. The single-cell pre-trained language models (PLMs) currently used for this task are trained on a single modality, transcriptomic data, and therefore lack an understanding of cell identity knowledge. As a result, they have to be fine-tuned for downstream tasks and struggle when labeled data with the desired semantic labels is scarce. To address this issue, we propose constructing a unified representation of single-cell data and natural language during the pre-training phase, allowing the model to directly incorporate insights related to cell identity. More specifically, we introduce LangCell, the first Language-Cell pre-training framework. LangCell uses texts enriched with cell identity information to learn cross-modal knowledge. Experiments on different benchmarks show that LangCell is the only single-cell PLM that works effectively in zero-shot cell identity understanding scenarios, and that it also significantly outperforms existing models in few-shot and fine-tuning scenarios.
