Single-cell RNA sequencing (scRNA-seq) data exhibit strong and reproducible statistical structure. This has motivated the development of large-scale foundation models, such as TranscriptFormer, that use transformer-based architectures to learn a generative model of gene expression by embedding genes into a latent vector space. These embeddings have been used to obtain state-of-the-art (SOTA) performance on downstream tasks such as cell-type classification, disease-state prediction, and cross-species learning. Here, we ask whether similar performance can be achieved without computationally intensive deep learning-based representations. Using simple, interpretable pipelines that rely on careful normalization and linear methods, we obtain SOTA or near-SOTA performance across multiple benchmarks commonly used to evaluate single-cell foundation models, including outperforming foundation models on out-of-distribution tasks involving novel cell types and organisms absent from the training data. Our findings highlight the need for rigorous benchmarking and suggest that the biology of cell identity can be captured by simple linear representations of single-cell gene expression data.
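The kind of simple, interpretable pipeline the abstract describes can be sketched as library-size normalization, a log1p transform, PCA, and a linear classifier. The sketch below is illustrative only, not the authors' exact method; the synthetic count matrix, the hypothetical cell-type labels, and all parameter choices (target sum, number of components) are assumptions for the example.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

rng = np.random.default_rng(0)

# Synthetic stand-in for a cells x genes count matrix; in practice this
# would come from an scRNA-seq experiment. Labels are hypothetical cell types.
n_cells, n_genes, n_types = 300, 100, 3
labels = rng.integers(0, n_types, size=n_cells)
type_means = rng.gamma(2.0, 1.0, size=(n_types, n_genes))
depth = rng.uniform(0.5, 2.0, size=(n_cells, 1))  # per-cell sequencing depth
counts = rng.poisson(type_means[labels] * depth)

def lognorm(X, target_sum=1e4):
    """Library-size normalization to a common total, then log1p."""
    size = X.sum(axis=1, keepdims=True)
    return np.log1p(X / size * target_sum)

# Normalization -> linear dimensionality reduction -> linear classifier.
pipe = Pipeline([
    ("norm", FunctionTransformer(lognorm)),
    ("pca", PCA(n_components=20, random_state=0)),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipe.fit(counts, labels)
acc = pipe.score(counts, labels)
print(f"training accuracy: {acc:.2f}")
```

Every step here is linear or a fixed elementwise transform, so the learned cell representation (the PCA scores) remains directly interpretable in terms of gene loadings, in contrast to transformer embeddings.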