Clustering is fundamental to scRNA-seq analysis: it underpins the identification of cell populations and the resolution of tissue heterogeneity. However, existing methods focus on mining numerical statistical patterns and are therefore semantically agnostic, neglecting the biological functions that genes encode. Large Language Models (LLMs) offer promising semantic capabilities, but their direct adaptation to cell clustering is hindered by the structural mismatch between generative pre-training objectives and discriminative downstream tasks. To bridge this gap, we propose scLLM-DSC, a novel LLM-knowledge-enhanced cross-modal deep structural clustering framework. Departing from purely data-driven paradigms, scLLM-DSC builds a semantically grounded representation by combining two views: a Knowledge-Driven Semantic View derived from NCBI gene priors and contextualized Cell2Sentence embeddings, and a Structure-Aware Topological View extracted by a graph-guided encoder. Crucially, we introduce a cross-modal contrastive alignment mechanism that enforces consistency between biological semantics and transcriptomic features within a unified latent space. Extensive benchmarks demonstrate that scLLM-DSC significantly outperforms eleven state-of-the-art baselines in clustering accuracy.
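As a rough illustration of what a cross-modal contrastive alignment between the two views could look like, the sketch below computes a symmetric InfoNCE-style loss over paired per-cell embeddings. The abstract does not specify the loss, so the InfoNCE form, the temperature value, and every name here (`cross_modal_infonce`, `z_sem`, `z_top`) are assumptions made for illustration, not the scLLM-DSC implementation. It is written in dependency-free Python for readability; a practical version would use a tensor library.

```python
import math
import random


def l2_normalize(v):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def log_softmax_at(row, idx):
    """Numerically stable log-softmax of `row`, evaluated at position `idx`."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return row[idx] - log_z


def cross_modal_infonce(z_sem, z_top, temperature=0.1):
    """Symmetric InfoNCE loss between two views of the same n cells.

    z_sem: hypothetical semantic-view embeddings, one vector per cell.
    z_top: hypothetical topological-view embeddings, in the same cell order.
    Row i of each view is treated as the positive pair; every other
    cell in the batch serves as a negative.
    """
    z_sem = [l2_normalize(v) for v in z_sem]
    z_top = [l2_normalize(v) for v in z_top]
    n = len(z_sem)
    # Temperature-scaled cosine similarity between every cross-view pair.
    logits = [[sum(a * b for a, b in zip(s, t)) / temperature for t in z_top]
              for s in z_sem]
    # Semantic -> topological direction: softmax over each row.
    sem_to_top = -sum(log_softmax_at(logits[i], i) for i in range(n)) / n
    # Topological -> semantic direction: softmax over each column.
    cols = [[logits[i][j] for i in range(n)] for j in range(n)]
    top_to_sem = -sum(log_softmax_at(cols[j], j) for j in range(n)) / n
    return 0.5 * (sem_to_top + top_to_sem)


# Toy data: 8 cells with 16-dimensional embeddings (assumed sizes).
random.seed(0)
z_top = [[random.gauss(0, 1) for _ in range(16)] for _ in range(8)]
# An "aligned" semantic view: the same vectors plus small noise.
z_sem_aligned = [[x + 0.01 * random.gauss(0, 1) for x in v] for v in z_top]
# A "misaligned" semantic view: cells shifted by one position.
z_sem_shuffled = z_top[1:] + z_top[:1]

aligned_loss = cross_modal_infonce(z_sem_aligned, z_top)
shuffled_loss = cross_modal_infonce(z_sem_shuffled, z_top)
print(f"aligned: {aligned_loss:.4f}  shuffled: {shuffled_loss:.4f}")
```

In this toy setup the loss is low when each cell's two embeddings agree and high when the pairing is broken, which is the consistency pressure that an alignment term of this kind places on a shared latent space.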