DOGMA: Weaving Structural Information into Data-centric Single-cell Transcriptomics Analysis

Data-centric AI has recently become a dominant paradigm in single-cell transcriptomics analysis; it treats data representation, rather than model complexity, as the fundamental bottleneck. Among existing studies, early sequence-based methods treat cells as independent entities and apply prevalent ML models directly to the raw sequencing data. Although simple and intuitive, these methods overlook both the latent intercellular relationships driven by the functional mechanisms of biological systems and the inherent quality issues of raw sequencing data. A series of structured methods has therefore emerged. These methods employ various heuristic rules to capture intricate intercellular relationships and enhance the raw sequencing data, but they often neglect biological prior knowledge. This omission incurs substantial overhead and yields suboptimal graph representations, which limits the utility of ML models. To address these limitations, we propose DOGMA, a holistic data-centric framework that structurally reshapes and semantically enhances raw data using multi-level biological prior knowledge. Rather than relying on stochastic heuristics, DOGMA redefines graph construction by integrating Statistical Anchors with Cell Ontology and Phylogenetic Trees, enabling deterministic structure discovery and robust cross-species alignment. In addition, it uses Gene Ontology to bridge the feature-level semantic gap by incorporating functional priors. On complex multi-species and multi-organ benchmarks, DOGMA achieves state-of-the-art performance, with superior zero-shot robustness and sample efficiency at significantly lower computational cost.
