Multi-cellular programs (MCPs) are coordinated patterns of gene expression across interacting cell types that collectively drive complex biological processes such as tissue development and immune responses. MCPs are typically estimated from high-dimensional gene expression data using methods such as sparse principal component analysis or latent factor models, but these approaches often suffer from high computational cost and limited statistical power. In this work, we propose Sparse Group Principal Component Analysis (SGPCA), which estimates MCPs by exploiting their inherent group-level and individual-level sparsity. We introduce an efficient double-thresholding algorithm based on power iteration. In each iteration, a group thresholding step first identifies the relevant gene groups, and an individual thresholding step then selects the active cell types within them. The algorithm has linear computational complexity, $O(np)$, which makes it scalable to large-scale genomic analyses. We establish theoretical guarantees for SGPCA, including statistical consistency and a convergence rate faster than those of competing methods. Through extensive simulations, we demonstrate that SGPCA achieves superior estimation accuracy and improved statistical power for signal detection. Finally, we apply SGPCA to a lupus study and discover differentially expressed MCPs that distinguish lupus patients from healthy subjects.
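The double-thresholding power iteration described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `sgpca_sketch`, the use of hard thresholds, the threshold parameters `group_thresh` and `indiv_thresh`, the largest-variance initialization, and the stopping rule are all assumptions made for illustration. The sketch also assumes that `X` is an $n \times p$ column-centered matrix whose $p$ features are (gene, cell type) pairs, with `groups[j]` giving the gene group of feature `j`.

```python
import numpy as np

def sgpca_sketch(X, groups, group_thresh, indiv_thresh, n_iter=100, tol=1e-8):
    """Illustrative leading sparse loading vector via double-thresholded power iteration.

    Assumptions (not taken from the paper): hard thresholding, fixed
    thresholds applied to the unnormalized iterate, and initialization at
    the single feature with the largest sample variance.
    """
    n, p = X.shape
    groups = np.asarray(groups)
    n_groups = int(groups.max()) + 1

    # Hypothetical initialization: unit vector on the highest-variance feature.
    v = np.zeros(p)
    v[np.argmax((X ** 2).sum(axis=0))] = 1.0

    for _ in range(n_iter):
        # Power step with two matrix-vector products: O(np) work,
        # and the p x p covariance matrix is never formed.
        w = X.T @ (X @ v) / n

        # Group thresholding: zero out every group whose Euclidean norm is small.
        group_norms = np.sqrt(np.bincount(groups, weights=w ** 2, minlength=n_groups))
        keep_group = group_norms >= group_thresh
        w[~keep_group[groups]] = 0.0

        # Individual thresholding: zero out small entries inside surviving groups.
        w[np.abs(w) < indiv_thresh] = 0.0

        norm = np.linalg.norm(w)
        if norm == 0.0:
            raise ValueError("thresholds removed every coordinate; lower them")
        w /= norm

        # Stop once the iterate no longer moves.
        converged = np.linalg.norm(w - v) < tol
        v = w
        if converged:
            break
    return v
```

Each pass costs two matrix-vector products plus linear-time thresholding, which is consistent with the $O(np)$ scaling stated above. In practice the two thresholds would have to be chosen from the data or from theory; the fixed values in this sketch are placeholders.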