Single-cell gene expression data typically take the form of large matrices in which the number of cells may be lower than the number of genes of interest. Factorization models have emerged as powerful tools for condensing the available information through a sparse decomposition into lower-rank matrices. In this work, we adapt a recent class of Bayesian generalized factor models to count data and implement it, with the specific aim of modelling the covariance between genes. The methodology also allows exogenous information to be included in the prior, so that the recovery of covariance structures between genes is favoured. Here, we use biological pathways as external information to induce sparsity patterns within the loadings matrix. This approach facilitates the interpretation of the columns of the loadings matrix and of the corresponding latent factors, which can be regarded as unobserved cell covariates. We demonstrate the effectiveness of our model on single-cell RNA sequencing data obtained from lung adenocarcinoma cell lines, where it yields promising insights into the role of pathways in characterizing gene relationships and in extracting information about unobserved cell traits.
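To make the structure described above concrete, the following is a minimal simulation sketch, not the authors' implementation. It assumes a Poisson likelihood with a log link, standard normal latent factors, and a hard binary pathway mask on the loadings; the abstract specifies none of these, and the actual model induces sparsity through a prior rather than by fixing entries to zero. The function name `simulate_pathway_factor_counts` and the toy mask are hypothetical.

```python
import numpy as np


def simulate_pathway_factor_counts(pathway_mask, n_cells, seed=0):
    """Simulate counts from a toy pathway-masked factor model.

    pathway_mask: (n_genes, n_factors) binary array; entry (g, k) is 1
    when gene g belongs to the pathway attached to factor k.
    Assumptions (not from the source): Poisson likelihood, log link,
    hard zeros outside each pathway.
    """
    rng = np.random.default_rng(seed)
    n_genes, n_factors = pathway_mask.shape

    # Loadings are non-zero only where the gene belongs to the pathway.
    loadings = rng.normal(0.0, 0.5, size=(n_genes, n_factors)) * pathway_mask

    # Latent factors act as unobserved cell covariates.
    factors = rng.normal(size=(n_cells, n_factors))

    # Log-rates are linear in the factors; counts are Poisson draws.
    log_rate = factors @ loadings.T  # shape (n_cells, n_genes)
    counts = rng.poisson(np.exp(log_rate))
    return counts, loadings, factors


# Toy example: 5 genes, 2 pathways, gene 2 shared by both pathways.
pathway_mask = np.array([[1, 0], [1, 0], [1, 1], [0, 1], [0, 1]])
counts, loadings, factors = simulate_pathway_factor_counts(pathway_mask, n_cells=4)

# With standard normal factors, the gene-gene covariance of the
# log-rates is loadings @ loadings.T, so genes that share no pathway
# have zero covariance on that scale.
gene_cov = loadings @ loadings.T
```

In this sketch each column of `loadings` can be read as one pathway's signature, and the number of cells (4) is deliberately smaller than the number of genes (5) to mirror the setting described above.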