Tensor factorizations (TF) are powerful tools for the efficient representation and analysis of multidimensional data. However, classic TF methods based on maximum likelihood estimation underperform when applied to zero-inflated count data, such as single-cell RNA sequencing (scRNA-seq) data. Additionally, the stochasticity inherent in TFs results in factors that vary across repeated runs, making interpretation and reproducibility of the results challenging. In this paper, we introduce Zero Inflated Poisson Tensor Factorization (ZIPTF), a novel approach for the factorization of high-dimensional count data with excess zeros. To address the challenge of stochasticity, we introduce Consensus Zero Inflated Poisson Tensor Factorization (C-ZIPTF), which combines ZIPTF with a consensus-based meta-analysis. We evaluate our proposed ZIPTF and C-ZIPTF on synthetic zero-inflated count data and synthetic and real scRNA-seq data. ZIPTF consistently outperforms baseline matrix and tensor factorization methods in terms of reconstruction accuracy for zero-inflated data. When the probability of excess zeros is high, ZIPTF achieves up to $2.4\times$ better accuracy. Additionally, C-ZIPTF significantly improves the consistency and accuracy of the factorization. When tested on both synthetic and real scRNA-seq data, ZIPTF and C-ZIPTF consistently recover known and biologically meaningful gene expression programs.
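To make the zero-inflation issue concrete, the following is a minimal sketch of the zero-inflated Poisson (ZIP) distribution itself, not of the ZIPTF algorithm. It assumes the standard ZIP definition: with probability `pi_zero` an observation is an excess zero, otherwise it is drawn from a Poisson with mean `rate`. The function names, the parameter values, and the sample size are illustrative assumptions and do not come from the paper. The sketch shows why a plain Poisson maximum likelihood fit is biased on such data: its rate estimate is the sample mean, which is pulled toward `(1 - pi_zero) * rate`.

```python
import math
import random


def zip_pmf(k, rate, pi_zero):
    """P(X = k) under a zero-inflated Poisson (standard definition, assumed here)."""
    poisson = math.exp(-rate) * rate**k / math.factorial(k)
    # Excess-zero mass is added only at k == 0.
    return pi_zero * (k == 0) + (1.0 - pi_zero) * poisson


def sample_poisson(rate, rng):
    """Draw one Poisson variate with Knuth's multiplication method (small rates only)."""
    threshold = math.exp(-rate)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k


def sample_zip(rate, pi_zero, rng):
    """Draw one ZIP variate: an excess zero with probability pi_zero, else Poisson."""
    return 0 if rng.random() < pi_zero else sample_poisson(rate, rng)


# Illustrative (hypothetical) settings: a high excess-zero probability.
rng = random.Random(0)
true_rate, pi_zero = 4.0, 0.6
counts = [sample_zip(true_rate, pi_zero, rng) for _ in range(20000)]

# The Poisson MLE of the rate is the sample mean; on ZIP data it
# concentrates near (1 - pi_zero) * true_rate instead of true_rate.
poisson_mle = sum(counts) / len(counts)
zero_fraction = counts.count(0) / len(counts)

print(f"true rate: {true_rate}, plain Poisson MLE: {poisson_mle:.3f}")
print(f"observed zero fraction: {zero_fraction:.3f}, "
      f"ZIP-predicted: {zip_pmf(0, true_rate, pi_zero):.3f}")
```

A model that estimates the excess-zero probability separately from the Poisson rate avoids this bias, which is the motivation the abstract gives for a zero-inflated likelihood.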