We propose a unified probabilistic framework for sparse count tensors with excess zeros, motivated by single-cell Hi-C data. The observed data are naturally represented as a three-way tensor indexed by pairs of genomic loci and by cells, and they exhibit pronounced sparsity, zero inflation, and cell-to-cell heterogeneity. We introduce a zero-inflated Poisson tensor model that integrates low-rank CP structure, cluster-specific latent embeddings, and smoothness along ordered genomic loci, thereby jointly capturing multiway dependence, heterogeneity, and structured variation. We develop a Bayes-optimal procedure for distinguishing structural from technical zeros, enabling principled inference and uncertainty quantification. We establish identifiability of the model parameters and derive consistency rates for the proposed estimators in a high-dimensional regime. Simulation studies and analyses of single-cell Hi-C data demonstrate improved performance in zero detection, latent structure recovery, and downstream tasks such as clustering and 3D chromatin organization inference. The proposed framework provides a flexible approach for multiway count data with excess zeros and structured dependencies, and suggests several directions for future work, including mixture-based modeling of cell populations and scalable computation for large-scale applications.
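To make the zero-classification idea concrete, the following is a minimal sketch of a generic zero-inflated Poisson computation on a CP-structured rate tensor. It is not the estimator described above: the log link, the single scalar mixing weight `pi`, and the helper names `cp_rate` and `prob_inflated_zero` are all assumptions introduced for illustration, and the abstract does not state which mixture component corresponds to structural versus technical zeros.

```python
import numpy as np


def cp_rate(A, B, C):
    """Poisson rate tensor from rank-R CP factors (log link is an assumption).

    lam[i, j, k] = exp(sum_r A[i, r] * B[j, r] * C[k, r])
    """
    return np.exp(np.einsum("ir,jr,kr->ijk", A, B, C))


def prob_inflated_zero(y, lam, pi):
    """Posterior probability that each entry is a zero from the inflation component.

    Under a standard ZIP mixture, P(y = 0) = pi + (1 - pi) * exp(-lam), so for an
    observed zero the posterior weight of the point mass at zero is
    pi / (pi + (1 - pi) * exp(-lam)). Nonzero counts cannot come from the
    point mass, so their probability is 0.
    """
    y = np.asarray(y)
    post = pi / (pi + (1.0 - pi) * np.exp(-lam))
    return np.where(y == 0, post, 0.0)


# Hypothetical toy example: 2 x 3 loci, 4 cells, rank 1, all factors zero,
# so every rate equals exp(0) = 1.
lam = cp_rate(np.zeros((2, 1)), np.zeros((3, 1)), np.zeros((4, 1)))
y = np.zeros((2, 3, 4), dtype=int)
y[0, 0, 0] = 3  # one observed contact
p = prob_inflated_zero(y, lam, pi=0.5)
```

Under 0-1 loss, thresholding this posterior probability at 0.5 gives the Bayes-optimal label for each zero when the parameters are known; in practice `lam` and `pi` would be replaced by estimates.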