Existing Poisson factor models typically assume that the factors are unknown, which overlooks the explanatory potential of observable covariates. This study focuses on high-dimensional settings, where the number of count response variables and/or covariates can diverge as the sample size increases. We propose a covariate-augmented overdispersed Poisson factor model that jointly performs high-dimensional Poisson factor analysis and estimates a large coefficient matrix for overdispersed count data. A set of identifiability conditions is provided to theoretically guarantee computational identifiability. We account for the interdependence among response variables and covariates by imposing a low-rank constraint on the large coefficient matrix. To address the computational challenges posed by nonlinearity, two high-dimensional latent matrices, and the low-rank constraint, we propose a novel variational estimation scheme that combines Laplace and Taylor approximations. We also develop a criterion based on a singular value ratio to determine the number of factors and the rank of the coefficient matrix. Comprehensive simulation studies demonstrate that the proposed method outperforms state-of-the-art methods in both estimation accuracy and computational efficiency. The practical merit of our method is demonstrated by an application to a CITE-seq dataset. A flexible implementation of the proposed method is available in the R package \emph{COAP}.
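To illustrate the idea behind a singular value ratio criterion, the following minimal sketch picks the rank at which consecutive singular values drop most sharply. It is a generic ratio rule written for illustration, not the exact criterion developed in the paper or the interface of the \emph{COAP} package; the function name `select_rank_by_ratio`, the `max_rank` argument, and the example values are all hypothetical.

```python
from typing import Sequence


def select_rank_by_ratio(singular_values: Sequence[float], max_rank: int) -> int:
    """Return the k in 1..max_rank that maximizes s_k / s_{k+1}.

    Generic singular value ratio rule (an assumption for illustration,
    not the paper's exact criterion). The singular values are assumed to
    have been computed beforehand from some estimated matrix.
    """
    # Sort in decreasing order so that s[0] >= s[1] >= ...
    s = sorted(singular_values, reverse=True)
    if max_rank < 1 or max_rank >= len(s):
        raise ValueError("max_rank must satisfy 1 <= max_rank < len(singular_values)")

    best_k, best_ratio = 1, float("-inf")
    for k in range(1, max_rank + 1):
        numer, denom = s[k - 1], s[k]  # k-th and (k+1)-th singular values
        if denom > 0:
            ratio = numer / denom
        else:
            # A zero denominator means an exact rank drop after k,
            # unless the numerator is zero as well.
            ratio = float("inf") if numer > 0 else 0.0
        if ratio > best_ratio:
            best_k, best_ratio = k, ratio
    return best_k


# Hypothetical singular values with a clear gap after the third one.
example = [10.0, 8.0, 7.5, 0.5, 0.4, 0.3]
chosen = select_rank_by_ratio(example, max_rank=5)
print(chosen)
```

In this toy input the largest ratio is 7.5 / 0.5, so the rule selects three components. In practice, the singular values would come from an estimated loading or coefficient matrix, and `max_rank` would be an upper bound chosen by the analyst.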