Dimension reduction techniques are among the most essential tools for analyzing high-dimensional data. Generalized principal component analysis (PCA) is an extension of standard PCA that has been widely used to identify low-dimensional features in high-dimensional discrete data, such as binary, multi-category, and count data. For microbiome count data in particular, multinomial PCA is a natural counterpart of standard PCA. However, this technique fails to account for the excess of zero values that is frequently observed in microbiome count data. To allow for this sparsity, zero-inflated multivariate distributions can be used. We propose a zero-inflated probabilistic PCA model for latent factor analysis. The proposed model is a fully Bayesian factor analysis technique that is appropriate for microbiome count data. We use a mean-field-type variational family to approximate the marginal likelihood and develop a classification variational approximation algorithm to fit the model. Through simulation experiments, we demonstrate the efficiency of our procedure for predictions based on the latent factors and the model parameters, and show that it outperforms competing methods. This efficiency is further illustrated with two real microbiome count datasets. The method is implemented in R.
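To make the zero-inflation idea concrete, the following is a minimal simulation sketch and not the authors' R implementation. It assumes one simple generative form: a low-rank latent factor score per taxon, a softmax link to multinomial probabilities, and an independent Bernoulli "absence" indicator that forces structural zeros. All names and parameter values (`simulate_counts`, `zero_prob`, `depth`, the loading scale 0.5) are hypothetical choices made for illustration.

```python
import math
import random


def simulate_counts(n_samples=200, n_taxa=30, n_factors=2,
                    depth=100, zero_prob=0.4, seed=0):
    """Simulate count rows from an assumed zero-inflated multinomial factor model.

    zero_prob is the (assumed) probability that a taxon is structurally
    absent from a sample; zero_prob=0.0 gives a plain multinomial model.
    """
    rng = random.Random(seed)
    # Loading matrix (n_taxa x n_factors), shared by all samples.
    loadings = [[rng.gauss(0.0, 0.5) for _ in range(n_factors)]
                for _ in range(n_taxa)]
    rows = []
    for _ in range(n_samples):
        # Low-dimensional latent factor for this sample.
        z = [rng.gauss(0.0, 1.0) for _ in range(n_factors)]
        scores = [sum(l * f for l, f in zip(row, z)) for row in loadings]
        # Bernoulli presence indicators: False means a structural zero.
        present = [rng.random() >= zero_prob for _ in range(n_taxa)]
        if not any(present):
            # Keep at least one taxon so the multinomial is well defined.
            present[rng.randrange(n_taxa)] = True
        # Softmax weights restricted to the taxa that are present.
        top = max(scores)
        weights = [math.exp(s - top) if p else 0.0
                   for s, p in zip(scores, present)]
        # Multinomial draw: allocate `depth` reads among the present taxa.
        counts = [0] * n_taxa
        for j in rng.choices(range(n_taxa), weights=weights, k=depth):
            counts[j] += 1
        rows.append(counts)
    return rows


def zero_fraction(rows):
    """Share of entries in the count table that are exactly zero."""
    total = sum(len(r) for r in rows)
    zeros = sum(1 for r in rows for c in r if c == 0)
    return zeros / total


if __name__ == "__main__":
    plain = simulate_counts(zero_prob=0.0)
    inflated = simulate_counts(zero_prob=0.4)
    print("zero fraction, plain multinomial:", round(zero_fraction(plain), 3))
    print("zero fraction, zero-inflated:    ", round(zero_fraction(inflated), 3))
```

Under these assumptions, the zero-inflated table contains far more zeros than a multinomial with the same latent structure can produce by sampling alone, which is the gap a multinomial PCA fit would leave unexplained.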