Multivariate count data with many zeros arise frequently in application areas such as text mining with a document-term matrix and cluster analysis with microbiome abundance data. Exponential family PCA (Collins et al., 2001) is a widely used dimension reduction tool for capturing the underlying low-rank structure of count data. It produces principal component scores by fitting Poisson regression models with the estimated loadings as covariates. For sparse count data, this tends to produce extreme scores that deviate significantly from the true scores. We consider two major sources of bias in this estimation procedure and propose ways to reduce their effects. First, the discrepancy between the true loadings and their estimates under a limited sample size substantially degrades the quality of the score estimates. Treating the estimated loadings as covariates subject to bias and measurement error, we debias the score estimates by applying the iterative bootstrap method to the loadings and by using classical measurement error models. Second, the bias of the maximum likelihood estimator (MLE) is often ignored in score estimation, although it can be reduced with well-known MLE bias reduction methods. We demonstrate the effectiveness of the proposed bias correction procedure through experiments on both simulated and real data.
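To make the score-estimation step concrete, the following is a minimal sketch of the baseline procedure the abstract describes: fitting a Poisson regression (log link) for one count vector with the loadings held fixed as covariates. It does not implement the proposed bias corrections. The function name `poisson_scores`, the offset `mu`, the Newton solver, and the simulated data are all assumptions made for illustration, not taken from the paper.

```python
import numpy as np

def poisson_scores(x, V, mu, n_iter=50, tol=1e-8):
    """Hypothetical helper: MLE of the scores a for one count vector x,
    under the assumed model x_j ~ Poisson(exp(mu_j + V[j] @ a)),
    with loadings V (p x k) treated as fixed covariates."""
    a = np.zeros(V.shape[1])
    for _ in range(n_iter):
        lam = np.exp(mu + V @ a)          # fitted Poisson means
        grad = V.T @ (x - lam)            # score function of the log-likelihood
        hess = V.T @ (V * lam[:, None])   # Fisher information (negative Hessian)
        step = np.linalg.solve(hess, grad)
        a = a + step                      # Newton-Raphson update
        if np.max(np.abs(step)) < tol:
            break
    return a

# Simulated example (all values are illustrative assumptions).
rng = np.random.default_rng(0)
p, k = 200, 2
V = 0.3 * rng.standard_normal((p, k))    # stands in for estimated loadings
mu = np.full(p, np.log(5.0))             # main effects on the log scale
a_true = np.array([1.0, -0.5])
x = rng.poisson(np.exp(mu + V @ a_true))
a_hat = poisson_scores(x, V, mu)
```

Here `V` plays the role of the true loadings, so `a_hat` is affected only by ordinary MLE bias. In the setting of the abstract, `V` would itself be estimated from limited data, which is the additional source of error the proposed correction targets.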