Analyzing high-dimensional count data is challenging, and statistical model-based approaches provide an adequate and efficient framework that preserves explainability. The (multivariate) Poisson-Log-Normal (PLN) model is one such model: it assumes that count data are driven by an underlying structured latent Gaussian variable, so that the dependencies between counts stem solely from the latent dependencies. However, PLN does not account for zero-inflation, a feature frequently observed in real-world datasets. Here we introduce the Zero-Inflated PLN (ZIPLN) model, which adds a multivariate zero-inflated component to the model in the form of an additional Bernoulli latent variable. The zero-inflation can be fixed, site-specific, feature-specific, or dependent on covariates. We estimate model parameters using variational inference that scales up to datasets with a few thousand variables, and we compare two approximations: (i) independent Gaussian and Bernoulli variational distributions, or (ii) a Gaussian variational distribution conditioned on the Bernoulli one. The method is assessed on synthetic data, and the efficiency of ZIPLN is established even when zero-inflation concerns up to $90\%$ of the observed counts. We then apply both ZIPLN and PLN to a cow microbiome dataset containing $90.6\%$ zeroes. Accounting for zero-inflation significantly increases the log-likelihood and reduces dispersion in the latent space, thus leading to improved group discrimination.
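To make the generative structure described above concrete, the following minimal sketch simulates counts from a zero-inflated Poisson-log-normal mechanism with a fixed zero-inflation probability. It is an illustration under stated assumptions rather than the authors' implementation: the function name `simulate_zipln`, the latent mean, the AR(1)-like latent covariance, and the default parameter values are all hypothetical choices, and no covariates or offsets are included.

```python
import numpy as np


def simulate_zipln(n=200, p=5, pi=0.3, seed=0):
    """Simulate an (n, p) count matrix from an assumed ZIPLN-style mechanism.

    All parameter values here are illustrative assumptions, not taken
    from the paper.
    """
    rng = np.random.default_rng(seed)

    # Assumed latent mean and covariance (AR(1)-like correlation structure).
    idx = np.arange(p)
    sigma = 0.5 ** np.abs(idx[:, None] - idx[None, :])
    mu = np.full(p, 1.0)

    # Latent Gaussian layer: the only source of dependency between counts.
    z = rng.multivariate_normal(mu, sigma, size=n)

    # Bernoulli latent layer: w == 1 marks a structural (inflated) zero.
    # A single fixed probability `pi` is the simplest of the variants listed.
    w = rng.binomial(1, pi, size=(n, p))

    # Poisson emission given the latent Gaussian, then masking by w.
    counts = rng.poisson(np.exp(z))
    y = (1 - w) * counts
    return y, z, w


y, z, w = simulate_zipln()
zero_fraction = float((y == 0).mean())
```

The observed zero fraction is at least the share of structural zeros, since the Poisson layer can also produce zeros on its own. Site-specific, feature-specific, or covariate-dependent zero-inflation would correspond to replacing the scalar `pi` with a vector or matrix of probabilities of a compatible shape.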