Zero-inflation in the Multivariate Poisson Lognormal Family

Analyzing high-dimensional count data is challenging, and model-based statistical approaches provide an adequate and efficient framework that preserves explainability. The (multivariate) Poisson-Log-Normal (PLN) model is one such model: it assumes that count data are driven by an underlying structured latent Gaussian variable, so that the dependencies between counts stem solely from the latent dependencies. However, PLN does not account for zero-inflation, a feature frequently observed in real-world datasets. Here we introduce the Zero-Inflated PLN (ZIPLN) model, which adds a multivariate zero-inflated component to the model in the form of an additional Bernoulli latent variable. The zero-inflation can be fixed, site-specific, feature-specific, or dependent on covariates. We estimate the model parameters using variational inference that scales up to datasets with a few thousand variables, and compare two approximations: (i) independent Gaussian and Bernoulli variational distributions, or (ii) a Gaussian variational distribution conditioned on the Bernoulli one. The method is assessed on synthetic data, and the efficiency of ZIPLN is established even when zero-inflation concerns up to $90\%$ of the observed counts. We then apply both ZIPLN and PLN to a cow microbiome dataset containing $90.6\%$ zeroes. Accounting for zero-inflation significantly increases the log-likelihood and reduces dispersion in the latent space, thus leading to improved group discrimination.
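To make the generative structure described above concrete, here is a minimal simulation sketch in Python. It is not the authors' implementation: the function name `simulate_zipln`, the exponential link between the latent Gaussian layer and the Poisson rate, and the convention that the Bernoulli variable equals 1 for a structural zero are all assumptions made for illustration. The abstract fixes only the ingredients: a latent Gaussian, a Bernoulli zero-inflation layer, and Poisson counts.

```python
import numpy as np

def simulate_zipln(n, p, pi, rng, mu=None, sigma=None):
    """Draw n samples of p counts from a zero-inflated PLN-style model.

    Assumed parameterisation (illustrative, not taken from the paper):
      Z_i  ~ N(mu, sigma)                 latent Gaussian vector
      W_ij ~ Bernoulli(pi)                1 means a structural zero
      Y_ij = (1 - W_ij) * Poisson(exp(Z_ij))
    `pi` may be a scalar (fixed zero-inflation) or any array that
    broadcasts to shape (n, p), e.g. shape (p,) for feature-specific rates.
    """
    mu = np.zeros(p) if mu is None else np.asarray(mu)
    sigma = np.eye(p) if sigma is None else np.asarray(sigma)
    z = rng.multivariate_normal(mu, sigma, size=n)   # latent layer, shape (n, p)
    w = rng.binomial(1, pi, size=(n, p))             # zero-inflation indicators
    y = (1 - w) * rng.poisson(np.exp(z))             # observed counts
    return y, w, z

rng = np.random.default_rng(0)
y, w, z = simulate_zipln(n=200, p=5, pi=0.9, rng=rng)
print("fraction of zeros:", (y == 0).mean())
```

Because the Poisson layer also produces zeros of its own, the observed fraction of zeros is at least as large as the fraction of structural zeros. This is why the Bernoulli layer has to be treated as latent rather than read off the data.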

