The multivariate hypergeometric distribution describes sampling without replacement from a discrete population whose elements are divided into multiple categories. We address a gap in the literature: estimating discrete distributions when both the total population size and the sizes of its constituent categories are unknown. We propose a novel solution that uses the hypergeometric likelihood for this estimation problem, even in the presence of severe under-sampling. Using the variational autoencoder framework, we extend the approach to data-generating processes in which the ground truth is a mixture of distributions conditional on a continuous latent variable, as in collaborative filtering. Simulations with empirical data demonstrate that our method outperforms other likelihood functions used to model count data, both in the accuracy of the population size estimate and in its ability to learn an informative latent space. We demonstrate the method's versatility through applications in NLP, where it infers and estimates the complexity of latent vocabularies in text excerpts, and in biology, where it accurately recovers the true number of gene transcripts from sparse single-cell genomics data.
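To make the likelihood concrete, the following is a minimal sketch of the multivariate hypergeometric log-probability with known category sizes: the probability of drawing counts k_1, ..., k_C from categories of sizes K_1, ..., K_C is the product of C(K_i, k_i) divided by C(N, n), where N is the sum of the K_i and n is the sum of the k_i. The function names `log_comb` and `mvhg_log_pmf` are illustrative assumptions, not taken from the paper, and the sketch does not reproduce the paper's estimation procedure, in which the category sizes are unknown and inferred.

```python
import math


def log_comb(n: int, k: int) -> float:
    """Log of the binomial coefficient C(n, k), computed via log-gamma."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)


def mvhg_log_pmf(counts, category_sizes) -> float:
    """Log-probability of `counts` under sampling without replacement.

    counts[i] is the number of sampled elements from category i;
    category_sizes[i] is the (here assumed known) size of category i.
    """
    if len(counts) != len(category_sizes):
        raise ValueError("counts and category_sizes must have equal length")
    # A sample cannot contain more elements of a category than exist.
    if any(k < 0 or k > K for k, K in zip(counts, category_sizes)):
        return float("-inf")
    total_size = sum(category_sizes)  # N: population size
    sample_size = sum(counts)         # n: number of elements drawn
    numerator = sum(log_comb(K, k) for k, K in zip(counts, category_sizes))
    return numerator - log_comb(total_size, sample_size)


if __name__ == "__main__":
    # Two categories of sizes 2 and 3; draw one element from each.
    # Exact value: C(2,1) * C(3,1) / C(5,2) = 6 / 10.
    print(math.exp(mvhg_log_pmf([1, 1], [2, 3])))
```

Working in log space keeps the computation stable for large populations, where the binomial coefficients themselves would overflow floating-point arithmetic.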