The multivariate hypergeometric distribution describes sampling without replacement from a discrete population of elements divided into multiple categories. Addressing a gap in the literature, we tackle the problem of estimating discrete distributions when both the total population size and the sizes of its constituent categories are unknown. We propose a novel solution that uses the hypergeometric likelihood to perform this estimation, even under severe under-sampling. Using the variational autoencoder framework, we extend our approach to data-generating processes in which the ground truth is a mixture of distributions conditional on a continuous latent variable, as in collaborative filtering. Simulations with empirical data demonstrate that our method outperforms other likelihood functions used to model count data, both in the accuracy of the population size estimate and in its ability to learn an informative latent space. We demonstrate our method's versatility through applications in NLP, by inferring and estimating the complexity of latent vocabularies in text excerpts, and in biology, by accurately recovering the true number of gene transcripts from sparse single-cell genomics data.
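For readers less familiar with the distribution, the textbook multivariate hypergeometric likelihood can be sketched in a few lines. This is only an illustration of the standard probability mass function, not the estimation method proposed here; the function names `log_binom` and `mvhg_log_pmf` and the example numbers are hypothetical.

```python
from math import exp, lgamma


def log_binom(n, k):
    """Log of the binomial coefficient C(n, k), computed via log-gamma."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)


def mvhg_log_pmf(counts, category_sizes):
    """Log-probability of observing `counts` when drawing sum(counts)
    elements without replacement from a population whose categories
    have sizes `category_sizes` (hypothetical helper for illustration)."""
    if len(counts) != len(category_sizes):
        raise ValueError("counts and category_sizes must have equal length")
    # A count cannot be negative or exceed the size of its category.
    if any(x < 0 or x > K for x, K in zip(counts, category_sizes)):
        return float("-inf")
    total_size = sum(category_sizes)   # N: total population size
    sample_size = sum(counts)          # n: number of draws
    log_numerator = sum(log_binom(K, x) for K, x in zip(category_sizes, counts))
    return log_numerator - log_binom(total_size, sample_size)


# Population of 10 elements in three categories of sizes 5, 3 and 2;
# a sample of 4 contains 2, 1 and 1 elements of each category.
p = exp(mvhg_log_pmf([2, 1, 1], [5, 3, 2]))
print(round(p, 4))  # 0.2857, i.e. C(5,2)*C(3,1)*C(2,1) / C(10,4) = 60/210
```

In the setting described above, the category sizes passed as `category_sizes` are the unknowns: only the sampled counts are observed, and the likelihood is evaluated as a function of candidate category sizes.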