Estimating the Shannon entropy of a discrete distribution from only a small sample is challenging. Estimating other information-theoretic metrics, such as the Kullback-Leibler divergence between two sparsely sampled discrete distributions, is even harder. Existing approaches to these problems have shortcomings: they are biased, rely on heuristics, work only for some distributions, or cannot be applied to all information-theoretic metrics. Here, we propose a fast, semi-analytical estimator for sparsely sampled distributions that is efficient, precise, and general. Its derivation is grounded in probabilistic considerations and uses a hierarchical Bayesian approach to extract as much information as possible from the few observations available. Our approach estimates the Shannon entropy with precision at least comparable to the state of the art, and most often better. It can also be used to obtain accurate estimates of any other information-theoretic metric, including the notoriously challenging Kullback-Leibler divergence, for which it likewise performs consistently better than existing estimators.
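To make the sparse-sampling problem concrete, the following minimal sketch shows how the naive maximum-likelihood ("plug-in") entropy estimate behaves when the sample is much smaller than the number of states. It is not the estimator proposed here: it only illustrates the bias that motivates it, using the plug-in estimate and the classical Miller-Madow correction as stand-ins for "existing approaches". The uniform test distribution, the sizes `K` and `N`, the number of trials, and all function names are assumptions chosen for illustration.

```python
import math
import random
from collections import Counter


def plugin_entropy(samples):
    """Maximum-likelihood ("plug-in") Shannon entropy estimate, in nats."""
    n = len(samples)
    counts = Counter(samples)
    return -sum((c / n) * math.log(c / n) for c in counts.values())


def miller_madow_entropy(samples):
    """Plug-in estimate plus the Miller-Madow first-order bias correction."""
    n = len(samples)
    observed_bins = len(set(samples))
    return plugin_entropy(samples) + (observed_bins - 1) / (2 * n)


def mean_estimate(estimator, num_states, sample_size, trials, rng):
    """Average an estimator over repeated small samples from a uniform distribution."""
    total = 0.0
    for _ in range(trials):
        samples = [rng.randrange(num_states) for _ in range(sample_size)]
        total += estimator(samples)
    return total / trials


# Illustrative setting (assumed): 100 equiprobable states, only 20 observations.
K, N, TRIALS = 100, 20, 2000
rng = random.Random(0)

true_entropy = math.log(K)
plugin_mean = mean_estimate(plugin_entropy, K, N, TRIALS, rng)
mm_mean = mean_estimate(miller_madow_entropy, K, N, TRIALS, rng)

print(f"true entropy       : {true_entropy:.3f} nats")
print(f"plug-in (mean)     : {plugin_mean:.3f} nats")
print(f"Miller-Madow (mean): {mm_mean:.3f} nats")
```

With `N` observations the plug-in estimate can never exceed `log(N)`, so in this setting it necessarily falls short of the true value `log(K)`. The Miller-Madow term adds at most `(N - 1) / (2 * N)` and therefore cannot close the gap either.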