In many applications in biology, engineering, and economics, identifying similarities and differences between distributions of data from complex processes requires comparing finite categorical samples of discrete counts. Statistical divergences quantify the difference between two distributions. However, estimating them is difficult, and empirical methods often fail, especially when the samples are small. We develop a Bayesian estimator of the Kullback-Leibler divergence between two probability distributions that uses a mixture of Dirichlet priors on the distributions being compared. We study the properties of the estimator on two examples: probabilities drawn from Dirichlet distributions, and random strings of letters drawn from Markov chains. We extend the approach to the squared Hellinger divergence. Both estimators outperform other estimation techniques, with better results for data with a large number of categories and for larger divergence values.
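To make the small-sample difficulty concrete, the following minimal sketch computes plug-in (empirical-frequency) estimates of the Kullback-Leibler and squared Hellinger divergences from two vectors of category counts. It is an illustration only and is not the mixture-of-Dirichlet-priors estimator described above. The function names, the example counts, the `pseudocount` parameter (equivalent to using the posterior mean under a single symmetric Dirichlet prior with a fixed concentration), and the convention H^2 = 1 - sum_i sqrt(p_i q_i) are all assumptions made for this example.

```python
import math


def normalize(counts, pseudocount=0.0):
    """Turn category counts into probabilities.

    With pseudocount > 0 this is the posterior mean under a symmetric
    Dirichlet prior with that fixed concentration (an assumption for this
    sketch, not the mixture prior of the abstract).
    """
    total = sum(counts) + pseudocount * len(counts)
    return [(c + pseudocount) / total for c in counts]


def kl_divergence(p, q):
    """KL(p || q) in nats; infinite if q_i == 0 for some i with p_i > 0."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue  # the term 0 * log(0 / q_i) is taken to be 0
        if qi == 0.0:
            return math.inf
        total += pi * math.log(pi / qi)
    return total


def squared_hellinger(p, q):
    """Squared Hellinger divergence, using H^2 = 1 - sum_i sqrt(p_i * q_i)."""
    return 1.0 - sum(math.sqrt(pi * qi) for pi, qi in zip(p, q))


# Hypothetical small samples over four categories.
counts_p = [5, 3, 2, 0]
counts_q = [4, 0, 3, 3]

# Plug-in estimate: category 2 was never observed in the second sample,
# so the empirical KL divergence is infinite.
kl_plugin = kl_divergence(normalize(counts_p), normalize(counts_q))

# Fixed-pseudocount estimate: always finite, but dependent on the chosen value.
kl_smoothed = kl_divergence(normalize(counts_p, 1.0), normalize(counts_q, 1.0))

h2_plugin = squared_hellinger(normalize(counts_p), normalize(counts_q))
```

With these counts the plug-in KL estimate diverges because one category is unobserved in the second sample, whereas the pseudocount version stays finite but varies with the arbitrary choice of pseudocount. Sensitivity of this kind is one reason to average over prior concentrations rather than fix a single value.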