Efficient and accurate estimation of multivariate empirical probability distributions is fundamental to the calculation of information-theoretic measures such as mutual information and transfer entropy. Common techniques include variations on histogram estimation which, whilst computationally efficient, are often unable to precisely capture the probability density of samples with high correlation, kurtosis or fine substructure, especially when sample sizes are small. Adaptive partitions, which adjust heuristically to the sample, can reduce the bias imparted by the geometry of the histogram itself, but these methods have commonly focused on the location, scale and granularity of the partition, adjustments whose effects are limited for highly correlated distributions. In this paper, I reformulate the differential entropy estimator for the special case of an equiprobable histogram, using a k-d tree to partition the sample space into bins of equal probability mass. Doing so exposes an implicit rotational orientation parameter, which I conjecture is suboptimally specified when the partition is aligned with the marginal axes, as is typical. I propose that the optimal orientation minimises the variance of the bin volumes, and demonstrate that improved entropy estimates can be obtained by rotationally aligning the partition to the sample distribution accordingly. Such optimal partitions are observed to be more accurate than existing techniques in estimating entropies of correlated bivariate Gaussian distributions with known theoretical values, across varying sample sizes (99% CI).
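The idea summarised above can be illustrated with a minimal sketch. It is written under stated assumptions and is not the paper's implementation. For a histogram with bin probabilities p_i and bin volumes v_i, the plug-in differential entropy estimate is H = -sum_i p_i log(p_i / v_i), which reduces to H = log(m) + (1/m) sum_i log(v_i) when all m bins carry equal mass 1/m. The sketch below assumes a bivariate sample, alternating-axis median splits for the k-d tree, the sample's bounding box as the outer boundary of the partition, and a plain grid search over rotation angles. The function names (`kd_cells`, `rotate`, `partition`, `entropy`, `best_angle`) are hypothetical.

```python
import numpy as np


def kd_cells(points, lo, hi, depth, axis=0):
    """Median-split `points` inside the box [lo, hi]; return (count, volume) per leaf."""
    if depth == 0 or len(points) < 2:
        return [(len(points), float(np.prod(hi - lo)))]
    pts = points[np.argsort(points[:, axis])]
    half = len(pts) // 2
    # Cut halfway between the two middle points along the current axis.
    cut = 0.5 * (pts[half - 1, axis] + pts[half, axis])
    left_hi = hi.copy()
    left_hi[axis] = cut
    right_lo = lo.copy()
    right_lo[axis] = cut
    nxt = (axis + 1) % points.shape[1]  # alternate the splitting axis
    return (kd_cells(pts[:half], lo, left_hi, depth - 1, nxt)
            + kd_cells(pts[half:], right_lo, hi, depth - 1, nxt))


def rotate(sample, theta):
    """Rotate a 2-D sample by `theta` radians (volume-preserving, so entropy is unchanged)."""
    c, s = np.cos(theta), np.sin(theta)
    return sample @ np.array([[c, -s], [s, c]]).T


def partition(sample, theta, depth):
    """Rotate the sample, then partition its bounding box into 2**depth cells."""
    pts = rotate(sample, theta)
    cells = kd_cells(pts, pts.min(axis=0), pts.max(axis=0), depth)
    counts = np.array([c for c, _ in cells], dtype=float)
    volumes = np.array([v for _, v in cells])
    return counts, volumes


def entropy(counts, volumes):
    """Plug-in estimate H = -sum p_i log(p_i / v_i), skipping empty cells."""
    p = counts / counts.sum()
    keep = p > 0
    return float(-np.sum(p[keep] * np.log(p[keep] / volumes[keep])))


def best_angle(sample, depth, n_angles=90):
    """Grid-search the angle in [0, pi) that minimises the variance of the bin volumes."""
    thetas = np.linspace(0.0, np.pi, n_angles, endpoint=False)
    variances = [partition(sample, t, depth)[1].var() for t in thetas]
    return float(thetas[int(np.argmin(variances))])


# Example: correlated bivariate Gaussian with a known theoretical entropy.
rng = np.random.default_rng(0)
rho = 0.9
cov = np.array([[1.0, rho], [rho, 1.0]])
sample = rng.multivariate_normal(np.zeros(2), cov, size=2048)

true_h = 0.5 * np.log((2 * np.pi * np.e) ** 2 * np.linalg.det(cov))
h_axis = entropy(*partition(sample, 0.0, depth=5))
theta = best_angle(sample, depth=5)
h_rot = entropy(*partition(sample, theta, depth=5))
print(f"theoretical: {true_h:.3f}  axis-aligned: {h_axis:.3f}  rotated: {h_rot:.3f}")
```

The sample size of 2048 is chosen so that five levels of median splits give exactly 64 points in each of the 32 bins. Bounding the outer cells by the sample's extremes is a simplification that inflates the volumes of the tail bins, so the printed values should be read as illustrative only.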