Shannon entropy is not the only entropy that is relevant to machine-learning datasets, nor possibly even the most important one. Traditional entropies such as Shannon entropy capture information represented by elements' frequencies but not the richer information encoded by their similarities and differences. Capturing the latter requires similarity-sensitive entropy (S-entropy). S-entropy can be measured using either the recently developed Leinster-Cobbold-Reeve framework (LCR) or the newer Vendi score (VS). This raises the practical question of which one to use: LCR or VS. Here we address this question conceptually, analytically, and experimentally, using 53 large and well-known imaging and tabular datasets. We find that LCR and VS values can differ by orders of magnitude and are complementary, except in limiting cases. We show that both LCR and VS results depend on how similarities are scaled, and introduce the notion of ``half-distance'' to parameterize this dependence. We prove that VS provides an upper bound on LCR for several values of the Rényi-Hill order parameter and present evidence that this bound holds for all values. We conclude that VS is preferable only when a dataset's elements can be usefully interpreted as linear combinations of a more fundamental set of ``ur-elements'' or when the system that the dataset describes has a quantum-mechanical character. In the broader case where one simply wishes to capture the rich information encoded by elements' similarities and differences as well as their frequencies, LCR is favored; nevertheless, for certain half-distances the two methods can complement each other.
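To make the comparison concrete, the following is a minimal sketch of how the two quantities might be computed for a small dataset. It is an illustration under stated assumptions, not the authors' implementation. The function names (`similarity_from_distances`, `lcr_diversity`, `vendi_score`) are hypothetical. The exponential kernel, in which similarity falls to 1/2 at the half-distance, is one plausible reading of ``half-distance''. The frequency-weighted form of VS, taken from the eigenvalues of diag(√p) Z diag(√p), is likewise assumed here.

```python
import numpy as np

def similarity_from_distances(dist, half_distance):
    # Assumed kernel: similarity is 1 at distance 0 and 1/2 at the half-distance.
    return np.exp(-np.log(2.0) * dist / half_distance)

def lcr_diversity(p, Z, q=1.0):
    # Leinster-Cobbold-style diversity of order q for frequencies p and
    # similarity matrix Z: (sum_i p_i (Zp)_i^(q-1))^(1/(1-q)).
    p = np.asarray(p, dtype=float)
    Zp = Z @ p
    keep = p > 0                      # absent elements do not contribute
    p, Zp = p[keep], Zp[keep]
    if np.isclose(q, 1.0):            # q -> 1 limit
        return float(np.exp(-np.sum(p * np.log(Zp))))
    return float(np.sum(p * Zp ** (q - 1.0)) ** (1.0 / (1.0 - q)))

def vendi_score(p, Z, q=1.0):
    # Assumed frequency-weighted Vendi score: Hill number of order q of the
    # eigenvalues of diag(sqrt(p)) Z diag(sqrt(p)), which sum to 1 when
    # Z has ones on its diagonal.
    p = np.asarray(p, dtype=float)
    root = np.sqrt(p)
    M = root[:, None] * Z * root[None, :]
    lam = np.linalg.eigvalsh(M)       # M is symmetric
    lam = lam[lam > 1e-12]            # drop numerically zero eigenvalues
    if np.isclose(q, 1.0):
        return float(np.exp(-np.sum(lam * np.log(lam))))
    return float(np.sum(lam ** q) ** (1.0 / (1.0 - q)))

# Toy data (hypothetical): six random 2-D points with equal frequencies.
rng = np.random.default_rng(0)
points = rng.normal(size=(6, 2))
dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
p = np.full(6, 1.0 / 6.0)

for h in (0.1, 1.0, 10.0):
    Z = similarity_from_distances(dist, h)
    print(f"half-distance={h}: LCR={lcr_diversity(p, Z, q=2):.3f}, "
          f"VS={vendi_score(p, Z, q=2):.3f}")
```

In this sketch both measures reduce to the ordinary Hill number when Z is the identity (no element resembles any other) and to 1 when Z is all ones (all elements identical), which are limiting cases of the kind the abstract mentions. Varying `half_distance` changes both values, illustrating the dependence on similarity scaling.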