The performance of machine learning classification algorithms is usually evaluated by estimating metrics, often derived from the confusion matrix, using training data and cross-validation. However, these metrics do not prove that the best possible performance has been achieved. Fundamental limits on error rates can be estimated using information distance measures. To this end, the confusion matrix is formulated to comply with the Chernoff-Stein Lemma, which links the error rates to the Kullback-Leibler divergences between the probability density functions describing the two classes. This leads to a key result that relates Cohen's Kappa to the Resistor Average Distance, which is the parallel resistor combination of the two Kullback-Leibler divergences. The Resistor Average Distance has units of bits and is estimated from the same training data used by the classification algorithm, using kNN estimates of the Kullback-Leibler divergences; the classification algorithm itself supplies the confusion matrix and Kappa. Theory and methods are discussed in detail and then applied to Monte Carlo data and real datasets. Four very different real datasets (Breast Cancer, Coronary Heart Disease, Bankruptcy, and Particle Identification), containing both continuous and discrete values, are analysed, and their classification performance is compared to the expected theoretical limit. In all cases the analysis shows that the algorithms could not have performed any better, given the underlying probability density functions of the two classes. Important lessons are learnt on how to predict the performance of algorithms on imbalanced data using training datasets that are approximately balanced. Machine learning is very powerful, but classification performance ultimately depends on the quality of the data and the relevance of the variables to the problem.
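The quantities named in the abstract can be illustrated with a minimal sketch. It computes Cohen's Kappa from a binary confusion matrix, a kNN estimate of the Kullback-Leibler divergence in bits, and the Resistor Average Distance as the parallel combination of the two divergences. The abstract does not say which kNN estimator is used, so the nearest-neighbour distance-ratio form below is an assumption, and the function names (`cohens_kappa`, `knn_kl_divergence_bits`, `resistor_average_distance`) are hypothetical. The relation between Kappa and the Resistor Average Distance is not stated in the abstract and is not reproduced here.

```python
import math


def cohens_kappa(tp, fn, fp, tn):
    """Cohen's Kappa from the four cells of a binary confusion matrix."""
    n = tp + fn + fp + tn
    p_observed = (tp + tn) / n
    # Agreement expected by chance, from the row and column marginals.
    p_chance = ((tp + fn) * (tp + fp) + (fp + tn) * (fn + tn)) / (n * n)
    return (p_observed - p_chance) / (1.0 - p_chance)


def resistor_average_distance(kl_pq, kl_qp):
    """Parallel-resistor combination: 1/R = 1/D(p||q) + 1/D(q||p)."""
    if kl_pq <= 0.0 or kl_qp <= 0.0:
        return 0.0  # a non-positive estimate is treated as zero distance
    return kl_pq * kl_qp / (kl_pq + kl_qp)


def _kth_nn_distance(point, sample, k, skip_self=False):
    """Euclidean distance from point to its k-th nearest neighbour in sample."""
    dists = sorted(math.dist(point, other) for other in sample)
    # If the point is a member of the sample, dists[0] is its own zero distance.
    return dists[k] if skip_self else dists[k - 1]


def knn_kl_divergence_bits(x, y, k=1):
    """kNN estimate of D(p||q) in bits, with x drawn from p and y from q.

    Assumed estimator (not taken from the source):
        (d/n) * sum_i log2(nu_k(i) / rho_k(i)) + log2(m / (n - 1))
    where rho_k(i) is the k-NN distance of x_i within x and nu_k(i) is the
    k-NN distance of x_i to y. Duplicate points give rho = 0 and must be
    handled separately (for example by jittering discrete data).
    """
    n, m, d = len(x), len(y), len(x[0])
    total = 0.0
    for xi in x:
        rho = _kth_nn_distance(xi, x, k, skip_self=True)
        nu = _kth_nn_distance(xi, y, k)
        total += math.log2(nu / rho)
    return d * total / n + math.log2(m / (n - 1))


if __name__ == "__main__":
    import random

    rng = random.Random(0)
    # Two illustrative one-dimensional Gaussian classes (synthetic data).
    class_a = [(rng.gauss(0.0, 1.0),) for _ in range(300)]
    class_b = [(rng.gauss(2.0, 1.0),) for _ in range(300)]

    kl_ab = knn_kl_divergence_bits(class_a, class_b, k=5)
    kl_ba = knn_kl_divergence_bits(class_b, class_a, k=5)
    print("D(a||b) bits:", kl_ab)
    print("D(b||a) bits:", kl_ba)
    print("Resistor Average Distance bits:", resistor_average_distance(kl_ab, kl_ba))
    print("Kappa:", cohens_kappa(tp=40, fn=10, fp=5, tn=45))
```

The brute-force neighbour search costs O(n·m) distance evaluations per divergence, which is adequate for illustration; a spatial index would be the natural replacement for larger datasets. Because the Resistor Average Distance is symmetric in the two divergences, swapping the class labels leaves it unchanged.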