Classification systems are normally trained by minimizing the cross-entropy between system outputs and reference labels, which makes the Kullback-Leibler divergence a natural choice for measuring how closely the system follows the data. Precision and recall offer another perspective on the performance of a classification system. Non-binary reference labels can arise from various sources, and it is often beneficial to train on these soft labels rather than on binarized data. However, the existing definitions of precision and recall require binary reference labels, and binarizing the data can lead to erroneous interpretations. We present a novel method to calculate precision, recall, and F-score without quantizing the data. The proposed metrics extend the well-established ones: the definitions coincide when the labels are binary. To illustrate the behavior of the metrics, we show simple example cases and evaluate different sound event detection models trained on real data with soft labels.
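The abstract does not state the definitions themselves, so the following is only a minimal sketch of one possible way to compute precision, recall, and F-score on soft labels without thresholding. It assumes that the overlap between a reference value and an output value is their minimum, which is an illustrative choice and not necessarily the definition proposed in the paper. The function names `soft_prf` and `binary_prf` and all example values are hypothetical. The sketch does reproduce the one property the abstract states: on binary labels it coincides with the classical count-based metrics.

```python
def soft_prf(reference, output):
    """Precision, recall, F-score for values in [0, 1].

    Assumption (not from the source): overlap is measured with min().
    """
    pairs = list(zip(reference, output))
    tp = sum(min(r, o) for r, o in pairs)        # mass shared by reference and output
    fp = sum(max(o - r, 0.0) for r, o in pairs)  # output mass exceeding the reference
    fn = sum(max(r - o, 0.0) for r, o in pairs)  # reference mass missed by the output
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    f_score = 2 * tp / (2 * tp + fp + fn) if tp + fp + fn > 0 else 0.0
    return precision, recall, f_score


def binary_prf(reference, output):
    """Classical count-based precision, recall, F-score for 0/1 labels."""
    pairs = list(zip(reference, output))
    tp = sum(1 for r, o in pairs if r == 1 and o == 1)
    fp = sum(1 for r, o in pairs if r == 0 and o == 1)
    fn = sum(1 for r, o in pairs if r == 1 and o == 0)
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    f_score = 2 * tp / (2 * tp + fp + fn) if tp + fp + fn > 0 else 0.0
    return precision, recall, f_score


# Binary labels: both functions give the same three numbers.
binary_reference = [1, 0, 1, 1]
binary_output = [1, 1, 0, 1]
print(soft_prf(binary_reference, binary_output))
print(binary_prf(binary_reference, binary_output))

# Soft labels: no thresholding is applied to either side.
soft_reference = [0.9, 0.4, 0.1]
soft_output = [0.8, 0.5, 0.1]
print(soft_prf(soft_reference, soft_output))
```

With the soft example, small deviations between output and reference lower the scores gradually, whereas thresholding both sides at 0.5 would report a perfect match and hide those deviations.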