In data mining, when binary prediction rules are used to predict a binary outcome, a vast array of literature employs many performance measures for evaluation and comparison. Examples include classification accuracy, precision, recall, F measures, and the Jaccard index. Typically, these performance measures are only estimated from a finite dataset, which may lead to findings that are not statistically significant. To properly quantify such statistical uncertainty, it is important to provide confidence intervals for these estimated performance measures. We consider statistical inference about general performance measures used in data mining, with both individual and joint confidence intervals. These confidence intervals are based on asymptotic normal approximations and can be computed quickly, without the need for bootstrap resampling. We study the finite-sample coverage probabilities of these confidence intervals and also propose a 'blurring correction' on the variance to improve finite-sample performance. This 'blurring correction' generalizes the plus-four method from the binomial proportion setting to general performance measures used in data mining. Our framework allows multiple performance measures of multiple classification rules to be inferred simultaneously for comparison.
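To make the binomial special case concrete, the sketch below contrasts the classical Wald interval for classification accuracy with the plus-four (Agresti-Coull) interval that the proposed 'blurring correction' generalizes. This is only the textbook binomial version, not the paper's general method; the function names and the 95% critical value z = 1.96 are illustrative choices.

```python
import math

def wald_ci(successes, n, z=1.96):
    # Classical Wald interval: p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n).
    # Based on the asymptotic normal approximation; coverage can be poor
    # in small samples or when p_hat is near 0 or 1.
    p = successes / n
    se = math.sqrt(p * (1 - p) / n)
    return (p - z * se, p + z * se)

def plus_four_ci(successes, n, z=1.96):
    # Plus-four (Agresti-Coull) interval: add 2 successes and 2 failures
    # before forming the Wald interval, which shrinks the point estimate
    # toward 1/2 and improves finite-sample coverage.
    p = (successes + 2) / (n + 4)
    se = math.sqrt(p * (1 - p) / (n + 4))
    return (p - z * se, p + z * se)

# Example: a classifier gets 48 of 50 test cases right (accuracy 0.96).
lo, hi = plus_four_ci(48, 50)
```

Note that with 50 of 50 correct the Wald interval degenerates to a single point, while the plus-four interval remains a nontrivial interval; this kind of finite-sample breakdown is what a variance correction is meant to repair.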