In Natural Language Processing (NLP), binary classification algorithms are often evaluated using the F1 score. Because the sample F1 score is only an estimate of the population F1 score, reporting it without an indication of its accuracy is insufficient. Confidence intervals provide such an indication. However, most studies either do not report them or construct them using methods with poor statistical properties. In the present study, I review current analytical methods (i.e., the Clopper-Pearson method and the Wald method) for constructing confidence intervals for the population F1 score, propose two new analytical methods (i.e., the Wilson direct method and the Wilson indirect method), and compare all four methods on their coverage probabilities and interval lengths, as well as on whether they suffer from overshoot and degeneracy. Theoretical results demonstrate that neither proposed method suffers from overshoot or degeneracy. Experimental results suggest that both proposed methods outperform current methods in terms of coverage probabilities and interval lengths. I illustrate both current and proposed methods on two suggestion mining tasks, discuss the practical implications of these results, and suggest areas for future research.
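To make the idea of a Wilson-type interval for the F1 score concrete, the following is a minimal sketch and not necessarily the Wilson direct or Wilson indirect method proposed here, whose details the abstract does not give. It assumes that, conditional on n = TP + FP + FN, the true-positive count is binomial with success probability F* = F1 / (2 - F1). It computes the standard Wilson score interval for F* and maps the endpoints back through F1 = 2F* / (1 + F*). All function names are hypothetical.

```python
import math


def f1_score(tp: int, fp: int, fn: int) -> float:
    """Sample F1 score from confusion-matrix counts."""
    return 2 * tp / (2 * tp + fp + fn)


def wilson_interval(successes: int, n: int, z: float = 1.959964) -> tuple[float, float]:
    """Standard Wilson score interval for a binomial proportion.

    z is the standard normal quantile (about 1.96 for a 95% interval).
    """
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


def f1_wilson_sketch(tp: int, fp: int, fn: int, z: float = 1.959964) -> tuple[float, float]:
    """Illustrative interval for F1 (an assumption, not the paper's stated method).

    Treats tp as Binomial(n, F*) with n = tp + fp + fn, builds a Wilson
    interval for F*, then applies the increasing map F1 = 2F* / (1 + F*).
    """
    n = tp + fp + fn
    lower, upper = wilson_interval(tp, n, z)
    # Clamp tiny floating-point excursions outside [0, 1].
    lower, upper = max(lower, 0.0), min(upper, 1.0)
    return 2 * lower / (1 + lower), 2 * upper / (1 + upper)


if __name__ == "__main__":
    tp, fp, fn = 80, 10, 10  # made-up counts for illustration only
    print("sample F1:", f1_score(tp, fp, fn))
    print("interval :", f1_wilson_sketch(tp, fp, fn))
```

Because the Wilson score interval for a proportion stays within [0, 1] and has positive width for any n > 0, and the back-transformation is increasing and maps [0, 1] onto [0, 1], an interval built this way neither overshoots nor collapses to a single point.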