In Natural Language Processing (NLP) classification tasks such as topic categorisation and sentiment analysis, model generalisability is typically measured with standard metrics such as Accuracy, F-Measure, or AUC-ROC. The diversity of these metrics, and the arbitrariness with which they are applied, suggest that there is no agreement within NLP on a single best metric. This lack of agreement suggests that the underlying heuristics each metric encodes have not been sufficiently examined. To address this, we compare several standard classification metrics with more 'exotic' alternatives and demonstrate that a random-guess normalised Informedness metric is a parsimonious baseline for task performance. To show how consequential the choice of metric is, we perform extensive experiments on a wide range of NLP tasks, including a synthetic scenario, natural language understanding, question answering, and machine translation. Across these tasks we use a superset of metrics to rank models and find that Informedness best captures the ideal model characteristics. Finally, we release a Python implementation of Informedness following the scikit-learn classifier format.
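The 'random-guess normalised' property refers to the fact that binary Informedness equals sensitivity + specificity − 1 (Youden's J, also known as Bookmaker Informedness), so a classifier guessing at chance scores 0 regardless of class prevalence, while a perfect classifier scores 1. As a rough illustration only (not the authors' released implementation; the `informedness` function below and its macro-averaged multiclass reduction are our own assumptions), a confusion-matrix-based sketch might look like:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def informedness(y_true, y_pred):
    """Multiclass Informedness, macro-averaged over one-vs-rest
    decompositions of the confusion matrix.

    Per class k: J_k = TPR_k + TNR_k - 1, so 0 corresponds to
    random guessing and 1 to perfect prediction.
    """
    cm = confusion_matrix(y_true, y_pred)
    total = cm.sum()
    scores = []
    for k in range(cm.shape[0]):
        tp = cm[k, k]
        fn = cm[k, :].sum() - tp  # class-k items predicted as other classes
        fp = cm[:, k].sum() - tp  # other-class items predicted as class k
        tn = total - tp - fn - fp
        tpr = tp / (tp + fn) if (tp + fn) else 0.0  # sensitivity / recall
        tnr = tn / (tn + fp) if (tn + fp) else 0.0  # specificity
        scores.append(tpr + tnr - 1.0)
    # Macro averaging is one reasonable multiclass reduction; prevalence
    # weighting is another. The paper's implementation may differ.
    return float(np.mean(scores))

# Example: a degenerate classifier that always predicts the majority class
# reaches 0.75 accuracy on this imbalanced set but 0.0 Informedness.
y_true = [0, 0, 0, 1]
y_pred = [0, 0, 0, 0]
print(informedness(y_true, y_pred))  # 0.0
```

The worked example shows the chance-correction in action: accuracy rewards majority-class guessing on imbalanced data, whereas Informedness assigns it the same score as random guessing.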