Classification systems are evaluated in countless papers. However, we find that evaluation practice is often nebulous. Frequently, metrics are selected without justification, and blurry terminology invites misconceptions. For instance, many works use so-called 'macro' metrics to rank systems (e.g., 'macro F1') but do not clearly specify what they expect from such a 'macro' metric. This is problematic, since the choice of metric can affect a paper's findings as well as shared task rankings, so the selection process should be as clear as possible. Starting from the intuitive concepts of bias and prevalence, we analyze common evaluation metrics against the expectations expressed in papers. Equipped with a thorough understanding of the metrics, we survey metric selection in recent Natural Language Processing shared tasks. The results show that metric choices are often not supported with convincing arguments, an issue that can make any ranking seem arbitrary. This work aims to provide an overview and guidance for more informed and transparent metric selection, fostering meaningful evaluation.
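To make the stakes concrete, here is a minimal sketch (not from the paper; the data and systems are fabricated for illustration) of how choosing micro- versus macro-averaged F1 can reverse a ranking on an imbalanced test set: a majority-class baseline (system A) beats a rare-class-aware system (system B) on micro F1 but loses on macro F1.

```python
# Illustrative only: metric choice can flip the ranking of two systems.
from sklearn.metrics import f1_score

# Ground truth: class 0 is prevalent (90 items), class 1 is rare (10 items).
y_true = [0] * 90 + [1] * 10

# System A: always predicts the majority class.
y_pred_a = [0] * 100

# System B: recovers 9 of the 10 rare items at the cost of 15 false positives.
y_pred_b = [0] * 75 + [1] * 15 + [1] * 9 + [0] * 1

for name, y_pred in [("A", y_pred_a), ("B", y_pred_b)]:
    # micro F1 aggregates counts over all classes (equals accuracy here);
    # macro F1 is the unweighted mean of per-class F1 scores.
    micro = f1_score(y_true, y_pred, average="micro", zero_division=0)
    macro = f1_score(y_true, y_pred, average="macro", zero_division=0)
    print(f"System {name}: micro F1 = {micro:.3f}, macro F1 = {macro:.3f}")

# Approximate output:
#   System A: micro F1 = 0.900, macro F1 = 0.474
#   System B: micro F1 = 0.840, macro F1 = 0.717
# The two metrics rank the systems in opposite order.
```

Which ranking is "right" depends on what one expects the metric to reward (e.g., insensitivity to class prevalence), which is exactly the kind of expectation the paper argues should be stated explicitly.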