Human evaluation plays a crucial role in Natural Language Processing (NLP): it assesses the quality and relevance of developed systems and thereby guides their improvement. However, the absence of widely accepted human evaluation metrics in NLP hampers fair comparison among systems and the establishment of universal assessment standards. Through an extensive analysis of the existing literature on human evaluation metrics, we identified several gaps in NLP evaluation methodologies, which motivated us to develop our own hierarchical evaluation framework. The proposed framework provides a more comprehensive representation of an NLP system's performance. We applied it to evaluate a Machine Reading Comprehension system that we developed and used within a human-AI symbiosis model. The results highlighted associations between the quality of inputs and the quality of outputs, underscoring the need to evaluate both rather than focusing solely on outputs. In future work, we will investigate whether the proposed framework saves time for evaluators assessing NLP systems.