The ROUGE metric is commonly used to evaluate extractive summarization, but it has been criticized for its lack of semantic awareness and for ignoring the ranking quality of the extractive summarizer. Previous research introduced a gain-based automated metric called Sem-nCG that addresses both issues, as it is both rank-aware and semantics-aware. However, Sem-nCG does not consider the amount of redundancy present in a model summary and does not currently support evaluation against multiple reference summaries. A good model summary must balance importance and diversity, but finding a metric that captures both aspects is challenging. In this paper, we propose a redundancy-aware Sem-nCG metric and demonstrate how the revised metric can also be used to evaluate model summaries against multiple references, which previous research did not support. Experimental results show that the revised Sem-nCG metric correlates more strongly with human judgments than the original Sem-nCG metric and the traditional ROUGE and BERTScore metrics in both single-reference and multiple-reference scenarios.
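To make the idea of a gain-based, redundancy-aware score concrete, the following is a minimal sketch and not the paper's exact formulation. It assumes that each document sentence already has a semantic gain with respect to the reference (`gains`) and that a pairwise sentence-similarity matrix (`sim`) is available. The function names, the redundancy definition (mean of each selected sentence's maximum similarity to another selected sentence), and the linear combination with `weight` are all illustrative assumptions.

```python
from typing import Sequence


def ncg_at_k(gains: Sequence[float], selected: Sequence[int], k: int) -> float:
    """Normalized cumulative gain of the first k selected sentences.

    gains[i] is an assumed semantic gain of document sentence i with
    respect to the reference summary; selected holds sentence indices
    in the order the summarizer ranked them.
    """
    cg = sum(gains[i] for i in selected[:k])
    # Ideal cumulative gain: the k largest gains in the document.
    ideal = sum(sorted(gains, reverse=True)[:k])
    return cg / ideal if ideal > 0 else 0.0


def redundancy(sim: Sequence[Sequence[float]], selected: Sequence[int]) -> float:
    """Hypothetical redundancy score in [0, 1] for similarities in [0, 1].

    For every selected sentence, take its highest similarity to any
    other selected sentence, then average those maxima.
    """
    if len(selected) < 2:
        return 0.0
    total = 0.0
    for i in selected:
        total += max(sim[i][j] for j in selected if j != i)
    return total / len(selected)


def redundancy_aware_score(
    gains: Sequence[float],
    sim: Sequence[Sequence[float]],
    selected: Sequence[int],
    k: int,
    weight: float = 0.5,
) -> float:
    """Assumed linear trade-off between importance and diversity."""
    top = list(selected[:k])
    importance = ncg_at_k(gains, top, k)
    diversity = 1.0 - redundancy(sim, top)
    return weight * importance + (1.0 - weight) * diversity
```

Under these assumptions, a summary that picks the two highest-gain sentences can still score lower than a less important but more diverse pair when those two sentences are near-duplicates, which is the trade-off between importance and diversity that the abstract describes.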