Although ROUGE is very popular for evaluating extractive summarization, it has long been criticized for its lack of semantic awareness and for ignoring the ranking quality of the summarizer. Previous research has addressed these issues by proposing a gain-based automated metric called Sem-nCG, which is both rank-aware and semantic-aware. However, Sem-nCG does not consider the amount of redundancy present in a model-generated summary, and it currently does not support evaluation with multiple reference summaries. Addressing both limitations simultaneously is not trivial. In this paper, we therefore propose a redundancy-aware Sem-nCG metric and demonstrate how it can be used to evaluate model summaries against multiple references. We also explore different ways of incorporating redundancy into the original metric through extensive experiments. The results show that the new redundancy-aware metric correlates more strongly with human judgments than the original Sem-nCG metric in both single-reference and multiple-reference scenarios.