Automatic metrics are used as proxies to evaluate abstractive summarization systems when human annotations are too expensive. To be useful, these metrics should be fine-grained, show a high correlation with human annotations, and ideally be independent of reference quality; however, most standard evaluation metrics for summarization are reference-based, and existing reference-free metrics correlate poorly with relevance, especially on summaries of longer documents. In this paper, we introduce a reference-free metric that correlates well with human-evaluated relevance while being very cheap to compute. We show that this metric can also be used alongside reference-based metrics to improve their robustness in low-quality reference settings.