Summarizing comparative opinions about entities (e.g., hotels, phones) from a set of source reviews, often referred to as contrastive summarization, can considerably aid users in decision making. However, reliably measuring the contrastiveness of the output summaries without relying on human evaluation remains an open problem. Prior work has proposed a token-overlap-based metric, Distinctiveness Score, to measure contrast, but this metric fails to account for meaning-preserving lexical variations. In this work, we propose CASPR, an automated evaluation metric that better measures the contrast between a pair of summaries. Our metric is a simple and lightweight method that leverages the natural language inference (NLI) task: it segments reviews into single-claim sentences and carefully aggregates pairwise NLI scores between them into a summary-level score. We compare CASPR with Distinctiveness Score and with a simple yet powerful baseline based on BERTScore. Our results on the CoCoTRIP dataset demonstrate that CASPR captures the contrastiveness of summary pairs more reliably than the baselines.
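The pipeline the abstract describes, segmenting each summary into single-claim sentences, scoring sentence pairs with NLI, and aggregating to a summary level, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the exact segmentation and aggregation used by CASPR are not specified here, so the sketch uses naive sentence splitting, a hypothetical stand-in for a real NLI model's contradiction probability, and a simple max-then-mean aggregation.

```python
def split_claims(summary: str) -> list[str]:
    """Naive single-claim segmentation: one claim per sentence."""
    return [s.strip() for s in summary.split(".") if s.strip()]


def nli_contradiction_prob(premise: str, hypothesis: str) -> float:
    """Hypothetical stand-in for a real NLI model's P(contradiction).
    An actual implementation would run an NLI classifier on the pair;
    this toy heuristic just flags opposite-polarity wording."""
    negative_words = {"not", "no", "never", "dirty", "noisy"}
    p_neg = any(w in negative_words for w in premise.lower().split())
    h_neg = any(w in negative_words for w in hypothesis.lower().split())
    return 0.9 if p_neg != h_neg else 0.1


def contrast_score(summary_a: str, summary_b: str) -> float:
    """Summary-level contrast: for each claim in one summary, take the
    max contradiction probability against the other summary's claims,
    average over claims, and symmetrize over both directions."""
    claims_a = split_claims(summary_a)
    claims_b = split_claims(summary_b)

    def one_way(src: list[str], tgt: list[str]) -> float:
        return sum(max(nli_contradiction_prob(s, t) for t in tgt)
                   for s in src) / len(src)

    return 0.5 * (one_way(claims_a, claims_b) + one_way(claims_b, claims_a))
```

Under this sketch, a pair of summaries with opposing claims (e.g., "The room was clean." vs. "The room was dirty.") scores higher than a pair of identical or paraphrased summaries, which is the behavior a contrast metric should have and which token overlap alone does not guarantee.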