When faced with a large number of product reviews, it is unclear whether a human can remember all of them and weigh opinions representatively enough to write a good reference summary. We propose an automatic metric that tests the prevalence of the opinions a summary expresses: it counts the number of reviews consistent with each statement in the summary, while discounting trivial or redundant statements. To formulate this opinion prevalence metric, we consider several existing methods for scoring the factual consistency of a summary statement with respect to each individual source review. On a corpus of Amazon product reviews, we gather multiple human judgments of opinion consistency to determine which automatic metric best captures consistency in product reviews. Using the resulting opinion prevalence metric, we show that a human-authored summary has only slightly better opinion prevalence than randomly selected extracts from the source reviews, and that previous extractive and abstractive unsupervised opinion summarization methods perform worse than humans. We demonstrate room for improvement with a greedy construction of extractive summaries that achieves twice the opinion prevalence of human-authored summaries. Finally, we show that preprocessing source reviews by simplification can raise the opinion prevalence achieved by existing abstractive opinion summarization systems to the level of human performance.
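The counting idea behind the metric can be illustrated with a minimal sketch. This is not the paper's exact formulation: the normalization, the function names (`opinion_prevalence`, `toy_consistent`), and the word-overlap consistency check are all assumptions made for illustration. In practice, `is_consistent` would be replaced by whichever factual-consistency scorer agrees best with human judgments.

```python
from typing import Callable, Sequence


def toy_consistent(review: str, statement: str) -> bool:
    """Hypothetical stand-in for a consistency scorer: a review supports a
    statement if it contains every word of the statement."""
    return set(statement.lower().split()) <= set(review.lower().split())


def opinion_prevalence(
    statements: Sequence[str],
    reviews: Sequence[str],
    is_consistent: Callable[[str, str], bool] = toy_consistent,
    is_trivial: Callable[[str], bool] = lambda s: False,
    is_redundant: Callable[[str, str], bool] = lambda a, b: a == b,
) -> float:
    """Average, over summary statements, of the fraction of reviews that are
    consistent with the statement. Trivial or redundant statements score zero
    but still count in the denominator (an assumed way of discounting them)."""
    if not statements or not reviews:
        return 0.0
    kept: list[str] = []
    total = 0.0
    for s in statements:
        if is_trivial(s) or any(is_redundant(s, k) for k in kept):
            continue  # discounted: contributes nothing to the total
        kept.append(s)
        support = sum(1 for r in reviews if is_consistent(r, s))
        total += support / len(reviews)
    return total / len(statements)


reviews = ["battery lasts long", "battery lasts all day", "screen is dim"]
summary = ["battery lasts", "battery lasts", "screen is dim"]
score = opinion_prevalence(summary, reviews)
print(round(score, 3))
```

In this toy example the repeated statement earns no credit, so a summary cannot raise its score by restating a popular opinion.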