Incremental Extractive Opinion Summarization Using Cover Trees

Extractive opinion summarization automatically produces a summary of text about an entity (e.g., a product's reviews) by extracting representative sentences that capture the prevalent opinions in the review set. In online marketplaces, user reviews typically accumulate over time, and opinion summaries need to be updated periodically to provide customers with up-to-date information. In this work, we study the task of extractive opinion summarization in an incremental setting, where the underlying review set evolves over time. Many state-of-the-art extractive opinion summarization approaches are centrality-based, such as CentroidRank (Radev et al., 2004; Chowdhury et al., 2022). CentroidRank performs extractive summarization by selecting, as the summary, the subset of review sentences closest to the centroid in the representation space. However, these methods cannot operate efficiently in an incremental setting, where reviews arrive one at a time. In this paper, we present an efficient algorithm for accurately computing CentroidRank summaries in an incremental setting. Our approach, CoverSumm, relies on indexing review representations in a cover tree and maintaining a reservoir of candidate summary sentences. CoverSumm's efficacy is supported by a theoretical and empirical analysis of its running time. Empirically, on a diverse collection of data (both real data and synthetic data created to illustrate scaling considerations), we demonstrate that CoverSumm is up to 36x faster than baseline methods and is capable of adapting to nuanced changes in the data distribution. We also conduct human evaluations of the generated summaries and find that CoverSumm produces informative summaries that are consistent with the underlying review set.
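
To make the CentroidRank selection rule concrete, the following is a minimal Python sketch of picking the sentences closest to the centroid, together with a naive incremental baseline that rescans every stored sentence on each query. It is not CoverSumm itself: the cover tree index and the candidate reservoir are not shown. The use of Euclidean distance, NumPy arrays as sentence representations, and the names `centroid_rank` and `NaiveIncrementalCentroidRank` are assumptions made for illustration only.

```python
import numpy as np


def centroid_rank(embeddings: np.ndarray, k: int) -> list:
    """Return indices of the k rows closest to the centroid.

    Assumption: closeness is measured with Euclidean distance.
    """
    centroid = embeddings.mean(axis=0)
    dists = np.linalg.norm(embeddings - centroid, axis=1)
    # A stable sort keeps the earlier sentence first when distances tie.
    return np.argsort(dists, kind="stable")[:k].tolist()


class NaiveIncrementalCentroidRank:
    """Hypothetical baseline: O(d) centroid update, full rescan per query."""

    def __init__(self, dim: int):
        self.total = np.zeros(dim)  # running sum of all sentence vectors
        self.rows = []              # every sentence vector seen so far

    def add(self, vector) -> None:
        v = np.asarray(vector, dtype=float)
        self.rows.append(v)
        self.total += v

    def summary(self, k: int) -> list:
        # The centroid comes from the running sum, but every stored
        # sentence is still re-scored, which is linear in the review count.
        centroid = self.total / len(self.rows)
        dists = np.linalg.norm(np.vstack(self.rows) - centroid, axis=1)
        return np.argsort(dists, kind="stable")[:k].tolist()


if __name__ == "__main__":
    points = np.array([[0.0, 0.0], [10.0, 0.0], [4.0, 0.0], [5.0, 0.0], [6.0, 0.0]])
    print(centroid_rank(points, k=3))
```

The baseline shows where the cost lies: updating the centroid is cheap, but re-ranking all sentences after every arrival is not, and that per-update rescan is what an index over the representations is meant to avoid.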
