This paper proposes a method of abstractive summarization designed to scale to document collections rather than individual documents. Our approach combines semantic clustering, document size reduction within topic clusters, semantic chunking of each cluster's documents, GPT-based summarization and concatenation, and a combined sentiment and text visualization of each topic to support exploratory data analysis. We compared our results against the existing state-of-the-art systems BART, BRIO, PEGASUS, and MoCa using ROUGE summary scores. The comparison showed performance statistically equivalent to BART and PEGASUS on the CNN/Daily Mail test dataset, and to BART on the Gigaword test dataset. This finding is promising, since we view document collection summarization as more challenging than individual document summarization. We conclude with a discussion of how issues of scale affect the practical application of this approach.
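The cluster, chunk, summarize, and concatenate steps named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and every function name in it is hypothetical. Bag-of-words cosine similarity stands in for semantic clustering, a sentence-packing word budget stands in for semantic chunking, and a first-sentence placeholder stands in for the GPT summarizer. Document size reduction and the sentiment and text visualization are omitted.

```python
import math
import re
from collections import Counter
from typing import Callable, List


def bag_of_words(text: str) -> Counter:
    """Lowercased word counts; a crude stand-in for a semantic embedding."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster_documents(docs: List[str], threshold: float = 0.3) -> List[List[str]]:
    """Greedy clustering: a document joins the first cluster whose seed
    document is at least `threshold` similar, else it seeds a new cluster."""
    clusters: List[List[str]] = []
    seeds: List[Counter] = []
    for doc in docs:
        vec = bag_of_words(doc)
        for i, seed in enumerate(seeds):
            if cosine(vec, seed) >= threshold:
                clusters[i].append(doc)
                break
        else:
            clusters.append([doc])
            seeds.append(vec)
    return clusters


def split_sentences(text: str) -> List[str]:
    """Split on whitespace that follows sentence-ending punctuation."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def chunk_cluster(cluster: List[str], max_words: int = 50) -> List[str]:
    """Pack whole sentences into chunks of at most `max_words` words.
    A single sentence longer than the budget becomes its own chunk."""
    chunks: List[str] = []
    current: List[str] = []
    size = 0
    for doc in cluster:
        for sentence in split_sentences(doc):
            n = len(sentence.split())
            if current and size + n > max_words:
                chunks.append(" ".join(current))
                current, size = [], 0
            current.append(sentence)
            size += n
    if current:
        chunks.append(" ".join(current))
    return chunks


def summarize_collection(
    docs: List[str],
    summarize: Callable[[str], str],
    threshold: float = 0.3,
    max_words: int = 50,
) -> List[str]:
    """Return one summary per cluster: summarize each chunk, then concatenate."""
    summaries = []
    for cluster in cluster_documents(docs, threshold):
        parts = [summarize(chunk) for chunk in chunk_cluster(cluster, max_words)]
        summaries.append(" ".join(parts))
    return summaries


def first_sentence(chunk: str) -> str:
    """Placeholder summarizer (assumption): returns the chunk's first sentence.
    A real pipeline would call a language model here instead."""
    return split_sentences(chunk)[0]
```

Calling `summarize_collection(docs, first_sentence)` returns one concatenated summary string per topic cluster. Passing the summarizer as a callable keeps the clustering and chunking logic independent of whichever model produces the per-chunk summaries.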