Pre-trained language models are increasingly being used in multi-document summarization tasks. However, these models require large-scale corpora for pre-training and are domain-dependent. Other non-neural unsupervised summarization approaches mostly rely on key-sentence extraction, which can lead to information loss. To address these challenges, we propose a lightweight yet effective unsupervised approach called GLIMMER: a Graph and LexIcal features based unsupervised Multi-docuMEnt summaRization approach. It first constructs a sentence graph from the source documents, then automatically identifies semantic clusters by mining low-level features from raw texts, thereby improving intra-cluster correlation and the fluency of generated sentences. Finally, it summarizes the clusters into natural sentences. Experiments conducted on Multi-News, Multi-XScience, and DUC-2004 demonstrate that our approach outperforms existing unsupervised approaches. Furthermore, it surpasses state-of-the-art pre-trained multi-document summarization models (e.g., PEGASUS and PRIMERA) under zero-shot settings in terms of ROUGE scores. Additionally, human evaluations indicate that summaries generated by GLIMMER achieve high readability and informativeness scores. Our code is available at https://github.com/Oswald1997/GLIMMER.
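To make the pipeline described above concrete, the following is a minimal sketch of the sentence-graph and clustering steps, assuming a simple Jaccard word-overlap measure as the lexical feature and connected components as the cluster criterion. These choices, and all function names, are illustrative assumptions for exposition; they are not GLIMMER's actual features or clustering procedure.

```python
# Illustrative sketch: sentence graph -> semantic clusters.
# The Jaccard similarity and connected-components clustering here are
# stand-in assumptions, not the paper's actual method.
from collections import deque

def jaccard(a, b):
    """Word-overlap similarity between two sentences (a simple lexical feature)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def build_sentence_graph(sentences, threshold=0.2):
    """Connect sentence pairs whose lexical similarity exceeds a threshold."""
    n = len(sentences)
    adj = {i: [] for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if jaccard(sentences[i], sentences[j]) >= threshold:
                adj[i].append(j)
                adj[j].append(i)
    return adj

def clusters(adj):
    """Connected components of the sentence graph act as rough semantic clusters."""
    seen, comps = set(), []
    for start in adj:
        if start in seen:
            continue
        comp, queue = [], deque([start])
        seen.add(start)
        while queue:
            u = queue.popleft()
            comp.append(u)
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    queue.append(v)
        comps.append(sorted(comp))
    return comps

sents = [
    "The storm hit the coast on Monday.",
    "A powerful storm hit the coast early Monday.",
    "Officials announced new tax rules.",
]
print(clusters(build_sentence_graph(sents)))  # → [[0, 1], [2]]
```

In the full approach, each cluster would then be compressed into a natural summary sentence; this sketch stops at the clustering stage, which is where the graph and lexical features interact.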