We present RepRank, an unsupervised graph-based ranking model for extractive multi-document summarization in which the similarities between words, between sentences, and between words and sentences are estimated from the distances between their vector representations in a unified vector space. To obtain suitable representations, we propose a self-attention-based learning method that represents a sentence as the weighted sum of its word embeddings, with the weights concentrated on the words that are expected to better reflect the content of a document. We show that salient sentences and keywords can be extracted jointly through a mutual reinforcement process using the learned representations, and we prove that this process always converges to a unique solution, which leads to improved performance. We also describe a variant of the absorbing random walk and a corresponding sampling-based algorithm that avoid redundancy and increase diversity in the summaries. Experimental results on multiple benchmark datasets show that RepRank achieves the best or comparable performance in terms of ROUGE.
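As a rough illustration of two ideas in the abstract, the sketch below builds each sentence vector as an attention-weighted sum of its word embeddings and then ranks sentences and words jointly on one similarity graph. It is not the RepRank algorithm itself: the attention scores are supplied by hand rather than learned, and the joint ranking uses a damped, PageRank-style power iteration as a stand-in for the paper's mutual reinforcement process. The function names (`sentence_vectors`, `cosine_sim`, `joint_rank`) and the `damping` parameter are assumptions made for this example.

```python
import numpy as np


def sentence_vectors(word_vecs, sentences, scores):
    """Represent each sentence as an attention-weighted sum of its word vectors.

    word_vecs: (V, d) array of word embeddings.
    sentences: list of lists of word indices, one list per sentence.
    scores:    (V,) unnormalized attention scores (hypothetical; given, not learned).
    """
    out = np.zeros((len(sentences), word_vecs.shape[1]))
    for i, idx in enumerate(sentences):
        idx = np.asarray(idx)
        weights = np.exp(scores[idx] - scores[idx].max())  # numerically stable softmax
        weights /= weights.sum()
        out[i] = weights @ word_vecs[idx]
    return out


def cosine_sim(a, b):
    """Cosine similarity clipped to be non-negative (assumes no all-zero rows)."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return np.clip(a @ b.T, 0.0, None)


def joint_rank(sent_vecs, word_vecs, damping=0.85, tol=1e-10, max_iter=1000):
    """Rank sentences and words together on a single similarity graph.

    Stand-in technique: damped power iteration on a column-stochastic matrix.
    Damping makes the matrix strictly positive, so the fixed point is unique.
    """
    nodes = np.vstack([sent_vecs, word_vecs])
    sim = cosine_sim(nodes, nodes)  # sentence-sentence, sentence-word, word-word
    trans = sim / sim.sum(axis=0, keepdims=True)  # each column sums to 1
    n = trans.shape[0]
    rank = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        new_rank = damping * (trans @ rank) + (1.0 - damping) / n
        converged = np.abs(new_rank - rank).sum() < tol
        rank = new_rank
        if converged:
            break
    k = sent_vecs.shape[0]
    return rank[:k], rank[k:]


# Toy data: 6 words in a 4-dimensional space, 3 sentences given as word indices.
rng = np.random.default_rng(0)
word_vecs = rng.normal(size=(6, 4))
sentences = [[0, 1, 2], [2, 3], [3, 4, 5]]
scores = rng.normal(size=6)

sent_vecs = sentence_vectors(word_vecs, sentences, scores)
sent_rank, word_rank = joint_rank(sent_vecs, word_vecs)
print("sentence order:", np.argsort(-sent_rank))
print("keyword order:", np.argsort(-word_rank))
```

Placing sentences and words in the same graph is what allows a highly ranked word to raise the score of sentences close to it, and vice versa. The redundancy-avoiding absorbing random walk mentioned in the abstract is not covered by this sketch.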