Keyphrase extraction is the task of automatically selecting phrases from a document that summarize its core content. State-of-the-art (SOTA) performance has recently been achieved by embedding-based algorithms, which rank candidates by the similarity between their embeddings and the document embedding. However, such solutions either struggle with the length discrepancy between documents and candidates or, without further fine-tuning, fail to fully utilize the pre-trained language model (PLM). To address this, we propose PromptRank, a simple yet effective unsupervised approach based on a PLM with an encoder-decoder architecture. Specifically, PromptRank feeds the document into the encoder and uses the decoder to calculate the probability of generating each candidate with a designed prompt. We extensively evaluate PromptRank on six widely used benchmarks. PromptRank outperforms the SOTA approach MDERank, with relative F1 improvements of 34.18%, 24.87%, and 17.57% for 5, 10, and 15 returned results, respectively. This demonstrates the great potential of using prompts for unsupervised keyphrase extraction. We release our code at https://github.com/HLT-NLP/PromptRank.
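The ranking step described in the abstract (score each candidate by the probability that the decoder generates it after a prompt, then sort) can be illustrated with a minimal sketch. This is not the released PromptRank implementation: the `TokenLogProb` interface, the function names, and the prompt wording are assumptions made for illustration, and `toy_unigram_log_prob` is a stand-in for a real encoder-decoder PLM so that the example runs without any model.

```python
import math
from collections import Counter
from typing import Callable, List, Sequence, Tuple

# Hypothetical interface (an assumption, not the authors' API): given the
# document, the decoder tokens produced so far, and a next token, return
# log P(token | document, decoder prefix).
TokenLogProb = Callable[[Sequence[str], Sequence[str], str], float]


def sequence_log_prob(
    document: Sequence[str],
    decoder_prefix: Sequence[str],
    candidate: Sequence[str],
    token_log_prob: TokenLogProb,
) -> float:
    """Sum the log-probabilities of the candidate tokens, decoded after the prompt."""
    prefix = list(decoder_prefix)
    total = 0.0
    for token in candidate:
        total += token_log_prob(document, prefix, token)
        prefix.append(token)  # the next token is conditioned on this one
    return total


def rank_candidates(
    document: Sequence[str],
    candidates: Sequence[Sequence[str]],
    token_log_prob: TokenLogProb,
    # Illustrative prompt wording; the abstract does not specify the prompt.
    decoder_prompt: Sequence[str] = ("this", "text", "mainly", "talks", "about"),
) -> List[Tuple[Tuple[str, ...], float]]:
    """Return (candidate, score) pairs sorted from most to least probable."""
    scored = [
        (tuple(c), sequence_log_prob(document, decoder_prompt, c, token_log_prob))
        for c in candidates
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def toy_unigram_log_prob(
    document: Sequence[str], prefix: Sequence[str], token: str
) -> float:
    """Stand-in scorer: add-one smoothed unigram frequency in the document.

    It ignores the decoder prefix entirely; a real PLM decoder would not.
    """
    counts = Counter(document)
    vocab_size = len(counts) + 1  # one extra slot for unseen tokens
    return math.log((counts[token] + 1) / (len(document) + vocab_size))


if __name__ == "__main__":
    doc = "keyphrase extraction selects phrases that summarize a document".split()
    cands = [["neural", "network"], ["keyphrase", "extraction"]]
    for phrase, score in rank_candidates(doc, cands, toy_unigram_log_prob):
        print(" ".join(phrase), round(score, 3))
```

To try this with an actual model, `toy_unigram_log_prob` would be replaced by a function that reads token log-probabilities from an encoder-decoder PLM. Note that a plain sum of log-probabilities favors shorter candidates; the sketch makes no attempt to correct for candidate length.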