Keyphrase generation (KG) aims to generate a set of summarizing words or phrases given a source document, while keyphrase extraction (KE) aims to identify them from the text. Because the search space is much smaller in KE, it is often combined with KG to predict keyphrases that may or may not exist in the corresponding document. However, current unified approaches adopt sequence labeling and maximization-based generation, which operate primarily at the token level and therefore fall short of observing and scoring each keyphrase as a whole. In this work, we propose SimCKP, a simple contrastive learning framework that consists of two stages: 1) an extractor-generator that extracts keyphrases by learning context-aware phrase-level representations in a contrastive manner while also generating keyphrases that do not appear in the document; 2) a reranker that adjusts the score of each generated phrase by likewise aligning its representation with the corresponding document. Experimental results on multiple benchmark datasets demonstrate the effectiveness of our proposed approach, which outperforms state-of-the-art models by a significant margin.
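To make the phrase-level scoring idea concrete, the following is a minimal sketch of contrastive alignment between a document representation and candidate phrase representations. It is an illustration under stated assumptions, not the SimCKP objective itself: the InfoNCE-style loss, the cosine similarity, the `temperature` value, the function names, and the toy two-dimensional vectors are all hypothetical choices made for this example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def contrastive_loss(doc_vec, phrase_vecs, positive_idx, temperature=0.1):
    """InfoNCE-style loss (an assumed stand-in, not the paper's exact loss).

    Gold keyphrases (positive_idx) are pulled toward the document vector,
    while all other candidate phrases act as negatives.
    """
    logits = [cosine(doc_vec, p) / temperature for p in phrase_vecs]
    # Numerically stable log-sum-exp over all candidates.
    max_logit = max(logits)
    log_denom = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
    # Average negative log-likelihood over the positive phrases.
    return -sum(logits[i] - log_denom for i in positive_idx) / len(positive_idx)

def rank_phrases(doc_vec, phrases, phrase_vecs):
    """Rank whole phrases by similarity to the document, highest first."""
    scored = [(cosine(doc_vec, v), p) for p, v in zip(phrases, phrase_vecs)]
    return [p for _, p in sorted(scored, reverse=True)]

# Toy data: hypothetical embeddings, chosen only for illustration.
doc = [1.0, 0.0]
phrases = ["contrastive learning", "keyphrase generation", "weather report"]
vecs = [[0.9, 0.1], [0.8, 0.3], [-0.2, 1.0]]

ranking = rank_phrases(doc, phrases, vecs)
loss_aligned = contrastive_loss(doc, vecs, positive_idx=[0, 1])
loss_misaligned = contrastive_loss(doc, vecs, positive_idx=[2])
```

In this sketch each candidate is scored as a single unit rather than token by token, which is the contrast with sequence labeling that the abstract draws. The loss is lower when the gold phrases are the ones already close to the document vector, so minimizing it would push the encoder toward that arrangement.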