Keyphrase extraction (KPE), which aims to extract the keyphrases present in a given document, is an important Natural Language Processing task with many applications. Many existing supervised methods treat KPE as a sequence labeling, span-level classification, or generation task. However, these methods cannot exploit keyphrase information, which may bias their results. In this study, we propose Diff-KPE, which leverages a supervised Variational Information Bottleneck (VIB) to guide the text diffusion process toward generating enhanced keyphrase representations. Diff-KPE first generates the desired keyphrase embeddings conditioned on the entire document and then injects the generated keyphrase embeddings into each phrase representation. A ranking network and the VIB are then optimized jointly, with a rank loss and a classification loss, respectively. This design allows Diff-KPE to rank each candidate phrase using information from both the keyphrases and the document. Experiments show that Diff-KPE outperforms existing KPE methods on OpenKP, a large open-domain keyphrase extraction benchmark, and on KP20K, a scientific-domain dataset.
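The joint objective described above can be illustrated with a minimal sketch. It is not the Diff-KPE implementation: the diffusion module is omitted entirely, and `keyphrase_vec` is a hypothetical placeholder for its output. Injection by concatenation, the linear scorer, the pairwise hinge form of the rank loss, the Gaussian VIB with a standard-normal prior, and the names `inject`, `score`, `margin_rank_loss`, `vib_loss`, `margin`, and `beta` are all assumptions made for illustration.

```python
import math

def inject(phrase_vecs, keyphrase_vec):
    """Append the generated keyphrase embedding to every phrase vector
    (concatenation is an assumed injection scheme)."""
    return [p + keyphrase_vec for p in phrase_vecs]

def score(vec, weights):
    """Linear score; a stand-in for the ranking network."""
    return sum(v * w for v, w in zip(vec, weights))

def margin_rank_loss(scores, labels, margin=1.0):
    """Mean hinge loss over all (keyphrase, non-keyphrase) score pairs."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    pairs = [(p, n) for p in pos for n in neg]
    if not pairs:
        return 0.0
    return sum(max(0.0, margin - p + n) for p, n in pairs) / len(pairs)

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) )."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mu, log_var))

def bce_with_logit(logit, label):
    """Numerically stable binary cross-entropy for a single logit."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))

def vib_loss(logit, label, mu, log_var, beta=1e-3):
    """Classification loss plus a beta-weighted compression (KL) term."""
    return bce_with_logit(logit, label) + beta * kl_to_standard_normal(mu, log_var)

# Toy data: two candidate phrases, the first of which is a keyphrase.
phrase_vecs = [[1.0, 0.0], [0.0, 1.0]]
keyphrase_vec = [0.5, 0.5]            # placeholder for the diffusion output
weights = [1.0, -1.0, 0.2, 0.2]       # hypothetical ranking weights
labels = [1, 0]

enhanced = inject(phrase_vecs, keyphrase_vec)
scores = [score(v, weights) for v in enhanced]
total_loss = margin_rank_loss(scores, labels) + sum(
    vib_loss(s, y, mu=[0.0, 0.0], log_var=[0.0, 0.0])
    for s, y in zip(scores, labels)
)
```

In this sketch the two losses are simply summed; how the actual method weights the rank, classification, and diffusion terms is not stated in the abstract.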