Keyphrase extraction (KPE) is an important Natural Language Processing task that aims to extract the keyphrases present in a given document. Many existing supervised methods treat KPE as sequence labeling, span-level classification, or generation. However, these methods cannot make use of keyphrase information, which may bias their results. In this study, we propose Diff-KPE, which uses a supervised Variational Information Bottleneck (VIB) to guide a text diffusion process that generates enhanced keyphrase representations. Diff-KPE first generates the desired keyphrase embeddings conditioned on the entire document and then injects them into each phrase representation. A ranking network and the VIB are then optimized jointly, with a ranking loss and a classification loss, respectively. This design lets Diff-KPE rank each candidate phrase using information from both the keyphrases and the document. Experiments show that Diff-KPE outperforms existing KPE methods on OpenKP, a large open-domain keyphrase extraction benchmark, and on KP20K, a scientific-domain dataset.
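The scoring and training objective described above can be illustrated with a small, dependency-free sketch. It is not the Diff-KPE implementation: the abstract does not specify how injection is performed or what form the losses take, so everything here is an assumption. Injection is modeled as plain concatenation, the ranking loss as a pairwise margin loss, and the VIB loss as binary cross-entropy plus a KL term against a standard normal prior. The diffusion step itself is omitted, and `keyphrase_vec` is a hand-written stand-in for its output. All function names, weights, and vectors are hypothetical.

```python
import math

def inject(phrase_vec, keyphrase_vec):
    # Assumption: "injection" is modeled as simple concatenation.
    return list(phrase_vec) + list(keyphrase_vec)

def linear_score(vec, weights):
    # Hypothetical ranking network: a single linear layer without bias.
    return sum(v * w for v, w in zip(vec, weights))

def margin_rank_loss(pos_scores, neg_scores, margin=1.0):
    # Pairwise hinge loss: each keyphrase should outscore each
    # non-keyphrase by at least `margin`.
    pairs = [(p, n) for p in pos_scores for n in neg_scores]
    return sum(max(0.0, margin - p + n) for p, n in pairs) / len(pairs)

def kl_to_standard_normal(mu, log_var):
    # KL( N(mu, diag(exp(log_var))) || N(0, I) ), the usual VIB regularizer.
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mu, log_var))

def binary_cross_entropy(logit, label):
    # Classification loss for "is this phrase a keyphrase?".
    p = 1.0 / (1.0 + math.exp(-logit))
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))

def vib_loss(logit, label, mu, log_var, beta=0.1):
    # Assumed supervised VIB objective: classification term plus
    # a beta-weighted compression term.
    return binary_cross_entropy(logit, label) + beta * kl_to_standard_normal(mu, log_var)

# Toy data: two candidate phrases with 2-d representations.
keyphrase_vec = [0.5, -0.5]          # stand-in for the diffusion output
phrases = {"neural network": [1.0, 0.0], "the": [0.0, 1.0]}
labels = {"neural network": 1, "the": 0}
rank_w = [1.0, -1.0, 0.5, 0.5]       # hypothetical ranking weights

# Score every candidate after injecting the keyphrase embedding.
scores = {p: linear_score(inject(v, keyphrase_vec), rank_w)
          for p, v in phrases.items()}
ranked = sorted(scores, key=scores.get, reverse=True)

# Ranking loss over (keyphrase, non-keyphrase) score pairs.
pos = [scores[p] for p in phrases if labels[p] == 1]
neg = [scores[p] for p in phrases if labels[p] == 0]
rank_loss = margin_rank_loss(pos, neg)
```

In a real system the two losses would be summed and minimized by gradient descent over learned encoders. The sketch only shows how the pieces could fit together for a single document.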