Pre-trained Language Models for Keyphrase Generation: A Thorough Empirical Study

From arXiv (technical report). The contents are published in two separate papers at EMNLP 2023 (arXiv:2310.06374) and LREC-COLING 2024 (arXiv:2402.14052).

Neural models that do not rely on pre-training have excelled at keyphrase generation when large annotated datasets are available. Meanwhile, newer approaches have incorporated pre-trained language models (PLMs) for their data efficiency. However, a systematic study of how the two types of approaches compare, and of how different design choices affect the performance of PLM-based models, is still lacking. To fill this knowledge gap and facilitate a more informed use of PLMs for keyphrase extraction and keyphrase generation, we present an in-depth empirical study. Formulating keyphrase extraction as sequence labeling and keyphrase generation as sequence-to-sequence generation, we perform extensive experiments in three domains. After showing that PLMs have competitive high-resource performance and state-of-the-art low-resource performance, we investigate important design choices, including in-domain PLMs, PLMs with different pre-training objectives, PLMs under a parameter budget, and different formulations for present keyphrases. Further results show that (1) in-domain BERT-like PLMs can be used to build strong and data-efficient keyphrase generation models; (2) with a fixed parameter budget, prioritizing model depth over width and allocating more layers to the encoder leads to better encoder-decoder models; and (3) by introducing four in-domain PLMs, we achieve competitive performance in the news domain and state-of-the-art performance in the scientific domain.
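To make the two formulations named in the abstract concrete, the following is a minimal sketch of how one document could be turned into training targets for each. It is an illustration only, not the authors' preprocessing: the function names, the B/I/O tag set, exact token matching (no lowercasing or stemming), the present-then-absent ordering, and the ` ; ` separator are all assumptions made for this example.

```python
from typing import List, Tuple


def find_spans(tokens: List[str], phrase: List[str]) -> List[int]:
    """Return every start index where `phrase` occurs verbatim in `tokens`."""
    n = len(phrase)
    if n == 0:
        return []
    return [s for s in range(len(tokens) - n + 1) if tokens[s:s + n] == phrase]


def bio_labels(tokens: List[str], keyphrases: List[List[str]]) -> List[str]:
    """Sequence-labeling target: one B/I/O tag per token (assumed tag set)."""
    labels = ["O"] * len(tokens)
    for phrase in keyphrases:
        for start in find_spans(tokens, phrase):
            end = start + len(phrase)
            # Skip a match that overlaps a span that is already tagged.
            if all(label == "O" for label in labels[start:end]):
                labels[start] = "B"
                for i in range(start + 1, end):
                    labels[i] = "I"
    return labels


def split_present_absent(
    tokens: List[str], keyphrases: List[List[str]]
) -> Tuple[List[List[str]], List[List[str]]]:
    """Present keyphrases occur in the text; absent ones do not."""
    present = [p for p in keyphrases if find_spans(tokens, p)]
    absent = [p for p in keyphrases if not find_spans(tokens, p)]
    return present, absent


def seq2seq_target(
    tokens: List[str], keyphrases: List[List[str]], sep: str = " ; "
) -> str:
    """Sequence-to-sequence target: keyphrases joined into one string.

    The separator and the present-then-absent order are assumptions.
    """
    present, absent = split_present_absent(tokens, keyphrases)
    return sep.join(" ".join(p) for p in present + absent)


tokens = "we study keyphrase generation with pre-trained language models".split()
keyphrases = [
    ["keyphrase", "generation"],
    ["language", "models"],
    ["empirical", "study"],  # never appears as a phrase in the text: absent
]

labels = bio_labels(tokens, keyphrases)
target = seq2seq_target(tokens, keyphrases)
```

In this sketch the labeling target can only mark the two keyphrases that occur in the text, while the generation target also contains the absent keyphrase "empirical study". That difference is why extraction fits sequence labeling and generation needs a sequence-to-sequence model.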
