State-of-the-art models for keyphrase generation require large amounts of training data to achieve good performance. However, obtaining keyphrase-labeled documents can be challenging and costly. To address this issue, we present a self-compositional data augmentation method. More specifically, we measure the relatedness of training documents based on their shared keyphrases, and combine similar documents to generate synthetic samples. The advantage of our method lies in its ability to create additional training samples that preserve domain coherence, without relying on external data or resources. Our results on multiple datasets spanning three different domains demonstrate that our method consistently improves keyphrase generation. A qualitative analysis of the keyphrases generated for the Computer Science domain confirms this improvement with respect to how representative they are of the documents.
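The idea of pairing documents by shared keyphrases and combining them can be illustrated with a minimal sketch. This is not the method's actual implementation: the Jaccard overlap used as the relatedness measure, the `threshold` value, the plain concatenation of texts, and the names `keyphrase_overlap` and `augment` are all assumptions made for illustration only.

```python
from itertools import combinations


def keyphrase_overlap(a, b):
    """Jaccard overlap between two keyphrase lists (assumed relatedness measure)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def augment(docs, threshold=0.3):
    """Build synthetic samples from pairs of documents that share keyphrases.

    docs: list of dicts with a 'text' string and a 'keyphrases' list.
    threshold: hypothetical cut-off on the overlap score.
    """
    synthetic = []
    for d1, d2 in combinations(docs, 2):
        if keyphrase_overlap(d1["keyphrases"], d2["keyphrases"]) >= threshold:
            # Merge keyphrases, dropping duplicates while keeping their order.
            merged = list(dict.fromkeys(d1["keyphrases"] + d2["keyphrases"]))
            # Assumption: the two texts are simply concatenated.
            synthetic.append({
                "text": d1["text"] + " " + d2["text"],
                "keyphrases": merged,
            })
    return synthetic


docs = [
    {"text": "Doc A.", "keyphrases": ["keyphrase generation", "data augmentation"]},
    {"text": "Doc B.", "keyphrases": ["data augmentation", "neural networks"]},
    {"text": "Doc C.", "keyphrases": ["protein folding"]},
]
samples = augment(docs)
```

Here only the first two documents share a keyphrase, so they are the only pair combined. Because every synthetic sample is built from the training set itself, no external data is needed, which matches the motivation stated above.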