In this work, we study unsupervised open-domain keyphrase generation, where the objective is a keyphrase generation model that can be built without human-labeled data and that performs consistently across domains. To this end, we propose a seq2seq model consisting of two modules, a \textit{phraseness} module and an \textit{informativeness} module, both of which can be built in an unsupervised and open-domain fashion. The phraseness module generates phrases, while the informativeness module guides generation towards phrases that represent the core concepts of the text. We thoroughly evaluate the proposed method on eight benchmark datasets from different domains. On in-domain datasets, our approach achieves state-of-the-art results among existing unsupervised models and narrows the overall gap between supervised and unsupervised methods to about 16\%. Furthermore, we demonstrate that our model performs consistently across domains, as it surpasses the baselines overall on out-of-domain datasets.