Automatic keyphrase labelling is the task of retrieving words or short phrases that adequately describe a document's content. Previous work has put much effort into extractive techniques for this task; however, these methods cannot produce keyphrases that do not appear in the text. To overcome this limitation, keyphrase generation approaches have emerged recently. This paper presents a keyphrase generation model based on the Text-to-Text Transfer Transformer (T5) architecture. Taking a document's title and abstract as input, we train a T5 model to generate keyphrases that adequately describe its content. We name this model docT5keywords. We not only perform the classic inference approach, where the output sequence is taken directly as the prediction, but also report results from a majority voting approach, in which multiple sequences are generated and the keyphrases are ranked by their frequency of occurrence across these sequences. Along with this model, we present a novel keyphrase filtering technique based on the T5 architecture: we train a T5 model to predict whether a given keyphrase is relevant to a document. We devise two evaluation methodologies to assess our model's ability to filter out inadequate keyphrases. First, we perform a binary evaluation in which the model has to predict whether a keyphrase is relevant to a given document. Second, we filter the keyphrases predicted by several automatic keyphrase generation (AKG) models and check whether the evaluation scores improve. Experimental results demonstrate that our keyphrase generation model significantly outperforms all the baselines, with gains exceeding 100% in some cases. The proposed filtering technique also achieves near-perfect accuracy in eliminating false positives across all datasets.
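
The majority voting step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `rank_by_majority_vote`, the comma separator, the lowercase normalisation, the choice to count a keyphrase at most once per generated sequence, and the tie-breaking by first appearance are all assumptions made for illustration. The input is assumed to be a list of strings already decoded from the model, one per sampled sequence.

```python
from collections import Counter


def rank_by_majority_vote(sequences, separator=",", top_k=None):
    """Rank keyphrases by how many generated sequences contain them.

    `sequences` is a list of decoded output strings, each assumed to hold
    keyphrases joined by `separator` (a hypothetical output format).
    """
    votes = Counter()
    first_seen = {}
    for sequence in sequences:
        # Count each keyphrase at most once per sequence (an assumption).
        seen_in_sequence = set()
        for raw in sequence.split(separator):
            # Normalise case and collapse internal whitespace.
            phrase = " ".join(raw.lower().split())
            if not phrase or phrase in seen_in_sequence:
                continue
            seen_in_sequence.add(phrase)
            votes[phrase] += 1
            # Remember the global order of first appearance for tie-breaking.
            first_seen.setdefault(phrase, len(first_seen))
    # Sort by descending vote count, then by order of first appearance.
    ranked = sorted(votes, key=lambda p: (-votes[p], first_seen[p]))
    return ranked[:top_k] if top_k is not None else ranked
```

For example, given the three hypothetical sequences `"keyphrase generation, t5, transformers"`, `"t5, keyphrase generation, filtering"`, and `"T5, text summarization"`, calling `rank_by_majority_vote(sequences, top_k=2)` places `t5` (three votes) ahead of `keyphrase generation` (two votes).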