Enhancing Automatic Keyphrase Labelling with Text-to-Text Transfer Transformer (T5) Architecture: A Framework for Keyphrase Generation and Filtering

Automatic keyphrase labelling refers to the ability of models to retrieve words or short phrases that adequately describe a document's content. Previous work has focused largely on extractive techniques for this task; however, these methods cannot produce keyphrases that do not appear in the text. This limitation has recently motivated keyphrase generation approaches. This paper presents a keyphrase generation model based on the Text-to-Text Transfer Transformer (T5) architecture. Taking a document's title and abstract as input, we train a T5 model to generate keyphrases that adequately describe its content. We name this model docT5keywords. In addition to the classic inference approach, where the output sequence is taken directly as the prediction, we report results from a majority voting approach, in which multiple sequences are generated and the keyphrases are ranked by their frequency of occurrence across these sequences. Along with this model, we present a novel keyphrase filtering technique based on the T5 architecture: we train a T5 model to learn whether a given keyphrase is relevant to a document. We devise two evaluation methodologies to assess our model's capability to filter inadequate keyphrases. First, we perform a binary evaluation in which the model has to predict whether a keyphrase is relevant to a given document. Second, we filter the keyphrases predicted by several automatic keyphrase generation (AKG) models and check whether the evaluation scores improve. Experimental results demonstrate that our keyphrase generation model significantly outperforms all the baselines, with gains exceeding 100% in some cases. The proposed filtering technique also achieves near-perfect accuracy in eliminating false positives across all datasets.
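
The majority voting step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `rank_by_majority_vote`, the `;` separator between keyphrases, the lowercasing, and the alphabetical tie-break are all assumptions made for illustration, and the sampled sequences are hand-written stand-ins for model outputs.

```python
from collections import Counter


def rank_by_majority_vote(sequences, sep=";", top_k=None):
    """Rank keyphrases by the number of generated sequences containing them.

    `sequences` is a list of strings, each one a generated output in which
    keyphrases are joined by `sep` (an assumed output format).
    """
    counts = Counter()
    for seq in sequences:
        seen = []
        for phrase in seq.split(sep):
            # Normalise case and whitespace so near-identical strings match.
            phrase = " ".join(phrase.lower().split())
            # Count each phrase at most once per sequence.
            if phrase and phrase not in seen:
                seen.append(phrase)
        counts.update(seen)
    # Most frequent first; ties are broken alphabetically (an assumption).
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [phrase for phrase, _ in ranked[:top_k]]


# Hand-written stand-ins for several sampled output sequences.
samples = [
    "keyphrase generation; t5; transformers",
    "t5; keyphrase generation; text mining",
    "keyphrase generation; T5 ; filtering",
]
top3 = rank_by_majority_vote(samples, top_k=3)
print(top3)
```

Counting each keyphrase once per sequence makes the score a vote count rather than a raw token frequency, so a phrase repeated within a single output does not outrank one that appears consistently across outputs.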
