Keyphrase extraction aims to automatically extract a list of "important" phrases that represent the key concepts in a document. Prior approaches to unsupervised keyphrase extraction relied on heuristic notions of phrase importance via embedding clustering or graph centrality, which requires extensive domain expertise. Our work presents a simple alternative that defines keyphrases as the document phrases that are salient for predicting the topic of the document. To this end, we propose INSPECT, an approach that uses self-explaining models to identify influential keyphrases in a document by measuring the predictive impact of input phrases on the downstream task of document topic classification. We show that this novel method not only alleviates the need for ad-hoc heuristics but also achieves state-of-the-art results in unsupervised keyphrase extraction on four datasets across two domains: scientific publications and news articles.
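The idea of scoring a phrase by its predictive impact on topic classification can be illustrated with a minimal sketch. This is not the INSPECT model: it swaps the self-explaining model for a simple leave-one-out (occlusion) test against a toy classifier whose word weights are set by hand. The names `TOPIC_WEIGHTS`, `topic_probs`, `remove_phrase`, and `rank_phrases` are hypothetical and exist only for this illustration.

```python
import math

# Hypothetical toy topic model: hand-set word weights per topic.
# An assumption for illustration only; nothing here is learned or taken from the paper.
TOPIC_WEIGHTS = {
    "science": {"neural": 2.0, "network": 1.5, "training": 1.0},
    "sports": {"match": 2.0, "goal": 1.5, "team": 1.0},
}


def topic_probs(tokens):
    """Return a softmax distribution over topics from summed word weights."""
    scores = {
        topic: sum(weights.get(tok, 0.0) for tok in tokens)
        for topic, weights in TOPIC_WEIGHTS.items()
    }
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {topic: math.exp(s - top) for topic, s in scores.items()}
    total = sum(exps.values())
    return {topic: e / total for topic, e in exps.items()}


def remove_phrase(tokens, words):
    """Return tokens with every contiguous occurrence of `words` removed."""
    out, i, n = [], 0, len(words)
    while i < len(tokens):
        if tokens[i:i + n] == words:
            i += n
        else:
            out.append(tokens[i])
            i += 1
    return out


def rank_phrases(tokens, candidate_phrases):
    """Rank phrases by how much removing each one lowers the predicted topic's probability."""
    probs = topic_probs(tokens)
    predicted = max(probs, key=probs.get)
    base = probs[predicted]
    scored = []
    for phrase in candidate_phrases:
        reduced = remove_phrase(tokens, phrase.split())
        drop = base - topic_probs(reduced)[predicted]
        scored.append((phrase, drop))
    scored.sort(key=lambda item: item[1], reverse=True)
    return predicted, scored
```

Calling `rank_phrases(document.split(), candidates)` returns the predicted topic and the candidates sorted by influence. A phrase whose removal barely changes (or even raises) the predicted topic's probability ranks low, which mirrors the notion of a keyphrase as a phrase that is salient for predicting the document's topic.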