Keyphrase extraction is a fundamental task in natural language processing and information retrieval that aims to extract a set of phrases carrying important information from a source document. Identifying important keyphrases is the central component of this task, and the main challenge is how to represent information comprehensively and discriminate importance accurately. To address these issues, we design a new hyperbolic matching model (HyperMatch) that represents phrases and documents in the same hyperbolic space and explicitly estimates phrase-document relevance via the Poincaré distance, which serves as the importance score of each phrase. Specifically, to capture hierarchical syntactic and semantic structure, HyperMatch takes the hidden representations from multiple layers of RoBERTa and integrates them into word embeddings via an adaptive mixing layer. Meanwhile, considering the hierarchical structure hidden in the document, HyperMatch embeds both phrases and documents in the same hyperbolic space via a hyperbolic phrase encoder and a hyperbolic document encoder. This strategy can further improve the estimation of phrase-document relevance, because hyperbolic space is well suited to representing hierarchical structure. In this setting, keyphrase extraction can be treated as a matching problem and implemented effectively by minimizing a hyperbolic margin-based triplet loss. Extensive experiments on six benchmarks demonstrate that HyperMatch outperforms state-of-the-art baselines.
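To make the matching formulation concrete, the following is a minimal sketch of the two quantities the abstract names: the Poincaré distance between a document and a phrase, and a margin-based triplet loss built on it. It is an illustration under stated assumptions, not the HyperMatch implementation. The function names, the default margin, the plain-list vectors, and the convention that a smaller distance means a more relevant phrase are all assumptions made here; the distance formula is the standard one for the unit Poincaré ball.

```python
import math


def poincare_distance(u, v, eps=1e-9):
    """Standard distance between two points strictly inside the unit Poincaré ball."""
    sq_u = sum(x * x for x in u)
    sq_v = sum(x * x for x in v)
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    # Clamp the denominator so points near the boundary do not divide by zero.
    denom = max((1.0 - sq_u) * (1.0 - sq_v), eps)
    return math.acosh(1.0 + 2.0 * sq_diff / denom)


def triplet_loss(doc, pos_phrase, neg_phrase, margin=1.0):
    """Hinge-style triplet loss (assumed form): push the keyphrase closer to the
    document than the non-keyphrase by at least `margin`."""
    d_pos = poincare_distance(doc, pos_phrase)
    d_neg = poincare_distance(doc, neg_phrase)
    return max(0.0, margin + d_pos - d_neg)


def rank_phrases(doc, phrases):
    """Return candidate names sorted by ascending distance to the document
    (assumption: smaller distance = higher importance)."""
    return sorted(phrases, key=lambda name: poincare_distance(doc, phrases[name]))


# Hypothetical 2-D embeddings, chosen only for illustration.
doc = [0.0, 0.0]
candidates = {"near": [0.5, 0.0], "far": [0.9, 0.0]}
ranking = rank_phrases(doc, candidates)
```

In this toy setup the loss is zero when the keyphrase is already closer to the document than the non-keyphrase by more than the margin, and positive otherwise. A trained model would compute these distances on the outputs of the phrase and document encoders rather than on hand-picked points.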