The citation graph is essential for generating high-quality summaries of scientific papers: a paper's references, and the correlations among them, provide extra knowledge for understanding its background and main contributions. Despite this promise, incorporating the citation graph effectively remains a major challenge, because it is difficult to accurately identify and leverage the content in references that is relevant to a source paper, and to model correlations of differing intensity. Existing methods either ignore references or use only their abstracts indiscriminately, and so fail to address this challenge. To fill the gap, we propose a novel citation-aware scientific paper summarization framework based on the citation graph, which can accurately locate and incorporate the salient content of references and capture the varying relevance between source papers and their references. Specifically, we first build a domain-specific dataset, PubMedCite, with about 192K biomedical scientific papers and a large citation graph preserving 917K citation relationships between them. The dataset preserves the salient content extracted from the full texts of references, together with the weighted correlation between that salient content and the source paper. Based on it, we design a self-supervised citation-aware summarization framework (CitationSum) with graph contrastive learning, which improves summary generation by efficiently fusing the salient information in references with the source paper's content under the guidance of their correlations. Experimental results show that our model outperforms state-of-the-art methods by efficiently leveraging the information in references and the citation correlations.
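To make the idea of a weighted citation graph concrete, the following is a minimal sketch of a data structure that stores, for each source paper, its references' salient content and a correlation weight per citation edge. It is not the PubMedCite construction or the CitationSum model: the names `Paper`, `CitationGraph`, `token_jaccard`, and `top_references` are hypothetical, and the token-overlap (Jaccard) weight is an assumption chosen only for illustration, since the abstract does not specify how correlations are computed.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


def token_jaccard(a: str, b: str) -> float:
    """Illustrative correlation: Jaccard overlap of lowercase whitespace tokens.

    Assumption for this sketch only; the abstract does not define the weighting.
    """
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


@dataclass
class Paper:
    paper_id: str
    text: str  # source paper content, or salient content kept for a reference


@dataclass
class CitationGraph:
    papers: Dict[str, Paper] = field(default_factory=dict)
    # edges[source_id][reference_id] = correlation weight in [0, 1]
    edges: Dict[str, Dict[str, float]] = field(default_factory=dict)

    def add_paper(self, paper: Paper) -> None:
        self.papers[paper.paper_id] = paper
        self.edges.setdefault(paper.paper_id, {})

    def add_citation(self, source_id: str, reference_id: str) -> None:
        """Record that source_id cites reference_id, weighted by text overlap."""
        weight = token_jaccard(
            self.papers[source_id].text, self.papers[reference_id].text
        )
        self.edges[source_id][reference_id] = weight

    def top_references(self, source_id: str, k: int) -> List[Tuple[str, float]]:
        """Return the k references most correlated with the source paper."""
        ranked = sorted(
            self.edges[source_id].items(), key=lambda item: item[1], reverse=True
        )
        return ranked[:k]
```

A summarizer could use something like `top_references` to decide which reference content to fuse with the source paper, giving more weight to strongly correlated references. A learned correlation would replace the token-overlap stand-in in practice.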