The citation graph is essential for generating high-quality summaries of scientific papers: a paper's references and their correlations provide extra knowledge for understanding its background and main contributions. Despite this promise, effectively incorporating the citation graph remains a major challenge, because it is difficult to accurately identify and leverage the content in references that is relevant to a source paper, and to model correlations of varying strength between them. Existing methods either ignore references or indiscriminately use only their abstracts, and thus fail to address this challenge. To fill the gap, we propose a novel citation-aware scientific paper summarization framework based on the citation graph, which accurately locates and incorporates the salient contents of references and captures the varying relevance between source papers and their references. Specifically, we first build a domain-specific dataset, PubMedCite, with about 192K biomedical scientific papers and a large citation graph preserving 917K citation relationships between them. The dataset preserves the salient contents extracted from the full texts of references, together with the weighted correlations between those salient contents and the source paper. Based on it, we design a self-supervised citation-aware summarization framework (CitationSum) with graph contrastive learning, which improves summary generation by efficiently fusing the salient information in references with the source paper's content under the guidance of their correlations. Experimental results show that our model outperforms state-of-the-art methods by efficiently leveraging the information in references and citation correlations.
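To make the idea of a citation graph with weighted correlations concrete, the following is a minimal illustrative sketch and not the PubMedCite construction or the CitationSum model. All names (`Paper`, `CitationGraph`, `top_references`, `jaccard`) are hypothetical, and the edge weight is assumed to be a simple token-overlap (Jaccard) score standing in for whatever correlation measure the dataset actually uses.

```python
from dataclasses import dataclass, field


@dataclass
class Paper:
    paper_id: str
    text: str  # source paper content, or salient content when used as a reference


def jaccard(a: str, b: str) -> float:
    """Token-overlap score in [0, 1]; an assumed stand-in for the real correlation."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0


@dataclass
class CitationGraph:
    papers: dict[str, Paper] = field(default_factory=dict)
    # edges[source_id][reference_id] = correlation weight in [0, 1]
    edges: dict[str, dict[str, float]] = field(default_factory=dict)

    def add_paper(self, paper: Paper) -> None:
        self.papers[paper.paper_id] = paper

    def add_citation(self, source_id: str, reference_id: str) -> None:
        # Weight the citation edge by how much the two texts overlap.
        weight = jaccard(self.papers[source_id].text, self.papers[reference_id].text)
        self.edges.setdefault(source_id, {})[reference_id] = weight

    def top_references(self, source_id: str, k: int) -> list[tuple[str, float]]:
        # Return the k references most strongly correlated with the source paper.
        refs = self.edges.get(source_id, {})
        return sorted(refs.items(), key=lambda item: item[1], reverse=True)[:k]


graph = CitationGraph()
graph.add_paper(Paper("src", "statin therapy reduces cardiac risk"))
graph.add_paper(Paper("r1", "statin therapy reduces stroke risk"))
graph.add_paper(Paper("r2", "deep learning for image segmentation"))
graph.add_citation("src", "r1")
graph.add_citation("src", "r2")
top = graph.top_references("src", k=1)
```

In this toy setting, a summarizer could attend more strongly to the references returned by `top_references`, which mirrors the abstract's point that references should not be treated indiscriminately.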