Tracing the source of research papers is a fundamental yet challenging task for researchers. The billion-scale citation relations between papers make it hard for researchers to understand the evolution of science efficiently. To date, the field still lacks an accurate and scalable dataset, constructed by professional researchers, that identifies the direct sources of the papers they study and on which automatic algorithms can be developed to expand the evolutionary knowledge of science. In this paper, we study the problem of paper source tracing (PST) and construct PST-Bench, a high-quality and ever-increasing dataset in computer science. Based on PST-Bench, we reveal several intriguing findings, such as differing evolution patterns across topics. An exploration of various methods underscores the difficulty of PST-Bench and pinpoints potential directions for this topic. The dataset and code are available at https://github.com/THUDM/paper-source-trace.