Large-scale data sets on scholarly publications are the basis for a variety of bibliometric analyses and natural language processing (NLP) applications. Data sets derived from publications' full text in particular have recently gained attention. While several such data sets already exist, we see key shortcomings in terms of their domain and time coverage, citation network completeness, and representation of full-text content. To address these points, we propose a new version of the data set unarXive. We base our data processing pipeline and output format on two existing data sets, and improve on each of them. Our resulting data set comprises 1.9 M publications spanning multiple disciplines and 32 years. It furthermore has a more complete citation network than its predecessors and retains a richer representation of document structure as well as non-textual publication content such as mathematical notation. In addition to the data set, we provide ready-to-use training/test data for citation recommendation and IMRaD classification. All data and source code are publicly available at https://github.com/IllDepence/unarXive.