unarXive 2022: All arXiv Publications Pre-Processed for NLP, Including Structured Full-Text and Citation Network

Large-scale data sets on scholarly publications are the basis for a variety of bibliometric analyses and natural language processing (NLP) applications. Data sets derived from publications' full text, in particular, have recently gained attention. While several such data sets already exist, we see key shortcomings in terms of their domain and time coverage, citation network completeness, and representation of full-text content. To address these points, we propose a new version of the data set unarXive. We base our data processing pipeline and output format on two existing data sets, and improve on each of them. Our resulting data set comprises 1.9 M publications spanning multiple disciplines and 32 years. It furthermore has a more complete citation network than its predecessors and retains a richer representation of document structure as well as non-textual publication content such as mathematical notation. In addition to the data set, we provide ready-to-use training/test data for citation recommendation and IMRaD classification. All data and source code are publicly available at https://github.com/IllDepence/unarXive.

