The OpenAIRE graph contains a large citation network, with over 200 million publications and over 2 billion citations. The current graph is distributed as a dump with metadata that totals $\sim$2.5 TB when uncompressed, which makes it hard to process on conventional computers. To make this network more accessible to the community, we provide a processed version of the OpenAIRE graph that is reduced in size so that it fits in 16 GB of RAM while preserving the full graph structure. We also release the processed data in a very simple format that allows for straightforward further manipulation. In addition, we provide (1) a Python pipeline that can be used to process future releases of the OpenAIRE graph, and (2) a larger version of the dataset that includes more publication fields, such as the title and the list of authors.