IGB: Addressing The Gaps In Labeling, Features, Heterogeneity, and Size of Public Graph Datasets for Deep Learning Research

Graph neural networks (GNNs) have shown high potential for a variety of challenging real-world applications, but a major obstacle in GNN research is the lack of large-scale, flexible datasets. Most existing public datasets for GNNs are relatively small, which limits the ability of GNNs to generalize to unseen data. The few existing large-scale graph datasets provide very limited labeled data. This makes it difficult to determine whether a GNN model's low accuracy on unseen data is inherently due to insufficient training data or to the model's failure to generalize. Additionally, datasets used to train GNNs need to offer flexibility to enable a thorough study of the impact of various factors during training. In this work, we introduce the Illinois Graph Benchmark (IGB), a research dataset tool that developers can use to train, scrutinize, and systematically evaluate GNN models with high fidelity. IGB includes both homogeneous and heterogeneous academic graphs of enormous sizes, with more than 40% of their nodes labeled. Compared to the largest graph datasets publicly available, IGB provides over 162X more labeled data for deep learning practitioners and developers to create and evaluate models with higher accuracy. The IGB dataset is a collection of academic graphs designed to be flexible, enabling the study of various GNN architectures and embedding generation techniques and the analysis of system performance issues for node classification tasks. IGB is open-sourced, supports the DGL and PyG frameworks, and comes with releases of the raw text that we believe will foster emerging language model and GNN research projects. An early public version of IGB is available at https://github.com/IllinoisGraphBenchmark/IGB-Datasets.
