IGB: Addressing The Gaps In Labeling, Features, Heterogeneity, and Size of Public Graph Datasets for Deep Learning Research

Graph neural networks (GNNs) have shown high potential for a variety of challenging real-world applications, but one of the major obstacles in GNN research is the lack of large-scale, flexible datasets. Most existing public datasets for GNNs are relatively small, which limits the ability of GNNs to generalize to unseen data. The few existing large-scale graph datasets provide very limited labeled data. This makes it difficult to determine whether a GNN model's low accuracy on unseen data stems from insufficient training data or from the model's failure to generalize. Additionally, datasets used to train GNNs need to be flexible enough to enable a thorough study of how various factors affect GNN training. In this work, we introduce the Illinois Graph Benchmark (IGB), a research dataset tool that developers can use to train, scrutinize, and systematically evaluate GNN models with high fidelity. IGB includes both homogeneous and heterogeneous graphs of enormous sizes, with more than 40% of their nodes labeled. Compared to the largest graph datasets publicly available, IGB provides over 162X more labeled data for deep learning practitioners and developers to create and evaluate models with higher accuracy. The IGB dataset is designed to be flexible, enabling the study of various GNN architectures and embedding generation techniques, as well as the analysis of system performance issues. IGB is open-sourced, supports the DGL and PyG frameworks, and comes with releases of the raw text, which we believe will foster emerging language model and GNN research projects. An early public version of IGB is available at https://github.com/IllinoisGraphBenchmark/IGB-Datasets.
