GraphNetz: Statistical Benchmarking of Graph Neural Networks with Paired Tests and Rank Aggregation

Graph neural network (GNN) benchmarks often report single point estimates, even when performance differences are small relative to the variation across random seeds, train/test splits, and datasets. Confidence intervals, paired comparisons, multiple-comparison correction, and rank-based aggregation are standard statistical tools, but they are rarely the default output of graph-learning benchmark suites. We introduce GraphNetz, a benchmarking framework whose default output is a structured statistical report rather than a raw accuracy table. GraphNetz currently includes 63 dataset loaders, four task types, and five canonical GNN architectures, and it also supports custom datasets and models. The framework standardizes multi-seed evaluation and automatically returns per-cell confidence intervals, Holm-corrected paired tests, and Friedman-Nemenyi critical-difference diagrams across tasks. In a cross-category benchmark over ten heterogeneous tasks, the apparent rank differences among four canonical node-level encoders fall within a single Nemenyi clique, indicating that no encoder is significantly better than the others at $\alpha = 0.05$. GraphNetz therefore gives researchers a reproducible computational and statistical pipeline for benchmarking new graph-learning methods against standard architectures across tasks and application domains, with statistical evidence that accounts for seed uncertainty. Its reports are designed to be included directly in papers, supporting reproducible and honest model comparison in the graph-learning community.
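To make the statistical steps named above concrete, the following is a minimal, standard-library-only sketch of Holm correction, per-task rank averaging, and the Nemenyi critical difference. It is not the GraphNetz API: the function names (`holm_adjust`, `average_ranks`, `nemenyi_cd`) and the input layout are assumptions chosen for illustration, and the Friedman omnibus test that normally precedes the Nemenyi comparison is omitted for brevity. The critical values are the standard $\alpha = 0.05$ entries tabulated by Demšar (2006).

```python
import math

# Two-tailed Nemenyi critical values q_alpha at alpha = 0.05, keyed by the
# number of compared models k (standard table from Demsar, 2006).
Q_ALPHA_005 = {2: 1.960, 3: 2.343, 4: 2.569, 5: 2.728}


def holm_adjust(p_values):
    """Return Holm step-down adjusted p-values, in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The i-th smallest p-value is multiplied by (m - i), capped at 1,
        # and forced to be non-decreasing along the sorted order.
        candidate = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


def average_ranks(scores):
    """Mean rank per model, where scores[t][j] is model j's score on task t.

    Higher scores are better, rank 1 is best, and tied models share the
    mean of the ranks they span.
    """
    n_tasks, k = len(scores), len(scores[0])
    totals = [0.0] * k
    for row in scores:
        order = sorted(range(k), key=lambda j: -row[j])
        pos = 0
        while pos < k:
            end = pos
            while end + 1 < k and row[order[end + 1]] == row[order[pos]]:
                end += 1
            shared = (pos + end) / 2 + 1  # mean of 1-based ranks pos+1..end+1
            for j in order[pos:end + 1]:
                totals[j] += shared
            pos = end + 1
    return [t / n_tasks for t in totals]


def nemenyi_cd(k, n_tasks):
    """Critical difference in mean rank for k models over n_tasks tasks."""
    return Q_ALPHA_005[k] * math.sqrt(k * (k + 1) / (6.0 * n_tasks))
```

Under these assumptions, `nemenyi_cd(4, 10)` is roughly 1.48 rank units for four models compared over ten tasks, so two models would be separated by the Nemenyi test only if their mean ranks differ by more than that. Models whose mean ranks all lie within one critical difference of each other form a single clique.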
