Graph neural network (GNN) benchmarks often report single point estimates, even when performance differences are small relative to the variation across random seeds, train/test splits, and datasets. Confidence intervals, paired comparisons, multiple-comparison correction, and rank-based aggregation are standard statistical tools, but they are rarely the default output of graph-learning benchmark suites. We introduce GraphNetz, a benchmarking framework whose default output is a structured statistical report rather than a raw accuracy table. GraphNetz currently includes 63 dataset loaders, four task types, and five canonical GNN architectures, and it also supports custom datasets and models. The framework standardizes multi-seed evaluation and automatically returns per-cell confidence intervals, Holm-corrected paired tests, and Friedman-Nemenyi critical-difference diagrams across tasks. In a cross-category benchmark over ten heterogeneous tasks, the apparent rank differences among four canonical node-level encoders fall within a single Nemenyi clique, indicating that no encoder is significantly better than the others at $\alpha = 0.05$. GraphNetz therefore gives researchers a reproducible computational and statistical pipeline for benchmarking new graph-learning methods against standard architectures across tasks and a wide range of applications, with principled statistical evidence that accounts for seed uncertainty. The resulting reports are intended to serve the graph-learning community as reproducible and honest model comparisons that can be included directly in papers.
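To make the statistical vocabulary above concrete, the following is a minimal, standard-library-only sketch of two of the named ingredients: Holm step-down adjustment of paired-test p-values, and the Nemenyi critical difference applied to mean ranks across tasks. It is not the GraphNetz API; the function names (`holm_adjust`, `average_ranks`, `nemenyi_cd`, `single_clique`) are hypothetical, and the critical values are the commonly tabulated $q_{0.05}$ constants for the Nemenyi test (as given, for example, in Demšar, 2006).

```python
import math

def holm_adjust(p_values):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The (rank+1)-th smallest p-value is multiplied by (m - rank).
        candidate = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, candidate)  # keep adjusted values monotone
        adjusted[idx] = running_max
    return adjusted

def average_ranks(scores):
    """scores[task][model] -> mean rank per model (rank 1 = best, ties averaged)."""
    n_models = len(scores[0])
    totals = [0.0] * n_models
    for row in scores:
        order = sorted(range(n_models), key=lambda j: -row[j])
        i = 0
        while i < n_models:
            j = i
            while j + 1 < n_models and row[order[j + 1]] == row[order[i]]:
                j += 1
            tied_rank = (i + j) / 2 + 1  # average of 1-based positions i+1..j+1
            for t in range(i, j + 1):
                totals[order[t]] += tied_rank
            i = j + 1
    return [t / len(scores) for t in totals]

# Tabulated q_0.05 critical values for the Nemenyi test, keyed by number of models.
Q_ALPHA_005 = {2: 1.960, 3: 2.343, 4: 2.569, 5: 2.728}

def nemenyi_cd(n_models, n_tasks):
    """Critical difference in mean rank at alpha = 0.05."""
    return Q_ALPHA_005[n_models] * math.sqrt(n_models * (n_models + 1) / (6.0 * n_tasks))

def single_clique(mean_ranks, cd):
    """True if every pair of models differs by at most cd in mean rank."""
    return max(mean_ranks) - min(mean_ranks) <= cd
```

Under these assumptions, four models compared over ten tasks give `nemenyi_cd(4, 10)` of roughly 1.48, so mean ranks would have to differ by more than about one and a half positions before any pair is declared significantly different. When `single_clique` returns `True`, all models fall into one clique, which is the situation the abstract describes.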