Subgraph-based graph representation learning (SGRL) has recently emerged as a powerful tool for many prediction tasks on graphs, owing to its advantages in model expressiveness and generalization ability. Most previous SGRL models face computational issues caused by the high cost of extracting a subgraph for each training or testing query. Recently, SUREL was proposed as a framework to accelerate SGRL: it samples random walks offline and joins these walks into subgraphs online for prediction. Because sampled walks can be reused across different queries, SUREL achieves state-of-the-art performance in both scalability and prediction accuracy. However, SUREL still suffers from high computational overhead caused by node redundancy in the sampled walks. In this work, we propose a novel framework, SUREL+, that upgrades SUREL by using node sets instead of walks to represent subgraphs. This set-based representation avoids node duplication by definition, but the sizes of node sets can be irregular. To address this issue, we design a dedicated sparse data structure to store and index node sets efficiently, and provide a specialized operator to join them in parallel batches. SUREL+ is modularized to support multiple types of set samplers, structural features, and neural encoders, which compensate for the structural information lost in the reduction from walks to sets. Extensive experiments validate SUREL+ on the prediction of links, relation types, and higher-order patterns. SUREL+ achieves 3-11$\times$ speedups over SUREL while maintaining comparable or even better prediction performance; compared to other SGRL baselines, SUREL+ achieves $\sim$20$\times$ speedups and significantly improves prediction accuracy.
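To make the set-based idea concrete, the following is a minimal sketch, not the SUREL+ implementation. It assumes a CSR-like layout (a flat `indices` array plus an `indptr` offset array) as one plausible way to store node sets of irregular size, and a plain set union as the join for a query. All names here (`walks_to_set`, `NodeSetStore`, `join`) are hypothetical and chosen only for illustration; the actual system also handles structural features and parallel batching, which this sketch omits.

```python
from typing import Dict, List, Tuple


def walks_to_set(walks: List[List[int]]) -> List[int]:
    """Collapse walks sampled from one seed into a sorted, duplicate-free node set."""
    return sorted({node for walk in walks for node in walk})


class NodeSetStore:
    """Hypothetical CSR-like store for node sets of irregular size.

    All sets are concatenated in `indices`; the set for row i occupies
    indices[indptr[i]:indptr[i + 1]], so no padding is needed.
    """

    def __init__(self, node_sets: Dict[int, List[int]]):
        self.row_of: Dict[int, int] = {}   # seed node -> row number
        self.indptr: List[int] = [0]       # row offsets into `indices`
        self.indices: List[int] = []       # concatenated set members
        for row, (seed, members) in enumerate(node_sets.items()):
            self.row_of[seed] = row
            self.indices.extend(members)
            self.indptr.append(len(self.indices))

    def get(self, seed: int) -> List[int]:
        """Return the stored node set of `seed` by slicing the flat array."""
        row = self.row_of[seed]
        return self.indices[self.indptr[row]:self.indptr[row + 1]]


def join(store: NodeSetStore, query: Tuple[int, ...]) -> List[int]:
    """Join the node sets of all seeds in a query into one query-level node set."""
    merged = set()
    for seed in query:
        merged.update(store.get(seed))
    return sorted(merged)


# Toy data (made up for illustration): two walks per seed, with repeated nodes.
walks = {
    0: [[0, 1, 2], [0, 1, 3]],
    4: [[4, 3, 1], [4, 5, 3]],
}
# Offline step: reduce walks to sets once, then reuse them across queries.
store = NodeSetStore({seed: walks_to_set(w) for seed, w in walks.items()})
# Online step: a link query (0, 4) joins the two stored sets.
subgraph_nodes = join(store, (0, 4))
```

In this toy case the walks from seed 0 visit six node positions but only four distinct nodes, which is the kind of redundancy the set representation removes. The offset array is what lets sets of different sizes sit in one contiguous buffer.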